2025-03-03

An electrical power outage triggered a cascade of errors that affected platform stability and customer service. Here is a detailed explanation of the issues we encountered and what we've learned.

On Monday, March 3rd, 2025, Clever Cloud's platform became unstable, leading to downtime of varying duration and scope for many customers. The impact affected both the API itself and some services running as add-ons or runtimes.

These events occurred because:

  • A power outage at one of our data center providers caused a temporary disruption of our services on the PAR6 Availability Zone (AZ) of EU-FR-1 at 14:55 CET
  • As an indirect consequence, network traffic was disrupted and the infrastructure experienced I/O and memory pressure between 14:58 and 16:30 CET, leading to instabilities and downtime.

Impacts were observed for customers relying on the EU-FR-1 (PAR) region, on remote zones depending on the EU-FR-1 control plane (GOV, OVHcloud, Scaleway, and Oracle), and on some private zones. On-premises zones weren't impacted.

Status updates were communicated through https://www.clevercloudstatus.com/incident/959

Product | Impact
Runtimes | Applications may have experienced performance degradation and traffic issues due to impacted hypervisors and load balancers
Runtime deployments | The incident prevented new deployments, upscaling and downscaling operations from succeeding because the Deployment API was unreachable
Databases (MySQL, PostgreSQL, Redis, MongoDB) | The incident may have caused performance degradation or unavailability for customers whose databases were running on affected hypervisors. No data loss occurred
Cluster services (Elastic, Couchbase, Cassandra, …) | Service may have been maintained on some nodes, depending on infrastructure pressure and the technology used. Some clusters were impacted during the incident, others were not
VPN / IPsec | Most VPN endpoints were unavailable during the incident
Observability | The service successfully recovered from the electrical issue, and logs remained available

Timeline

Time | Component | Description
14:55 CET | Infrastructure | Power outage at PAR6 (Green Data) data center
14:56 CET | Load Balancers | Monitoring reports the loss of a load balancer (first event)
14:56 CET | Infrastructure | Monitoring shows a global incident reporting many unavailabilities
14:57 CET | Infrastructure | Infrastructure team opens a crisis conference
14:58 CET | Infrastructure | Power is back, servers and network equipment are restarting (confirmed later by the data center incident report)
14:58 CET | Customers | Support team received notifications from external monitoring and customers that some websites are down (Console, admin, etc.)
14:58 CET | Customers | Tech team reported that GitLab is also unavailable
14:58 CET | Infrastructure | Tech team created an incident channel #incident-03-03-2025
15:00 CET | Infrastructure | Infrastructure team summoned the other on-call teams and members on the channel #incident-03-03-2025
15:02 CET | Infrastructure | Hypervisors and network equipment have rebooted
15:02 CET | Infrastructure | Several PAR6 hypervisors (12 of them) indicate that their RAID arrays are rebuilding
15:05 CET | Infrastructure | Report that AZ monitoring is under heavy load
15:06 CET | Infrastructure | Infrastructure team confirmed that every hypervisor on PAR6 had been recovered for at least 4 to 6 minutes
15:08 CET | Network | A network provider hosted in PAR6 reports two power line cuts
15:10 CET | Infrastructure | Report from one of our network providers confirming the electrical outage
15:14 CET | Messaging | Pulsar is back online
15:14 CET | Observability | Pulsar in-memory observability didn't restart properly
15:18 CET | Observability | Pulsar in-memory observability is online
15:19 CET | Infrastructure | A RabbitMQ node came back to the cluster after we fixed the communication between the hypervisor and the queue
15:30 CET | Infrastructure | Seeing a lot of WireGuard connectivity issues on multiple distributed clusters
15:30 CET | DBaaS | Listed all GitLab databases
15:33 CET | Infrastructure | Some hypervisors are overloaded, probably because Vector/Telegraf cannot push data to Warp10/Pulsar. Started to list all VMs with a CPU load > 1 to relieve the infrastructure
15:37 CET | Infrastructure | Checking whether Ceph/Cellar are okay. Some daemons seem to hang and 1/4 of OSDs are still down on PAR6; requests are served correctly
15:38 CET | DBaaS | Asked the DBaaS team to prioritize Clever Cloud's GitLab databases
15:38 CET | Infrastructure | Listed all applications with a VM CPU load > 1, starting to kill them to relieve the infrastructure
15:42 CET | Infrastructure | Fixed a network setup issue on a specific hypervisor
15:45 CET | Messaging | Listed all GitLab and Premium services databases for the DBaaS team
15:45 CET | Observability | A global maintenance is set on all monitoring servers in EU-FR-1 to avoid flooding the Orchestration core API with redeploy events
15:45 CET | DBaaS | Asked to reboot the PostgreSQL add-on provider in priority
15:49 CET | Deployments | 2.8k VMs were missing from the infrastructure
15:52 CET | Infrastructure | Some deployments are failing on PAR6 because unbound has not restarted on some hypervisors, leaving no DNS available
15:54 CET | Observability | Seeing timeouts in Vector logs when connecting to Pulsar
15:55 CET | Infrastructure | Unlocked the first hypervisors to allow database reboots (hypervisor iptables rules)
15:56 CET | Messaging | We hit the max connections threshold on Pulsar
15:56 CET | Infrastructure | Vector log timeouts fixed by corrective iptables rules on the hypervisor
15:59 CET | Infrastructure | All generated hypervisor iptables rules are restored on PAR6 hypervisors and services have been restarted
16:03 CET | Infrastructure | Connectivity issues due to a conntrack issue (need to purge the conntrack table)
16:05 CET | Customers | Some load balancers are unsynced due to the RabbitMQ crash
16:05 CET | Load Balancers | Start the rolling restart of unsynced load balancers
16:06 CET | Infrastructure | Some deployments are still failing on PAR6 due to WireGuard problems with the DNS recursor of the AZ; the recursor is dropped from unbound / resolv.conf on all PAR6 hypervisors
16:07 CET | Load Balancers | For some applications it's not a load balancer issue: the application has no instances running
16:08 CET | Infrastructure | Dropped UDP conntracks on the hypervisor of the PAR6 DNS recursor to recover WireGuard mesh topologies
16:19 CET | Messaging | WireGuard issues on the BookKeeper cluster for Pulsar
16:28 CET | Deployments | Purge delayed alerts on Zabbix monitoring
16:30 CET | Deployments | Restart all applications without any VMs
16:34 CET | Deployments | Reboot of Keycloak
16:42 CET | Customers | Couchbase nodes have been reintroduced into their cluster
16:45 CET | Infrastructure | Finished killing overloaded VMs
16:47 CET | Deployments | Stop instances that are blocked and have failed their deployments, and redeploy the application
16:47 CET | Deployments | Fix a Ruby error on the bastion that prevented applications from being redeployed
16:58 CET | Deployments | Stacking semaphores in the BDS orchestrator
16:58 CET | Infrastructure | Network issue on Warp10 FoundationDB
17:00 CET | Deployments | Disabled HA replication of some queues to speed up RabbitMQ
17:18 CET | Load Balancers | Restart all add-on providers to be able to properly reload the last load balancers
17:20 CET | Load Balancers | End of the rolling restart of all impacted load balancers at Clever Cloud
17:28 CET | Observability | Zabbix on PAR8 back in action
17:31 CET | Deployments | Still 1.3k VMs to deploy
17:35 CET | Infrastructure | Two FoundationDB VMs could not be started because their NVMe was unavailable
17:37 CET | Deployments | Increased concurrent deployments from 300 to 400 to speed up recovery
17:50 CET | Observability | Cleaning up old hosts in Paris monitoring to avoid sending too many outdated monitoring events
17:57 CET | Observability | Zabbix on PAR7 back in action
18:00 CET | Deployments | Started to restart all applications still seen as unavailable by monitoring
18:05 CET | Infrastructure | Restart unsynced Sydney orchestration
18:05 CET | Observability | Zabbix on PAR6 back in action
18:18 CET | Infrastructure | Force-stop all VMs in stopping status
18:45 CET | Observability | Remove the maintenance mode of monitoring
18:53 CET | Deployments | Deploy the last application in failing deployment state
18:54 CET | Infrastructure | Investigating under-replicated Kafka partitions on Warp10
19:11 CET | Deployments | Finished restarting all applications still seen as unavailable by Zabbix
19:12 CET | Infrastructure | Starting to recover the Ceph cluster: two physical machines don't have any network, one has a network configuration issue
19:16 CET | Infrastructure | Start the recovery of the Ceph cluster
19:23 CET | Core APIs | Restart of the Logs API
20:25 CET | Infrastructure | All Ceph machines are correctly restarted, the cluster is recovering 0.3% of degraded objects
20:48 CET | Infrastructure | Ceph cluster has recovered the degraded objects, HEALTH_OK

Analysis

Power Outage

We're still waiting for a detailed explanation from the data center operator, but a planned maintenance was scheduled the same day, which consisted of testing the UPS for a given time. We observed a 5-second window between the power cut and the power recovery.

This led to an instant restart of our entire Availability Zone (AZ). Unfortunately, we can't provide more insight into the origin of the failure for the moment, except that a faulty component caused the power cut. This is unacceptable, since we run at least 2N (mostly 2N+1) power feeds with separate power sources: a single faulty component shouldn't have any impact on both power lines.

We’re expecting a better postmortem from our data center operator.

Networking connectivity issue

Our network relies on multiple transit providers to ensure both good Internet connectivity and good resilience. These transits are spread across three Points of Presence (PoP). During the restart of the network routers, there appeared to be an issue with accessing the region via public connectivity, despite these multiple transits. Our observability system was unfortunately located in the lost AZ, so we lost the flows needed to analyze the traffic during the power cut. Known issues:

  • In some cases, when a router reboots, it may start receiving traffic before it knows where to route it. A specific start sequence may be enforced to avoid this behavior (see the sketch below).
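
As an illustration of such an enforced start sequence (conceptual only, not our actual router tooling), the idea is to withhold BGP announcements until the forwarding plane is ready. The readiness callables below are assumptions supplied by the caller:

```python
import time
from typing import Callable

def gated_boot(
    fib_ready: Callable[[], bool],          # e.g. poll that the forwarding table is programmed
    igp_converged: Callable[[], bool],      # e.g. check internal routing adjacencies are up
    announce_prefixes: Callable[[], None],  # only now start advertising prefixes via BGP
    timeout_s: float = 300.0,
    poll_s: float = 2.0,
) -> None:
    """Withhold BGP announcements until the router can actually forward traffic."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fib_ready() and igp_converged():
            announce_prefixes()
            return
        time.sleep(poll_s)
    raise TimeoutError("router did not become ready; announcements stay withheld")
```

The same ordering can usually be expressed directly in the router's startup configuration; the sketch only makes the dependency explicit.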

Possible Causes of Downtime

  • A transit partner did not notice that the session was down until the BGP session timed out (a few minutes later), so traffic continued to be routed through them even though we were no longer there to handle it. Another transit partner continued to announce our prefixes even though it couldn't route the traffic to our network.
  • There is a 3-minute timeout on the routers' BGP sessions, with no Bidirectional Forwarding Detection (BFD) to accelerate convergence (see the comparison below).
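
To give an order of magnitude, the sketch below compares the worst-case detection time of a dead peer with the 3-minute hold timer against BFD. The BFD interval and multiplier are common illustrative values, not our production configuration:

```python
# Worst-case failure detection time for a dead peer (illustrative values).

bgp_hold_time_s = 180  # the 3-minute hold timer mentioned above
print(f"BGP keepalives alone: up to {bgp_hold_time_s} s of blackholed traffic")

# BFD sends lightweight hello packets; the session is declared down after
# `detect_multiplier` consecutive misses. 300 ms x 3 is a common setting,
# used here purely as an assumption.
bfd_tx_interval_ms = 300
bfd_detect_multiplier = 3
bfd_detection_s = bfd_tx_interval_ms * bfd_detect_multiplier / 1000
print(f"With BFD: roughly {bfd_detection_s} s before the BGP session is torn down")

print(f"Speed-up: about {bgp_hold_time_s / bfd_detection_s:.0f}x faster detection")
```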

Unclear Areas

  • In the nominal situation, most of our traffic is gathered at the TH2 PoP, so we shouldn't have seen such a significant impact. However, most of our traffic exits through another transit partner via the affected AZ (this correlates with one of the possible causes of downtime above)
  • Did the other routers still hold announcements (received from the routers of the powered-off AZ) when they received traffic to send to the Internet, because the BGP sessions hadn't yet timed out? (this correlates with one of the possible causes of downtime above)

Network issues were resolved by 15:15 CET. We run a simulated copy of our network topology inside the GNS3 emulator; we will quickly replay this scenario to try to reproduce the issue and test some enhancements.

Messaging Maintenance (Apache Pulsar)

Learning from a previous outage last summer, we invested a lot in this layer; the enhancements focused on improving the resilience and fault tolerance of the system. This includes better monitoring, automated failover processes, auto-healing and enhanced data replication strategies with better isolation.

By leveraging Pulsar’s disaggregated architecture and the investments made in infrastructure and operational practices, the system was able to recover swiftly from the loss of an AZ. This resilience underscores the importance of continuous improvement and preparedness in maintaining high availability and reliability.

While the service wavered for a couple of minutes, reshuffling all connections onto the remaining AZs, we confirmed Pulsar's ability to recover quickly. Yet the proxy layer combined with the metadata management wasn't able to quickly accept a massive number of new connections (hundreds of thousands) in a very short time window.

We had already identified the metadata management problem, and over the past year we made many improvements, which led us to decide that we need a radical redesign for a more scalable solution. Unfortunately this work is not ready yet. We will also enhance connection management.
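
On the connection-management side, a common way to avoid hundreds of thousands of clients reconnecting in the same instant is capped, jittered exponential back-off. The sketch below is a generic Python illustration, assuming a `connect` callable supplied by the caller; it is not our agents' actual reconnection code:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def connect_with_backoff(
    connect: Callable[[], T],        # e.g. open a connection to the messaging proxy
    max_attempts: int = 8,
    base_delay_s: float = 0.5,
    max_delay_s: float = 60.0,
) -> T:
    """Retry a connection with capped, jittered exponential back-off so that
    a huge fleet of clients does not reconnect in the same instant."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except OSError:  # network-level failures; adjust to the client library's exceptions
            if attempt == max_attempts - 1:
                raise
            # Full jitter: spread retries uniformly over the back-off window.
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
    raise RuntimeError("unreachable")
```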

Hypervisor IO/CPU pressure, images

The difficulty in accepting a massive number of new connections on the Apache Pulsar messaging layer led telemetry agents to buffer data on the Virtual Machines (VMs) running on the hypervisors (servers). Customer VMs embed an observability agent named Vector, which we chose for its efficiency and safety. We anticipated that this agent buffering would induce I/O pressure on the hypervisors.

This overload had an impact on small VMs that reached their memory limit, which triggered swapping and then a lot of pressure on the hypervisor. Over the last months we changed our base image format to enhance the testability of images. Guard rules have been developed to ensure better process isolation using cgroups inside these new images, preventing small VMs from stressing memory. Unfortunately, these changes were still being evaluated to ensure the stability of the design and hadn't made it to production at the time of the incident.
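
For readers unfamiliar with the mechanism, such guard rules can rely on the standard cgroup v2 interface. The following minimal Python sketch shows the general idea of capping a process group's memory so it is reclaimed or OOM-killed before it can push the whole VM into swap; the group name, limits and PID are placeholders, not our production values:

```python
from pathlib import Path

# Standard cgroup v2 mount point on most Linux systems.
CGROUP_ROOT = Path("/sys/fs/cgroup")

def cap_memory(group: str, pid: int, high: str = "192M", maximum: str = "256M") -> None:
    """Place `pid` in a dedicated cgroup with a soft (memory.high) and hard
    (memory.max) memory limit, so a misbehaving process is throttled or killed
    instead of dragging the whole VM into swap."""
    cg = CGROUP_ROOT / group
    cg.mkdir(parents=True, exist_ok=True)
    (cg / "memory.high").write_text(high)    # kernel starts reclaiming above this
    (cg / "memory.max").write_text(maximum)  # hard cap: OOM kill inside the group
    (cg / "cgroup.procs").write_text(str(pid))

# Example (requires root and cgroup v2): cap_memory("telemetry-agent", agent_pid)
```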

As a consequence, these overloaded VMs were seen as unavailable by the monitoring and then rescheduled, which caused a chicken-and-egg situation: we needed the scheduler to perform massive actions and honor deployments, but the control plane was performing sub-optimally and the infrastructure struggled to provide available resources.

Also, a few hours before the incident, we had deployed multiple enhanced images, with new tooling but less available space. This led to deployment errors for some customers. We rolled back some images, either globally or for specific applications, and will fix this issue before the next image update.

Orchestration

Initially, we implemented an orchestration system based on an offering model (similar to Apache Mesos), which provided a solid foundation for managing our services. In an offering model, an orchestrator manages the allocation and deallocation of resources, such as virtual machines (VMs), to ensure efficient utilization and meet service demands.
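
For readers unfamiliar with the model, here is a deliberately tiny Python sketch of what offer-based placement looks like conceptually; the data shapes and first-fit strategy are illustrative assumptions, not our orchestrator's actual design:

```python
from dataclasses import dataclass

@dataclass
class Offer:
    """Spare capacity advertised by one hypervisor."""
    hypervisor: str
    cpus: int
    memory_mb: int

def schedule(pending_vms, offers):
    """Toy offer loop: place each pending VM onto the first offer that fits,
    decrementing the offer's remaining capacity as placements are made."""
    placements = []
    for vm_name, cpus, memory_mb in pending_vms:
        for offer in offers:
            if offer.cpus >= cpus and offer.memory_mb >= memory_mb:
                placements.append((vm_name, offer.hypervisor))
                offer.cpus -= cpus
                offer.memory_mb -= memory_mb
                break
    return placements
```

The point is that placement decisions are driven by offers of spare capacity, and it is this assignment step that the semaphore-based locking discussed below serializes.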

However, the orchestrator can become stuck due to semaphore locking issues during VM assignment for several reasons:

  • Resource Contention: Semaphores are used to control access to shared resources, ensuring that only a limited number of processes can access them simultaneously. If multiple VMs are attempting to acquire the same resource concurrently, contention can occur, leading to semaphore locks. This can cause the orchestrator to wait indefinitely if the locks are not released promptly.
  • Deadlocks: A deadlock situation arises when two or more processes are each holding a resource and waiting for the other to release a resource they need. In the context of VM assignment, if the orchestrator and other processes are waiting for each other to release semaphores, it can result in a deadlock, causing the orchestrator to become stuck.
  • Improper Semaphore Management: If semaphores are not properly initialized, acquired, or released, it can lead to locking issues. For instance, if a semaphore is not released after a VM assignment operation completes, other processes waiting for that semaphore will remain blocked, including the orchestrator.
  • High Load and Scalability Issues: Under high load conditions, the orchestrator may struggle to manage the increased number of VM assignment requests. If semaphores are used to serialize access to critical sections of code, excessive locking can occur, leading to performance bottlenecks and causing the orchestrator to become unresponsive.
  • Timeouts and Retries: If the orchestrator encounters timeouts while waiting for semaphores to be released, it may repeatedly retry the operation. Without proper back-off strategies, this can exacerbate contention and further delay VM assignments, leading to a stuck state.

To mitigate these issues, it’s essential to implement robust semaphore management practices, monitor resource usage, and employ strategies to detect and resolve deadlocks. Additionally, optimizing the orchestrator’s algorithms for scalability and efficiency can help prevent semaphore locking problems.
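
To make the last two points more concrete, here is a minimal Python sketch, assuming a simple thread-based orchestrator rather than our actual implementation, of acquiring an assignment slot with a timeout and jittered back-off instead of blocking indefinitely:

```python
import random
import threading
import time

# A limited number of concurrent VM assignments, mirroring the "locking
# variable" mentioned below (the value 300 is illustrative).
assignment_slots = threading.BoundedSemaphore(300)

def assign_vm(do_assignment, max_attempts: int = 5, timeout_s: float = 2.0):
    """Try to take an assignment slot; on contention, back off with jitter
    instead of waiting forever, so the orchestrator never wedges on one lock."""
    for attempt in range(max_attempts):
        if assignment_slots.acquire(timeout=timeout_s):
            try:
                return do_assignment()
            finally:
                # Always release, even on failure, to avoid leaking the slot
                # (the "improper semaphore management" failure mode above).
                assignment_slots.release()
        # Exponential back-off with jitter to avoid synchronized retries.
        time.sleep(random.uniform(0, 0.1 * 2 ** attempt))
    raise TimeoutError("could not acquire an assignment slot; deferring the VM")
```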

The problem is that, while this locking mechanism provides the best determinism in an offering model, it can also lower the rate of VM instance creation. During the recovery process, we had to continuously adjust this locking variable to allow more services to be queued for scheduling.

However, recognizing the need for enhanced capabilities and a broader range of product offerings, we have been transitioning over the past few months to a new orchestration model. This shift is designed to offer more performance, more resilience, advanced features and support for a wider diversity of products. The new orchestration is being rapidly implemented as a new backend, and we are making swift progress integrating it into our existing infrastructure. This upgrade will not only improve our operational efficiency but also enable us to deliver more innovative solutions to our clients. It is already in production for the Logs Drains, which behaved as expected during the incident.

Actions

Databases

  • #63 Enhance replication monitoring for MySQL databases
  • #290 Enhance all databases with replication API so that more customers can add a replication database for more resiliency

Runtimes

  • #856 Enforce guard I/O rules at Hypervisor level
  • #857 Enhance observability for small profile VM memory pressure

Networking

  • #832 Ensure OSPF/BGP convergence times on routers
  • #898 Enhance boot sequence process on routers
  • #899 Ensure good resiliency of border routing BGP announcements

Orchestration

  • #900 Ensure a failsafe VM restart capability
  • #901 Enhance hypervisor sequence of start process
  • #902 Ensure a better network configuration compliance across upgrades
  • #864 Better observability for unresponsive VM
  • #903 Improve AZ monitoring resiliency

Conclusion

Our response to this outage demonstrated significant improvements compared to previous incidents. The team’s actions and coordinated efforts mitigated the impact and expedited the resolution process. Most of the challenges we encountered were related to ongoing projects and enhancements that are nearing completion. These issues have provided valuable insights that will further strengthen our systems and processes.

We are confident that the measures we are implementing will prevent similar occurrences in the future and enhance our overall reliability and performance. We're sincerely sorry for the inconvenience caused to our customers, and we're committed to providing the best quality of service. We plan to organize a meeting with every customer who requests one, to explain both the incident and our actions.
