2025-03-03
An electrical power outage triggered a cascade of errors that affected platform stability and customer service. Here is a detailed explanation of the issues we encountered and what we've learned.
On Monday, March 3rd, 2025, Clever Cloud's platform became unstable, leading to downtime of varying duration and scope for many customers. The impact affected both the API itself and some services running as add-ons or runtimes.
These events occurred because:
- A power outage at one of our data center providers caused a temporary disruption of our services in the PAR6 Availability Zone (AZ) of EU-FR-1 at 14:55 CET
- As an indirect consequence, network traffic was disrupted and the infrastructure experienced I/O and memory pressure between 14:58 and 16:30 CET, leading to instability and downtime.
Impacts were observed for customers relying on the EU-FR-1 (PAR) region, on remote zones depending on the EU-FR-1 control plane (GOV, OVHcloud, Scaleway, and Oracle), and on some private zones. On-premise zones weren't impacted.
Status updates were communicated through https://www.clevercloudstatus.com/incident/959
Product | Impact |
---|---|
Runtimes | Applications may have experienced performance degradation and traffic issues due to impacted hypervisors and load balancers |
Runtimes deployment | The incident prevented new deployments, upscaling and downscaling operations from succeeding due to the Deployment API being unreachable |
Databases (MySQL, PostgreSQL, Redis, MongoDB) | The incident may have caused performance degradation or unavailability for customers whose databases were running on affected hypervisors. Data loss did not occur |
Cluster services (Elastic, Couchbase, Cassandra, …) | The service may have been maintained on some nodes, depending on the infrastructure pressure and the technology used. While some clusters were impacted during the incident, others were not |
VPN / IPSec | Most of the VPN endpoints were unavailable during the incident |
Observability | The service successfully recovered from the electrical issue, and logs remained available |
Timeline
Time | Component | Description |
---|---|---|
14:55 CET | Infrastructure | Power outage at PAR6 (Green Data) data center |
14:56 CET | Load Balancers | Monitoring reports the loss of a load balancer (first event) |
14:56 CET | Infrastructure | Monitoring shows a global incident with many services reported as unavailable |
14:57 CET | Infrastructure | Infrastructure team opens a crisis conference |
14:58 CET | Infrastructure | Power is back, servers and network equipment are restarting (confirmed later by data center incident report) |
14:58 CET | Customers | Support team received notification from external monitoring and customers that some websites are down (Console, admin, etc.) |
14:58 CET | Customers | Tech team reported that GitLab is also unavailable |
14:58 CET | Infrastructure | Tech team created an incident channel #incident-03-03-2025 |
15:00 CET | Infrastructure | Infrastructure team summoned the other on-call teams and members to the #incident-03-03-2025 channel |
15:02 CET | Infrastructure | Hypervisors and network equipment have rebooted |
15:02 CET | Infrastructure | Several PAR6 hypervisors (12 of them) are indicating that their RAID arrays are rebuilding |
15:05 CET | Infrastructure | Reports that AZ Monitoring is under heavy load |
15:06 CET | Infrastructure | Infrastructure team confirmed that every hypervisor on PAR6 had been back up for at least 4 to 6 minutes |
15:08 CET | Network | A network provider hosted in PAR6 reports two power line cuts |
15:10 CET | Infrastructure | Report from one of our network providers confirming the electrical outage |
15:14 CET | Messaging | Pulsar is back online |
15:14 CET | Observability | Pulsar in-memory observability didn’t restart properly |
15:18 CET | Observability | Pulsar in-memory observability is online |
15:19 CET | Infrastructure | A RabbitMQ node came back to the cluster after we fixed the communication between the hypervisor and the queue |
15:30 CET | Infrastructure | Seeing a lot of WireGuard connectivity issues on multiple distributed clusters |
15:30 CET | DBaaS | Listed all GitLab databases |
15:33 CET | Infrastructure | Some hypervisors are overloaded, probably due to Vector/Telegraf not being able to push data to Warp10/Pulsar. Started listing all VMs with a CPU load > 1 to relieve the infrastructure |
15:37 CET | Infrastructure | Checked whether Ceph/Cellar are okay. Some daemons seem to hang and 1/4 of the OSDs are still down on PAR6, but requests are served correctly |
15:38 CET | DBaaS | Asked DBaaS team to prioritize Clever Cloud's GitLab databases |
15:38 CET | Infrastructure | Listed all applications with a VM CPU load > 1, starting to kill them to relieve the infrastructure |
15:42 CET | Infrastructure | Fixed a network setup issue on a specific Hypervisor |
15:45 CET | Messaging | Listed all GitLab and Premium services databases for DBaaS team |
15:45 CET | Observability | A global maintenance is set on all monitoring servers in EU-FR-1 to avoid flooding the Orchestration core API with redeploy events |
15:45 CET | DBaaS | Asked to reboot PostgreSQL add-on provider in priority |
15:49 CET | Deployments | 2.8k VMs are still missing from the infrastructure |
15:52 CET | Infrastructure | Some deployments are failing on PAR6 because unbound had not restarted on some hypervisors, leaving no DNS available |
15:54 CET | Observability | Seeing timeouts in Vector logs when connecting to Pulsar |
15:55 CET | Infrastructure | Unlock the first hypervisors to allow database reboots (hypervisor iptables rules) |
15:56 CET | Messaging | We hit the max connections threshold on Pulsar |
15:56 CET | Infrastructure | Vector log timeouts fixed by correcting iptables rules on the hypervisor |
15:59 CET | Infrastructure | All generated iptables rules are restored on PAR6 hypervisors and services have been restarted |
16:03 CET | Infrastructure | Connectivity issues due to conntrack issue (need to purge conntrack table) |
16:05 CET | Customers | Some load balancers are unsynced due to RabbitMQ crash |
16:05 CET | Load Balancers | Start the rolling restart of unsynced load balancers |
16:06 CET | Infrastructure | Some deployments are still failing on PAR6 due to WireGuard problems with the DNS recursor of the AZ, the recursor is dropped from unbound / resolv.conf of all PAR6 hypervisors |
16:07 CET | Load Balancers | For some applications it's not a load balancer issue: the application has no instances running |
16:08 CET | Infrastructure | Dropped UDP conntracks on the hypervisor of PAR6 DNS recursor to recover WireGuard mesh topologies |
16:19 CET | Messaging | WireGuard issues on BookKeeper cluster for Pulsar |
16:28 CET | Deployments | Purge delayed alerts on Zabbix monitoring |
16:30 CET | Deployments | Restart all applications without any VMs |
16:34 CET | Deployments | Reboot of Keycloak |
16:42 CET | Customers | Couchbase nodes have been reintroduced into their cluster |
16:45 CET | Infrastructure | Finished killing overloaded VMs |
16:47 CET | Deployments | Stop instances that are blocked or have failed their deployments, and redeploy the applications |
16:47 CET | Deployments | Fix a Ruby error on the bastion that prevented applications from being redeployed |
16:58 CET | Deployments | Semaphores are stacking up in the BDS Orchestrator |
16:58 CET | Infrastructure | Network issue on Warp10 FoundationDB |
17:00 CET | Deployments | Disabled HA replication of some queues to speed up RabbitMQ |
17:18 CET | Load Balancers | Restart all add-on providers to be able to properly reload the last load balancers |
17:20 CET | Load Balancers | End of the rolling restart of all impacted load balancers at Clever Cloud |
17:28 CET | Observability | Zabbix on PAR8 back in action |
17:31 CET | Deployments | Still 1.3k VMs to deploy |
17:35 CET | Infrastructure | Two FoundationDB VMs could not be started because their NVMe was unavailable |
17:37 CET | Deployments | Increased concurrent deployments from 300 to 400 to increase recovery speed |
17:50 CET | Observability | Cleaning up old hosts in Paris Monitoring to avoid sending too many outdated monitoring events |
17:57 CET | Observability | Zabbix on PAR7 back in action |
18:00 CET | Deployments | Started to restart all applications still seen as unavailable by Monitoring |
18:05 CET | Infrastructure | Restart the unsynced Sydney orchestration services |
18:05 CET | Observability | Zabbix on PAR6 back in action |
18:18 CET | Infrastructure | Force-stop all VMs in stopping status |
18:45 CET | Observability | Remove the maintenance mode on Monitoring |
18:53 CET | Deployments | Deploy the last application in a failed deployment state |
18:54 CET | Infrastructure | Investigating under-replicated Kafka partitions on Warp10 |
19:11 CET | Deployments | Finished restarting all applications still seen as unavailable by Zabbix |
19:12 CET | Infrastructure | Starting to recover the Ceph cluster, 2 physical machines don’t have any network, one has a network configuration issue |
19:16 CET | Infrastructure | Start the recovery of Ceph cluster |
19:23 CET | Core APIs | Restart of Logs API |
20:25 CET | Infrastructure | All Ceph machines are correctly restarted, cluster is recovering 0.3% of degraded objects |
20:48 CET | Infrastructure | Ceph cluster has recovered degraded objects, HEALTH_OK |
Analysis
Power Outage
We're still waiting for a detailed explanation from the data center operator, but a planned maintenance was scheduled for the same day, which consisted of testing the UPS for a given time. We observed a 5-second window between the power cut and the power recovery.
This led to an instant restart of our entire Availability Zone (AZ). Unfortunately, we can't provide more insight about the origin of the failure for the moment, except that a faulty component caused the power cut. This is unacceptable, since we run at least 2N (mostly 2N+1) power feeds with separate power sources: a single faulty component shouldn't have any impact on the two power lines.
We’re expecting a better postmortem from our data center operator.
Networking connectivity issue
Our network relies on multiple transit providers to ensure both good Internet connectivity and good resilience. These transits are spread across three Points of Presence (PoP). During the restart of the network routers, there appeared to be an issue with reaching the region via public connectivity, despite these multiple transits. Our observability system was unfortunately located in the lost AZ, so we lost the flow data needed to analyze traffic during the power cut.
Known Issues
- In some cases, when a router reboots, it may start receiving traffic before it knows where to route it. A specific startup sequence may be enforced to avoid this behavior.
Possible Causes of Downtime
- A transit partner did not notice that the session was down before the BGP session timed out (a few minutes), so traffic continued to be routed through them even though we were no longer there to handle it. Another transit partner continued to announce our prefixes even though they couldn't route the traffic to our network.
- There is a 3-minute hold time on the routers' BGP sessions, with no Bidirectional Forwarding Detection (BFD) to accelerate convergence (see the sketch below).
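To illustrate why this matters, here is a minimal sketch comparing worst-case failure detection with only the BGP hold timer versus BFD. The 180 s value matches the 3-minute timeout mentioned above; the BFD interval and multiplier are assumed, commonly used values, not our actual router configuration:

```python
# Rough comparison of failure detection time: BGP hold timer alone vs. BFD.
# The BFD values are illustrative defaults, not our production settings.

BGP_HOLD_TIME_S = 180        # the 3-minute hold time mentioned above
BFD_TX_INTERVAL_MS = 300     # assumed BFD transmit interval
BFD_DETECT_MULTIPLIER = 3    # assumed detection multiplier


def bgp_detection_time_s(hold_time_s: float) -> float:
    """Worst case: the peer is only declared down when the hold timer expires."""
    return hold_time_s


def bfd_detection_time_s(tx_interval_ms: float, multiplier: int) -> float:
    """BFD declares the path down after `multiplier` missed control packets."""
    return tx_interval_ms * multiplier / 1000.0


if __name__ == "__main__":
    print(f"BGP hold timer only: up to {bgp_detection_time_s(BGP_HOLD_TIME_S):.0f} s")
    print(f"With BFD:            ~{bfd_detection_time_s(BFD_TX_INTERVAL_MS, BFD_DETECT_MULTIPLIER):.1f} s")
    # During that window, a transit partner may keep routing traffic to us,
    # or keep announcing our prefixes, even though we can no longer handle it.
```

Under these assumptions, detection drops from three minutes to under a second, which is the kind of convergence improvement targeted by the networking actions listed at the end of this postmortem.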
Unclear Areas
- In the nominal situation, most of our traffic is gathered at the TH2 PoP, so we shouldn't have seen such a significant impact. However, most of our traffic exits through another transit partner via the affected AZ (this correlates with one of the possible causes of downtime above)
- Did the other routers still hold announcements (received from the routers of the powered-off AZ) when they received traffic to send to the Internet, because the BGP sessions hadn't yet timed out? (This also correlates with one of the possible causes of downtime above)
The network connectivity issues were resolved at 15:15 CET. We run a simulated copy of our network topology inside the GNS3 emulator, and we will quickly replay this scenario there to try to reproduce the issue and test some enhancements.
Messaging Maintenance (Apache Pulsar)
Learning from a previous outage last summer, we invested a lot in this layer; the enhancements focused on improving the resilience and fault tolerance of the system. This includes better monitoring, automated failover processes, auto-healing, and enhanced data replication strategies with better isolation.
By leveraging Pulsar’s disaggregated architecture and the investments made in infrastructure and operational practices, the system was able to recover swiftly from the loss of an AZ. This resilience underscores the importance of continuous improvement and preparedness in maintaining high availability and reliability.
While the service wavered for a couple of minutes, reshuffling all connections onto the remaining AZs, we confirmed Pulsar's ability to recover quickly. Yet the proxy layer combined with the metadata management wasn't able to quickly accept a massive number of new connections (hundreds of thousands) in a very short time window.
We had already identified the problem with metadata management, and over the year we made many improvements, which led us to conclude that we need a more radical change toward a more scalable solution. Unfortunately, this work is not ready yet. We will also enhance connection management (see the sketch below).
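As an illustration of the kind of client-side connection management that avoids reconnection storms, here is a minimal sketch of a capped exponential backoff with full jitter. The `connect` callback and all parameter values are hypothetical, not our actual Pulsar client configuration:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    connect: Callable[[], T],       # hypothetical callback that opens one connection
    base_delay_s: float = 0.5,
    max_delay_s: float = 30.0,
    max_attempts: int = 10,
) -> T:
    """Retry a connection with capped exponential backoff and full jitter.

    Spreading reconnections over time prevents hundreds of thousands of
    clients from hitting the proxy/metadata layer at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            # Full jitter: sleep a random duration between 0 and the capped backoff.
            time.sleep(random.uniform(0, min(max_delay_s, base_delay_s * 2 ** attempt)))
    raise ConnectionError(f"could not connect after {max_attempts} attempts")
```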
Hypervisor I/O and CPU pressure, images
The difficulty in accepting a massive number of new connections on the Apache Pulsar messaging layer led to buffering in the telemetry agents on the Virtual Machines (VMs) running on the hypervisors (servers). Customer VMs embed an observability agent named Vector, which we chose for its efficiency and safety. We had anticipated that this agent buffering would induce I/O pressure on the hypervisors.
This overload impacted small VMs that reached their memory limit, which triggered swapping and then a lot of pressure on the hypervisor. Over the last months we changed our base image format to improve the testability of images. Guard rules have been developed to ensure better process isolation using cgroups inside these new images, preventing small VMs from stressing memory. Unfortunately, these changes were still being evaluated to validate the stability of the design and hadn't made it to production at the time of the incident.
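As an illustration of the kind of guard rule involved, here is a minimal sketch that confines a telemetry agent in a cgroup v2 group with a memory cap and no swap, so buffering cannot drag the whole VM into swapping. The cgroup name and limit values are hypothetical examples, not the values shipped in our images, and the sketch assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup:

```python
from pathlib import Path

# Hypothetical cgroup v2 guard for a telemetry agent such as Vector.
# The limits below are examples only, not the values used in our images.
CGROUP = Path("/sys/fs/cgroup/telemetry-agent")


def setup_memory_guard(pid: int, high: str = "128M", maximum: str = "192M") -> None:
    """Create the cgroup, set soft/hard memory limits, and move the agent into it."""
    CGROUP.mkdir(exist_ok=True)
    # memory.high throttles the workload before it reaches the hard limit...
    (CGROUP / "memory.high").write_text(high)
    # ...and memory.max is the hard cap, after which the kernel reclaims or OOM-kills.
    (CGROUP / "memory.max").write_text(maximum)
    # Forbid swap for this group so buffering cannot push the VM into swapping.
    (CGROUP / "memory.swap.max").write_text("0")
    # Finally, move the agent process into the group.
    (CGROUP / "cgroup.procs").write_text(str(pid))
```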
As a consequence, these overloaded VMs were seen as unavailable by the monitoring and then rescheduled, which caused a chicken-and-egg situation: we needed the scheduler to perform massive actions and keep up with deployments, but the control plane was performing sub-optimally and the infrastructure struggled to provide available resources.
Also, a few hours before the incident, we had deployed several enhanced images with new tooling but less available space. This led to deployment errors for some customers. We rolled back some images, globally or for specific applications, and will fix this issue before the next image update.
Orchestration
Initially, we implemented an orchestration system based on an offering model (similar to Apache Mesos), which provided a solid foundation for managing our services. In an offering model, an orchestrator manages the allocation and deallocation of resources, such as virtual machines (VMs), to ensure efficient utilization and meet service demands.
However, the orchestrator can become stuck due to semaphore locking issues during VM assignment for several reasons:
- Resource Contention: Semaphores are used to control access to shared resources, ensuring that only a limited number of processes can access them simultaneously. If multiple VMs are attempting to acquire the same resource concurrently, contention can occur, leading to semaphore locks. This can cause the orchestrator to wait indefinitely if the locks are not released promptly.
- Deadlocks: A deadlock situation arises when two or more processes are each holding a resource and waiting for the other to release a resource they need. In the context of VM assignment, if the orchestrator and other processes are waiting for each other to release semaphores, it can result in a deadlock, causing the orchestrator to become stuck.
- Improper Semaphore Management: If semaphores are not properly initialized, acquired, or released, it can lead to locking issues. For instance, if a semaphore is not released after a VM assignment operation completes, other processes waiting for that semaphore will remain blocked, including the orchestrator.
- High Load and Scalability Issues: Under high load conditions, the orchestrator may struggle to manage the increased number of VM assignment requests. If semaphores are used to serialize access to critical sections of code, excessive locking can occur, leading to performance bottlenecks and causing the orchestrator to become unresponsive.
- Timeouts and Retries: If the orchestrator encounters timeouts while waiting for semaphores to be released, it may repeatedly retry the operation. Without proper back-off strategies, this can exacerbate contention and further delay VM assignments, leading to a stuck state.
To mitigate these issues, it’s essential to implement robust semaphore management practices, monitor resource usage, and employ strategies to detect and resolve deadlocks. Additionally, optimizing the orchestrator’s algorithms for scalability and efficiency can help prevent semaphore locking problems.
The problem is that while this locking mechanism provides the best determinism in an offering model, it can also lower the rate of VM instance creation. During the recovery process, we had to continuously adjust this locking variable to allow more services to be queued for scheduling (see the sketch below).
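To make the trade-off concrete, here is a minimal sketch of the pattern described above: VM assignments are serialized behind a semaphore, acquisitions use a timeout plus jittered backoff instead of waiting forever, and the permit count plays the role of the "locking variable" that we raised during recovery. Names and values are illustrative, not our orchestrator's actual code:

```python
import random
import threading
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class AssignmentGate:
    """Illustrative only: a bounded pool of assignment slots for VM scheduling."""

    def __init__(self, permits: int):
        self._slots = threading.Semaphore(permits)

    def assign_vm(self, schedule: Callable[[], T],
                  timeout_s: float = 5.0, max_attempts: int = 5) -> T:
        """Acquire a slot with a timeout and jittered backoff, then schedule."""
        for attempt in range(max_attempts):
            if self._slots.acquire(timeout=timeout_s):
                try:
                    return schedule()        # perform the actual VM placement
                finally:
                    self._slots.release()    # always release, even on failure
            # Back off with jitter instead of hammering the semaphore again.
            time.sleep(random.uniform(0, min(10.0, 0.5 * 2 ** attempt)))
        raise TimeoutError(f"no assignment slot acquired after {max_attempts} attempts")

    def grow(self, extra_permits: int) -> None:
        """Raise the concurrency limit at runtime (the variable tuned during recovery)."""
        for _ in range(extra_permits):
            self._slots.release()
```

The timeout plus backoff keeps a stuck holder from blocking the orchestrator indefinitely, while `grow()` mirrors the manual tuning we performed during the recovery.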
However, recognizing the need for enhanced capabilities and a broader range of product offerings, we have transitioned over the past few months to a new orchestration model. This shift is designed to offer more performance, more resilience, advanced features, and support for a wider diversity of products. The new orchestration is being rapidly implemented as a new backend, and we are making swift progress in integrating it into our existing infrastructure. This upgrade will not only improve our operational efficiency but also enable us to deliver more innovative solutions to our clients. It is already in production for the Logs Drains, which behaved as expected during the incident.
Actions
Databases
- #63 Enhance replication monitoring for MySQL databases
- #290 Enhance all databases with a replication API so that more customers can add a replica database for more resilience
Runtimes
- #856 Enforce guard I/O rules at Hypervisor level
- #857 Enhance observability for small profile VM memory pressure
Networking
- #832 Ensure OSPF/BGP convergence times on routers
- #898 Enhance boot sequence process on routers
- #899 Ensure good resiliency for border routing BGP announcements
Orchestration
- #900 Ensure a failsafe VM restart capability
- #901 Enhance the hypervisor start sequence process
- #902 Ensure a better network configuration compliance across upgrades
- #864 Better observability for unresponsive VMs
- #903 Improve AZ monitoring resiliency
Conclusion
Our response to this outage demonstrated significant improvements compared to previous incidents. The team’s actions and coordinated efforts mitigated the impact and expedited the resolution process. Most of the challenges we encountered were related to ongoing projects and enhancements that are nearing completion. These issues have provided valuable insights that will further strengthen our systems and processes.
We are confident that the measures we are implementing will prevent similar occurrences in the future and enhance our overall reliability and performance. We're sincerely sorry for the inconvenience caused to our customers, and we're committed to providing the best quality of service. We plan to organize a meeting with every customer who requests one to explain both the incident and our actions.