2025-03-03

An electrical power outage triggered a cascade of errors that affected platform stability and customer service. Here is a detailed explanation of the issues we encountered and what we've learned.

On Monday, March 3rd, 2025, Clever Cloud's platform became unstable, leading to downtime of varying duration and scope for many customers. The impact affected both the API itself and some services running as add-ons or runtimes.

These events occurred because:

  • A power outage at one of our data center providers caused a temporary disruption of our services on the PAR6 Availability Zone (AZ) of EU-FR-1 at 14:55 CET
  • As an indirect consequence, network traffic was disrupted and the infrastructure experienced I/O and memory pressure between 14:58 and 16:30 CET, leading to instabilities and downtime.

Impacts were observed for customers relying on the EU-FR-1 (PAR) region, on remote zones depending on the EU-FR-1 control plane (GOV, OVHcloud, Scaleway, and Oracle), and on some private zones. On-premises zones weren't impacted.

Status updates were communicated through https://www.clevercloudstatus.com/incident/959

Product | Impact
Runtimes | Applications may have experienced performance degradation and traffic issues due to impacted hypervisors and load balancers
Runtime deployments | The incident prevented new deployments, upscaling and downscaling operations from succeeding because the Deployment API was unreachable
Databases (MySQL, PostgreSQL, Redis, MongoDB) | The incident may have caused performance degradation or unavailability for customers whose databases were running on affected hypervisors. No data loss occurred
Cluster services (Elastic, Couchbase, Cassandra, …) | Service may have been maintained on some nodes, depending on infrastructure pressure and the technology used. Some clusters were impacted during the incident, others were not
VPN / IPsec | Most VPN endpoints were unavailable during the incident
Observability | The service successfully recovered from the electrical issue, and logs remained available

Timeline

Time | Component | Description
14:55 CET | Infrastructure | Power outage at PAR6 (Green Data) data center
14:56 CET | Load Balancers | Monitoring reports the loss of a load balancer (first event)
14:56 CET | Infrastructure | Monitoring shows a global incident reporting many unavailabilities
14:57 CET | Infrastructure | Infrastructure team opens a crisis conference
14:58 CET | Infrastructure | Power is back, servers and network equipment are restarting (confirmed later by the data center incident report)
14:58 CET | Customers | Support team received notifications from external monitoring and customers that some websites are down (Console, admin, etc.)
14:58 CET | Customers | Tech team reported that GitLab is also unavailable
14:58 CET | Infrastructure | Tech team created an incident channel #incident-03-03-2025
15:00 CET | Infrastructure | Infrastructure team summoned the other on-call teams and members on the channel #incident-03-03-2025
15:02 CET | Infrastructure | Hypervisors and network equipment have rebooted
15:02 CET | Infrastructure | Several PAR6 hypervisors (12 of them) indicate that their RAID arrays are rebuilding
15:05 CET | Infrastructure | Report that AZ monitoring is under heavy load
15:06 CET | Infrastructure | Infrastructure team confirmed that every hypervisor on PAR6 had been recovered for at least 4 to 6 minutes
15:08 CET | Network | A network provider hosted in PAR6 reports two power line cuts
15:10 CET | Infrastructure | Report from one of our network providers confirming the electrical outage
15:14 CET | Messaging | Pulsar is back online
15:14 CET | Observability | Pulsar in-memory observability didn't restart properly
15:18 CET | Observability | Pulsar in-memory observability is online
15:19 CET | Infrastructure | A RabbitMQ node came back to the cluster after we fixed the communication between the hypervisor and the queue
15:30 CET | Infrastructure | Seeing a lot of WireGuard connectivity issues on multiple distributed clusters
15:30 CET | DBaaS | Listed all GitLab databases
15:33 CET | Infrastructure | Some hypervisors are overloaded, probably because Vector/Telegraf cannot push data to Warp10/Pulsar. Started to list all VMs with a CPU load > 1 to relieve the infrastructure
15:37 CET | Infrastructure | Checking whether Ceph/Cellar are okay. Some daemons seem to hang and 1/4 of OSDs are still down on PAR6; requests are served correctly
15:38 CET | DBaaS | Asked the DBaaS team to prioritize Clever Cloud's GitLab databases
15:38 CET | Infrastructure | Listed all applications with a VM CPU load > 1, starting to kill them to relieve the infrastructure
15:42 CET | Infrastructure | Fixed a network setup issue on a specific hypervisor
15:45 CET | Messaging | Listed all GitLab and Premium services databases for the DBaaS team
15:45 CET | Observability | A global maintenance is set on all monitoring servers in EU-FR-1 to avoid flooding the Orchestration core API with redeploy events
15:45 CET | DBaaS | Asked to reboot the PostgreSQL add-on provider in priority
15:49 CET | Deployments | 2.8k VMs were missing from the infrastructure
15:52 CET | Infrastructure | Some deployments are failing on PAR6 because unbound has not restarted on some hypervisors, leaving no DNS available
15:54 CET | Observability | Seeing timeouts in Vector logs when connecting to Pulsar
15:55 CET | Infrastructure | Unlocked the first hypervisors to allow database reboots (hypervisor iptables rules)
15:56 CET | Messaging | We hit the max connections threshold on Pulsar
15:56 CET | Infrastructure | Vector log timeouts fixed by corrective iptables rules on the hypervisor
15:59 CET | Infrastructure | All generated hypervisor iptables rules are restored on PAR6 hypervisors and services have been restarted
16:03 CET | Infrastructure | Connectivity issues due to a conntrack issue (need to purge the conntrack table)
16:05 CET | Customers | Some load balancers are unsynced due to the RabbitMQ crash
16:05 CET | Load Balancers | Start the rolling restart of unsynced load balancers
16:06 CET | Infrastructure | Some deployments are still failing on PAR6 due to WireGuard problems with the DNS recursor of the AZ; the recursor is dropped from unbound / resolv.conf on all PAR6 hypervisors
16:07 CET | Load Balancers | For some applications it's not a load balancer issue: the application has no instances running
16:08 CET | Infrastructure | Dropped UDP conntracks on the hypervisor of the PAR6 DNS recursor to recover WireGuard mesh topologies
16:19 CET | Messaging | WireGuard issues on the BookKeeper cluster for Pulsar
16:28 CET | Deployments | Purge delayed alerts on Zabbix monitoring
16:30 CET | Deployments | Restart all applications without any VMs
16:34 CET | Deployments | Reboot of Keycloak
16:42 CET | Customers | Couchbase nodes have been reintroduced into their cluster
16:45 CET | Infrastructure | Finished killing overloaded VMs
16:47 CET | Deployments | Stop instances that are blocked and have failed their deployments, and redeploy the application
16:47 CET | Deployments | Fix a Ruby error on the bastion that prevented applications from being redeployed
16:58 CET | Deployments | Stacking semaphores in the BDS orchestrator
16:58 CET | Infrastructure | Network issue on Warp10 FoundationDB
17:00 CET | Deployments | Disabled HA replication of some queues to speed up RabbitMQ
17:18 CET | Load Balancers | Restart all add-on providers to be able to properly reload the last load balancers
17:20 CET | Load Balancers | End of the rolling restart of all impacted load balancers at Clever Cloud
17:28 CET | Observability | Zabbix on PAR8 back in action
17:31 CET | Deployments | Still 1.3k VMs to deploy
17:35 CET | Infrastructure | Two FoundationDB VMs could not be started because their NVMe was unavailable
17:37 CET | Deployments | Increased concurrent deployments from 300 to 400 to speed up recovery
17:50 CET | Observability | Cleaning up old hosts in Paris monitoring to avoid sending too many outdated monitoring events
17:57 CET | Observability | Zabbix on PAR7 back in action
18:00 CET | Deployments | Started to restart all applications still seen as unavailable by monitoring
18:05 CET | Infrastructure | Restart unsynced Sydney orchestration
18:05 CET | Observability | Zabbix on PAR6 back in action
18:18 CET | Infrastructure | Force-stop all VMs in stopping status
18:45 CET | Observability | Remove the maintenance mode of monitoring
18:53 CET | Deployments | Deploy the last application in failing deployment state
18:54 CET | Infrastructure | Investigating under-replicated Kafka partitions on Warp10
19:11 CET | Deployments | Finished restarting all applications still seen as unavailable by Zabbix
19:12 CET | Infrastructure | Starting to recover the Ceph cluster: two physical machines don't have any network, one has a network configuration issue
19:16 CET | Infrastructure | Start the recovery of the Ceph cluster
19:23 CET | Core APIs | Restart of the Logs API
20:25 CET | Infrastructure | All Ceph machines are correctly restarted, the cluster is recovering 0.3% of degraded objects
20:48 CET | Infrastructure | Ceph cluster has recovered the degraded objects, HEALTH_OK

Analysis

Power Outage

We're still waiting for a detailed explanation from the data center operator, but a planned maintenance was scheduled the same day, which consisted of testing the UPS for a given time. We observed a 5-second window between the power cut and the power recovery.

This led to an instant restart of our entire Availability Zone (AZ). Unfortunately, we can't provide more insight into the origin of the failure for the moment, except that a faulty component caused the power cut. This is unacceptable, since we run at least 2N (mostly 2N+1) power feeds with separate power sources: a single faulty component shouldn't have any impact on both power lines.

We’re expecting a better postmortem from our data center operator.

Networking connectivity issue

Our network relies on multiple transit providers to ensure both good Internet connectivity and good resilience. These transits are spread across three Points of Presence (PoP). During the restart of the network routers, there appeared to be an issue with accessing the region via public connectivity, despite these multiple transits. Our observability system was unfortunately located in the lost AZ, so we lost the flows needed to analyze the traffic during the power cut. Known issues:

  • In some cases, when a router reboots, it may start receiving traffic before it knows where to route it. A specific start sequence may be enforced to avoid this behavior (see the sketch below).
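
As an illustration of such an enforced start sequence (conceptual only, not our actual router tooling), the idea is to withhold BGP announcements until the forwarding plane is ready. The readiness callables below are assumptions supplied by the caller:

```python
import time
from typing import Callable

def gated_boot(
    fib_ready: Callable[[], bool],          # e.g. poll that the forwarding table is programmed
    igp_converged: Callable[[], bool],      # e.g. check internal routing adjacencies are up
    announce_prefixes: Callable[[], None],  # only now start advertising prefixes via BGP
    timeout_s: float = 300.0,
    poll_s: float = 2.0,
) -> None:
    """Withhold BGP announcements until the router can actually forward traffic."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fib_ready() and igp_converged():
            announce_prefixes()
            return
        time.sleep(poll_s)
    raise TimeoutError("router did not become ready; announcements stay withheld")
```

The same ordering can usually be expressed directly in the router's startup configuration; the sketch only makes the dependency explicit.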

Possible Causes of Downtime

  • A transit partner did not notice that the session was down until the BGP session timed out (a few minutes later), so traffic continued to be routed through them even though we were no longer there to handle it. Another transit partner continued to announce our prefixes even though it couldn't route the traffic to our network.
  • There is a 3-minute timeout on the routers' BGP sessions, with no Bidirectional Forwarding Detection (BFD) to accelerate convergence (see the comparison below).
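
To give an order of magnitude, the sketch below compares the worst-case detection time of a dead peer with the 3-minute hold timer against BFD. The BFD interval and multiplier are common illustrative values, not our production configuration:

```python
# Worst-case failure detection time for a dead peer (illustrative values).

bgp_hold_time_s = 180  # the 3-minute hold timer mentioned above
print(f"BGP keepalives alone: up to {bgp_hold_time_s} s of blackholed traffic")

# BFD sends lightweight hello packets; the session is declared down after
# `detect_multiplier` consecutive misses. 300 ms x 3 is a common setting,
# used here purely as an assumption.
bfd_tx_interval_ms = 300
bfd_detect_multiplier = 3
bfd_detection_s = bfd_tx_interval_ms * bfd_detect_multiplier / 1000
print(f"With BFD: roughly {bfd_detection_s} s before the BGP session is torn down")

print(f"Speed-up: about {bgp_hold_time_s / bfd_detection_s:.0f}x faster detection")
```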

Unclear Areas

  • In the nominal situation, most of our traffic is gathered at the TH2 PoP, so we shouldn't have seen such a significant impact. However, most of our traffic exits through another transit partner via the affected AZ (this correlates with one of the possible causes of downtime above)
  • Did the other routers still hold announcements (received from the routers of the powered-off AZ) when they received traffic to send to the Internet, because the BGP sessions hadn't yet timed out? (this correlates with one of the possible causes of downtime above)

Network issues were resolved by 15:15 CET. We run a simulated copy of our network topology inside the GNS3 emulator; we will quickly replay this scenario to try to reproduce the issue and test some enhancements.

Messaging Maintenance (Apache Pulsar)

Learning from a previous outage last summer, we invested a lot in this layer; the enhancements focused on improving the resilience and fault tolerance of the system. This includes better monitoring, automated failover processes, auto-healing and enhanced data replication strategies with better isolation.

By leveraging Pulsar’s disaggregated architecture and the investments made in infrastructure and operational practices, the system was able to recover swiftly from the loss of an AZ. This resilience underscores the importance of continuous improvement and preparedness in maintaining high availability and reliability.

While the service wavered for a couple of minutes, reshuffling all connections onto the remaining AZs, we confirmed Pulsar's ability to recover quickly. Yet the proxy layer combined with the metadata management wasn't able to quickly accept a massive number of new connections (hundreds of thousands) in a very short time window.

We had already identified the metadata management problem, and over the past year we made many improvements, which led us to decide that we need a radical redesign for a more scalable solution. Unfortunately this work is not ready yet. We will also enhance connection management.
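
On the connection-management side, a common way to avoid hundreds of thousands of clients reconnecting in the same instant is capped, jittered exponential back-off. The sketch below is a generic Python illustration, assuming a `connect` callable supplied by the caller; it is not our agents' actual reconnection code:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def connect_with_backoff(
    connect: Callable[[], T],        # e.g. open a connection to the messaging proxy
    max_attempts: int = 8,
    base_delay_s: float = 0.5,
    max_delay_s: float = 60.0,
) -> T:
    """Retry a connection with capped, jittered exponential back-off so that
    a huge fleet of clients does not reconnect in the same instant."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except OSError:  # network-level failures; adjust to the client library's exceptions
            if attempt == max_attempts - 1:
                raise
            # Full jitter: spread retries uniformly over the back-off window.
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
    raise RuntimeError("unreachable")
```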

Hypervisor IO/CPU pressure, images

The difficulty in accepting a massive number of new connections on the Apache Pulsar messaging layer led telemetry agents to buffer data on the Virtual Machines (VMs) running on the hypervisors (servers). Customer VMs embed an observability agent named Vector, which we chose for its efficiency and safety. We anticipated that this agent buffering would induce I/O pressure on the hypervisors.

This overload had an impact on small VMs that reached their memory limit, which triggered swapping and then a lot of pressure on the hypervisor. Over the last months we changed our base image format to enhance the testability of images. Guard rules have been developed to ensure better process isolation using cgroups inside these new images, preventing small VMs from stressing memory. Unfortunately, these changes were still being evaluated to ensure the stability of the design and hadn't made it to production at the time of the incident.
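
For readers unfamiliar with the mechanism, such guard rules can rely on the standard cgroup v2 interface. The following minimal Python sketch shows the general idea of capping a process group's memory so it is reclaimed or OOM-killed before it can push the whole VM into swap; the group name, limits and PID are placeholders, not our production values:

```python
from pathlib import Path

# Standard cgroup v2 mount point on most Linux systems.
CGROUP_ROOT = Path("/sys/fs/cgroup")

def cap_memory(group: str, pid: int, high: str = "192M", maximum: str = "256M") -> None:
    """Place `pid` in a dedicated cgroup with a soft (memory.high) and hard
    (memory.max) memory limit, so a misbehaving process is throttled or killed
    instead of dragging the whole VM into swap."""
    cg = CGROUP_ROOT / group
    cg.mkdir(parents=True, exist_ok=True)
    (cg / "memory.high").write_text(high)    # kernel starts reclaiming above this
    (cg / "memory.max").write_text(maximum)  # hard cap: OOM kill inside the group
    (cg / "cgroup.procs").write_text(str(pid))

# Example (requires root and cgroup v2): cap_memory("telemetry-agent", agent_pid)
```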

As a consequence, these overloaded VMs were seen as unavailable by the monitoring and then rescheduled, which caused a chicken-and-egg situation: we needed the scheduler to perform massive actions and honor deployments, but the control plane was performing sub-optimally and the infrastructure struggled to provide available resources.

Also, a few hours before the incident, we had deployed multiple enhanced images, with new tooling but less available space. This led to deployment errors for some customers. We rolled back some images, either globally or for specific applications, and will fix this issue before the next image update.

Orchestration

Initially, we implemented an orchestration system based on an offering model (similar to Apache Mesos), which provided a solid foundation for managing our services. In an offering model, an orchestrator manages the allocation and deallocation of resources, such as virtual machines (VMs), to ensure efficient utilization and meet service demands.
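
For readers unfamiliar with the model, here is a deliberately tiny Python sketch of what offer-based placement looks like conceptually; the data shapes and first-fit strategy are illustrative assumptions, not our orchestrator's actual design:

```python
from dataclasses import dataclass

@dataclass
class Offer:
    """Spare capacity advertised by one hypervisor."""
    hypervisor: str
    cpus: int
    memory_mb: int

def schedule(pending_vms, offers):
    """Toy offer loop: place each pending VM onto the first offer that fits,
    decrementing the offer's remaining capacity as placements are made."""
    placements = []
    for vm_name, cpus, memory_mb in pending_vms:
        for offer in offers:
            if offer.cpus >= cpus and offer.memory_mb >= memory_mb:
                placements.append((vm_name, offer.hypervisor))
                offer.cpus -= cpus
                offer.memory_mb -= memory_mb
                break
    return placements
```

The point is that placement decisions are driven by offers of spare capacity, and it is this assignment step that the semaphore-based locking discussed below serializes.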

However, the orchestrator can become stuck due to semaphore locking issues during VM assignment for several reasons:

  • Resource Contention: Semaphores are used to control access to shared resources, ensuring that only a limited number of processes can access them simultaneously. If multiple VMs are attempting to acquire the same resource concurrently, contention can occur, leading to semaphore locks. This can cause the orchestrator to wait indefinitely if the locks are not released promptly.
  • Deadlocks: A deadlock situation arises when two or more processes are each holding a resource and waiting for the other to release a resource they need. In the context of VM assignment, if the orchestrator and other processes are waiting for each other to release semaphores, it can result in a deadlock, causing the orchestrator to become stuck.
  • Improper Semaphore Management: If semaphores are not properly initialized, acquired, or released, it can lead to locking issues. For instance, if a semaphore is not released after a VM assignment operation completes, other processes waiting for that semaphore will remain blocked, including the orchestrator.
  • High Load and Scalability Issues: Under high load conditions, the orchestrator may struggle to manage the increased number of VM assignment requests. If semaphores are used to serialize access to critical sections of code, excessive locking can occur, leading to performance bottlenecks and causing the orchestrator to become unresponsive.
  • Timeouts and Retries: If the orchestrator encounters timeouts while waiting for semaphores to be released, it may repeatedly retry the operation. Without proper back-off strategies, this can exacerbate contention and further delay VM assignments, leading to a stuck state.

To mitigate these issues, it’s essential to implement robust semaphore management practices, monitor resource usage, and employ strategies to detect and resolve deadlocks. Additionally, optimizing the orchestrator’s algorithms for scalability and efficiency can help prevent semaphore locking problems.
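
To make the last two points more concrete, here is a minimal Python sketch, assuming a simple thread-based orchestrator rather than our actual implementation, of acquiring an assignment slot with a timeout and jittered back-off instead of blocking indefinitely:

```python
import random
import threading
import time

# A limited number of concurrent VM assignments, mirroring the "locking
# variable" mentioned below (the value 300 is illustrative).
assignment_slots = threading.BoundedSemaphore(300)

def assign_vm(do_assignment, max_attempts: int = 5, timeout_s: float = 2.0):
    """Try to take an assignment slot; on contention, back off with jitter
    instead of waiting forever, so the orchestrator never wedges on one lock."""
    for attempt in range(max_attempts):
        if assignment_slots.acquire(timeout=timeout_s):
            try:
                return do_assignment()
            finally:
                # Always release, even on failure, to avoid leaking the slot
                # (the "improper semaphore management" failure mode above).
                assignment_slots.release()
        # Exponential back-off with jitter to avoid synchronized retries.
        time.sleep(random.uniform(0, 0.1 * 2 ** attempt))
    raise TimeoutError("could not acquire an assignment slot; deferring the VM")
```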

The problem is that, while this locking mechanism provides the best determinism in an offering model, it can also lower the rate of VM instance creation. During the recovery process, we had to continuously adjust this locking variable to allow more services to be queued for scheduling.

However, recognizing the need for enhanced capabilities and a broader range of product offerings, we have been transitioning over the past few months to a new orchestration model. This shift is designed to offer more performance, more resilience, advanced features and support for a wider diversity of products. The new orchestration is being rapidly implemented as a new backend, and we are making swift progress integrating it into our existing infrastructure. This upgrade will not only improve our operational efficiency but also enable us to deliver more innovative solutions to our clients. It is already in production for the Logs Drains, which behaved as expected during the incident.

Actions

Databases

  • #63 Enhance replication monitoring for MySQL databases
  • #290 Enhance all databases with replication API so that more customers can add a replication database for more resiliency

Runtimes

  • #856 Enforce guard I/O rules at Hypervisor level
  • #857 Enhance observability for small profile VM memory pressure

Networking

  • #832 Ensure OSPF/BGP convergence times on routers
  • #898 Enhance boot sequence process on routers
  • #899 Ensure good resiliency of border routing BGP announcements

Orchestration

  • #900 Ensure a failsafe VM restart capability
  • #901 Enhance hypervisor sequence of start process
  • #902 Ensure a better network configuration compliance across upgrades
  • #864 Better observability for unresponsive VM
  • #903 Improve AZ monitoring resiliency

Conclusion

Our response to this outage demonstrated significant improvements compared to previous incidents. The team’s actions and coordinated efforts mitigated the impact and expedited the resolution process. Most of the challenges we encountered were related to ongoing projects and enhancements that are nearing completion. These issues have provided valuable insights that will further strengthen our systems and processes.

We are confident that the measures we are implementing will prevent similar occurrences in the future and enhance our overall reliability and performance. We're sincerely sorry for the inconvenience caused to our customers, and we're committed to providing the best quality of service. We plan to organize a meeting with every customer who requests one, to explain both the incident and our actions.
