A minor update resulted in a cascade of errors: how it went wrong, what we’ve learnt

Incident
On Friday, August 2nd, 2024, Clever Cloud’s platform became very unstable, leading to downtime of varying duration and scope for customers using services in the EU-FR-1 (PAR) region and in remote zones depending on the EU-FR-1 control plane (OVHcloud, Scaleway, and Oracle). Private and on-premise zones weren’t impacted.

Minor update, bigger consequences

The global incident started after we initiated a minor rolling maintenance upgrade of Apache Pulsar (3.3.1). The upgrade enabled a new load balancer algorithm for bundle placement optimization, which caused trouble. We quickly detected the problem, kept the 3.3.1 release, and rolled back the faulty configuration.
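
For illustration only (this is not necessarily the exact setting involved in our configuration profile), the bundle load balancer in Pulsar 3.x is selected through a single broker.conf entry, so such a rollback amounts to switching back to the previous implementation:

    # broker.conf -- illustrative; our actual configuration profile differs
    # New extensible load manager introduced in Pulsar 3.x:
    # loadManagerClassName=org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl
    # Rolled back to the previous modular load manager:
    loadManagerClassName=org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl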

Although fixed, the initial problem resulted in a metadata conflict. We had to stop all messaging brokers and restart only one, preventing metadata conflicts between them, and we managed to return to a nominal situation. But soon after, our infrastructure experienced I/O and memory pressure, causing some hypervisors to crash.

The unavailability of the Apache Pulsar messaging layer caused the telemetry agents on the virtual machines (VMs) running on our hypervisors (HVs) to buffer their data in memory for as long as the messaging service endpoints were unavailable.
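
A bounded, drop-oldest buffer is one way to keep that kind of buffering from growing without limit. Here is a minimal sketch of the pattern (an assumed design, not our agents’ actual code):

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Minimal sketch (assumed design, not our agents' actual code): a bounded
    // in-memory buffer that sheds the oldest entries instead of growing without
    // limit while the messaging endpoint is unreachable.
    final class BoundedTelemetryBuffer {
        private final Deque<String> entries = new ArrayDeque<>();
        private final int maxEntries;

        BoundedTelemetryBuffer(int maxEntries) {
            this.maxEntries = maxEntries;
        }

        synchronized void offer(String metricLine) {
            if (entries.size() == maxEntries) {
                entries.pollFirst(); // drop the oldest sample rather than exhaust RAM
            }
            entries.addLast(metricLine);
        }

        synchronized String poll() {
            return entries.pollFirst(); // null when empty; the sender retries later
        }
    }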

Reaching the limits of our smaller VMs

On small VMs, this buffering reached a memory limit, which put a lot of pressure on the kernel. When this happens, the kernel tries to flush memory and caches and evicts processes from memory, reloading them from disk almost instruction by instruction. This generates a lot of disk I/O pressure on the underlying hypervisor. A few minutes after the hypervisors’ load increased, we started to see kernel panics on some of them.
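
A per-service memory cap is one way to keep a single runaway process from pushing a whole small VM into that state. As a purely illustrative sketch (the unit name and values are assumptions, not our actual setup), systemd can bound an agent so that it is reclaimed or killed before the entire VM starts thrashing:

    # /etc/systemd/system/telemetry-agent.service.d/memory.conf
    # Illustrative only: the unit name and limits are assumptions.
    [Service]
    MemoryAccounting=yes
    # Start reclaiming memory from this service above 128 MiB...
    MemoryHigh=128M
    # ...and let the kernel OOM-kill it at 192 MiB instead of letting it exhaust the VM.
    MemoryMax=192M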

Small VMs are spread across all of our Availability Zones, so we ended up with an overloaded server infrastructure. Just as we needed the scheduler to manage the situation, the control plane was performing suboptimally and the infrastructure struggled to provide available resources.

So we had to shut down all non-critical services, and our SRE team redeployed the overloaded small VMs one by one. We then started to regain control over all hypervisors and were able to bring our infrastructure back to its optimal state.

What’s next?

For major releases, we use a simulation process that emulates a full environment into which we inject changes. It validates upgrades and changes by observing the infrastructure’s behavior before they go to production. This process wasn’t used for minor maintenance upgrades or for seemingly small changes in the configuration profile.

This incident taught us a lot, and we identified several areas for improvement regarding our images, our tooling configuration and integration, and our small VM setup. We’ve already taken some immediate actions and will work on deeper topics in the coming weeks.

We sincerely apologize for any inconvenience this incident may have caused. Your trust is paramount to us, and we are committed to providing you with the highest level of service reliability. 

You can learn more about the timeline and technical analysis of the incident in our public postmortem. If you have any questions or concerns, please don’t hesitate to reach out to our support team.
