A minor update resulted in a cascade of errors: how it went wrong, what we’ve learnt

Incident
On Friday, August 2nd, 2024, Clever Cloud’s platform became very unstable, leading to downtime of varying duration and scope for customers using services in the EU-FR-1 (PAR) region and in remote zones that depend on the EU-FR-1 control plane (OVHcloud, Scaleway, and Oracle). Private and on-premises zones weren’t impacted.

Minor update, bigger consequences

The global incident started after we initiated a minor rolling maintenance upgrade of Apache Pulsar (3.3.1). The upgrade enabled a new load balancer algorithm for optimizing bundle placement, which caused problems. We quickly detected this, kept the 3.3.1 release and rolled back the defective configuration.

The initial problem, although fixed, resulted in a metadata conflict. We had to stop all messaging brokers and then start a single one, to prevent any metadata conflict between them, and we succeeded in getting back to a nominal situation. After that, however, our infrastructure experienced I/O and memory pressure, causing some hypervisors to crash.

While the Apache Pulsar messaging layer was unavailable, some telemetry agents on the virtual machines (VMs) running on our hypervisors (HVs) could no longer reach the messaging service endpoints, so they started buffering their data in memory.
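
To illustrate why in-memory buffering during an outage matters, here is a minimal sketch of a buffer with a hard cap. It is not the code of the telemetry agents involved in the incident, nor the fix we applied: the class name `BoundedTelemetryBuffer`, its `max_events` parameter and the `send` callback are hypothetical names chosen for this example. The idea is simply that once the cap is reached, the oldest events are dropped instead of letting memory usage grow.

```python
from collections import deque


class BoundedTelemetryBuffer:
    """Hypothetical in-memory buffer that caps how many events it retains."""

    def __init__(self, max_events: int) -> None:
        if max_events <= 0:
            raise ValueError("max_events must be positive")
        # deque(maxlen=...) discards the oldest item when a new one is
        # appended to a full buffer.
        self._events = deque(maxlen=max_events)
        self.dropped = 0  # number of events lost because the buffer was full

    def push(self, event: bytes) -> None:
        """Store an event, evicting the oldest one if the cap is reached."""
        if len(self._events) == self._events.maxlen:
            self.dropped += 1
        self._events.append(event)

    def drain(self, send) -> int:
        """Send buffered events oldest first; stop at the first failure.

        `send` is an assumed callback returning True on success.
        """
        sent = 0
        while self._events:
            if not send(self._events[0]):
                break  # endpoint still unavailable, keep the rest buffered
            self._events.popleft()
            sent += 1
        return sent


# Usage: the endpoint is down while five events arrive, then it recovers.
buf = BoundedTelemetryBuffer(max_events=3)
for i in range(5):
    buf.push(f"metric-{i}".encode())
print(buf.dropped)  # 2 events were evicted to respect the cap
```

This sketch caps the number of events rather than their size in bytes, which is the simplest form of the idea; a real agent would more likely bound total bytes or spill to disk.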

Reaching the limits of our smaller VMs

On small VMs, this buffering reached the memory limit, which put a lot of pressure on the kernel. When this happens, the kernel tries to flush memory and caches, and removes processes from memory, which then have to be loaded back from disk instruction by instruction. This generates a lot of disk I/O pressure on the underlying hypervisor. A few minutes after the load on the HVs increased, we started to see kernel panics on some of them.

Small VMs are spread across all our Availability Zones, so we ended up with an overloaded server infrastructure. We needed the scheduler to manage the situation, but the control plane was performing suboptimally and the infrastructure struggled to provide available resources.

So we had to shut down all non-critical services, and our SRE team redeployed the overloaded small VMs one by one. We then started to regain control over all hypervisors and were able to bring our infrastructure back to its optimal state.

What’s next?

For non-minor releases, we use a simulation process that emulates a full environment into which we inject changes. This validates upgrades and changes by observing how the infrastructure behaves before they go to production. This process wasn’t used for minor maintenance upgrades or (apparently) small changes in the configuration profile.

This incident taught us a lot, and we identified several areas for improvement regarding our images, the configuration and integration of our tools, and the setup of small VMs. We’ve taken some immediate actions and will work on deeper topics in the coming weeks.

We sincerely apologize for any inconvenience this incident may have caused. Your trust is paramount to us, and we are committed to providing you with the highest level of service reliability. 

You can learn more about the schedule and technical analysis of the incident in our public postmortem. If you have any questions or concerns, please don’t hesitate to reach out to our support team.
