A minor update resulted in a cascade of errors: how it went wrong, what we’ve learnt

Incident
On Friday, August 2nd, 2024, Clever Cloud’s platform became very unstable, leading to downtime of varying duration and scope for customers using services in the EU-FR-1 (PAR) region and in remote zones that depend on the EU-FR-1 control plane (OVHcloud, Scaleway, and Oracle). Private and on-premises zones weren’t impacted.

Minor update, bigger consequences

The global incident started after we initiated a minor rolling maintenance upgrade of Apache Pulsar (3.3.1). The upgrade enabled a new load balancer algorithm for bundle placement optimization, which caused problems. We quickly detected it, kept the 3.3.1 release and rolled back the defective configuration.

The initial problem, although fixed, resulted in a metadata conflict. We had to stop all messaging brokers to start only one, preventing metadata conflict between them, and succeeded in getting back to a nominal situation. But after that our infrastructure experienced I/O and memory pressure, causing some hypervisors to crash.

The unavailability of the Apache Pulsar messaging layer caused some telemetry agents on the virtual machines (VMs) running on hypervisors (HVs) to buffer their data. They kept buffering in memory for as long as the messaging service endpoints weren’t available.
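To make this failure mode concrete, here is a minimal sketch of an in-memory buffer whose sink is unavailable. It is an illustration only, not the code of the telemetry agents involved: `BoundedBuffer`, `max_items` and `unavailable` are hypothetical names, and the size cap is an assumption added to show how a limit bounds memory usage at the cost of dropping the oldest data.

```python
from collections import deque


class BoundedBuffer:
    """Hypothetical telemetry buffer that keeps at most `max_items` messages."""

    def __init__(self, max_items):
        # A deque with maxlen evicts its oldest entry when a new one is appended.
        self.items = deque(maxlen=max_items)
        self.dropped = 0

    def push(self, message):
        if len(self.items) == self.items.maxlen:
            self.dropped += 1  # the append below evicts the oldest message
        self.items.append(message)

    def flush(self, send):
        """Send buffered messages in order; stop at the first failure."""
        sent = 0
        while self.items:
            if not send(self.items[0]):
                break
            self.items.popleft()
            sent += 1
        return sent


def unavailable(_message):
    # Simulates the messaging endpoint being down: every send fails.
    return False


buffer = BoundedBuffer(max_items=1000)
for i in range(5000):
    buffer.push(f"metric-{i}")
    buffer.flush(unavailable)

print(len(buffer.items), buffer.dropped)  # prints: 1000 4000
```

Without the cap (a plain list, for instance), the same loop would hold all 5000 messages and keep growing for as long as the endpoint stays down, which is the kind of growth a small VM cannot absorb.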

Reaching the limits of our smaller VMs

On small VMs, this buffering reached the memory limit, which put a lot of pressure on the kernel. When that happens, the kernel tries to flush memory and caches, evicting processes from memory and then reloading them from disk piece by piece as they are needed. This generates heavy disk I/O pressure on the underlying hypervisor. A few minutes after the load on the HVs increased, we started to see kernel panics on some of them.

Small VMs are spread across all our Availability Zones, so we ended up with an overloaded server infrastructure. At the very moment we needed the scheduler to manage the situation, the control plane was performing suboptimally and the infrastructure struggled to provide available resources.

So we had to shut down all non-critical services, and our SRE team redeployed the overloaded small VMs one by one. We then started to regain control over all hypervisors and were able to bring our infrastructure back to its optimal state.

What’s next?

For non-minor releases, we use a simulation process that emulates a full environment into which we inject changes. This validates upgrades and changes by observing the infrastructure’s behavior before going to production. This process wasn’t used for minor maintenance upgrades or (apparently) small changes to the configuration profile.

This incident taught us a lot, and we identified areas for improvement in our images, tools configuration and integration, and small VM setup. We’ve taken some immediate actions and will work on deeper topics in the coming weeks.

We sincerely apologize for any inconvenience this incident may have caused. Your trust is paramount to us, and we are committed to providing you with the highest level of service reliability. 

You can learn more about the schedule and technical analysis of the incident in our public postmortem. If you have any questions or concerns, please don’t hesitate to reach out to our support team.
