Minor update, bigger consequences
The global incident started after we initiated a minor rolling maintenance upgrade of Apache Pulsar (3.3.1). The upgrade enabled a new load balancer algorithm for bundle placement optimization, which caused trouble. We quickly detected it, kept the 3.3.1 release, and rolled back the defective configuration.
The initial problem, although fixed, resulted in a metadata conflict. We had to stop all messaging brokers and restart only one, preventing further metadata conflicts between them, and we succeeded in getting back to a nominal situation. But after that, our infrastructure experienced I/O and memory pressure, causing some hypervisors to crash.
The unavailability of the Apache Pulsar messaging layer led some telemetry agents on the virtual machines (VMs) running on our hypervisors (HVs) to buffer their data in memory while the messaging service endpoints were unavailable.
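To make the failure mode concrete, here is a minimal sketch of an agent falling back to an in-memory buffer when its endpoint is down. The names, sizes, and structure are illustrative only and do not come from our actual telemetry agents.

```python
import collections
import time

# Hypothetical sketch of the failure mode: a telemetry agent that keeps
# batches in memory while its messaging endpoint is unreachable.
# Names and sizes are illustrative, not taken from our real agents.

MAX_BUFFERED_BATCHES = 10_000  # bounding the buffer caps worst-case memory use
buffer = collections.deque(maxlen=MAX_BUFFERED_BATCHES)  # drops oldest when full


def collect_batch() -> bytes:
    """Stand-in for gathering one batch of telemetry."""
    return b"metrics" * 128


def send(batch: bytes) -> None:
    """Stand-in for producing a batch to the messaging layer."""
    raise ConnectionError("messaging endpoint unavailable")  # simulate the outage


def run_once() -> None:
    batch = collect_batch()
    try:
        send(batch)
    except ConnectionError:
        # Without a maxlen, this buffer grows unbounded for as long as the
        # outage lasts, eventually exhausting the VM's memory.
        buffer.append(batch)


if __name__ == "__main__":
    for _ in range(100):
        run_once()
        time.sleep(0.01)
    print(f"{len(buffer)} batches held in memory")
```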
Reaching the limits of our smaller VMs
On small VMs, this buffering hit a memory limit, which put a lot of pressure on the kernel. When that happens, the kernel tries to flush memory and caches and evicts process pages, which then have to be reloaded from disk as the processes execute. This generates heavy disk I/O on the underlying hypervisor. A few minutes after HV load increased, we started to see kernel panics on some of them.
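One way to spot this kind of memory pressure before reclaim turns into disk I/O is Linux's pressure stall information (PSI). The sketch below reads /proc/pressure/memory; it assumes a kernel with PSI enabled (available since Linux 4.20) and uses an illustrative threshold, not the actual alerting we run.

```python
from pathlib import Path

# Minimal sketch: parse Linux PSI memory metrics from /proc/pressure/memory.
# Assumes a kernel built with CONFIG_PSI; the threshold below is illustrative.

PSI_MEMORY = Path("/proc/pressure/memory")


def memory_pressure() -> dict:
    """Parse /proc/pressure/memory into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in PSI_MEMORY.read_text().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value) for key, value in (f.split("=") for f in fields)
        }
    return result


if __name__ == "__main__":
    pressure = memory_pressure()
    # 'full' means all non-idle tasks were stalled on memory at the same time;
    # a sustained avg10 above a few percent usually precedes heavy reclaim I/O.
    if pressure["full"]["avg10"] > 5.0:
        print("memory pressure critical:", pressure["full"])
    else:
        print("memory pressure nominal:", pressure["full"])
```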
Small VMs are spread globally across our Availability Zones, so we ended up with an overloaded server infrastructure. Just when we needed the scheduler to manage the situation, the control plane was performing suboptimally and the infrastructure struggled to provide available resources.
So we had to shut down all non-critical services, and our SRE team redeployed the overloaded small VMs one by one. We then started to regain control over all hypervisors and were able to bring our infrastructure back to its optimal state.
What’s next?
For non-minor releases, we use a simulation process that emulates a full environment into which we inject changes. This validates upgrades and changes by observing the infrastructure's behavior before going to production. This process wasn't used for minor maintenance upgrades or (apparently) small changes in the configuration profile.
This incident taught us a lot, and we identified several areas for improvement regarding our images, tool configuration and integration, and the setup of our small VMs. We've taken some immediate actions and will work on deeper topics in the coming weeks.
We sincerely apologize for any inconvenience this incident may have caused. Your trust is paramount to us, and we are committed to providing you with the highest level of service reliability.
You can learn more about the timeline and technical analysis of the incident in our public postmortem. If you have any questions or concerns, please don't hesitate to reach out to our support team.