Our journey to a better Clever Cloud

Over two years ago, we decided to strengthen the Clever Cloud team. At that time, our goal was to better support our customers in their growth, respond to their requests and complete the development of new products more efficiently.

Since then, we’ve grown from around fifteen people to almost sixty, with a “remote first” approach, but still with several offices in France: Brest, Nantes, Lille, Lyon, Paris, with others in the pipeline. Most of our staff are developers and engineers, organized into teams to deal with all our activities, the cloud being quite broad in scope. An organizational challenge, managed by our CTO Steven Le Roux.

We have also made a major change to our infrastructure: we no longer rent our servers, we own them. We have direct contracts with the datacenters (DC) where we are present in Paris and with various service providers, which gives us greater control over our network.

We have also taken advantage of this period to initiate a major certification and qualification program of our practices and processes. We have been ISO 9001 certified since last year, and have just passed the audit for ISO 27001, which should be confirmed in the coming weeks. The next steps have already been defined: the health data hosting certification (HDS), followed by the SecNumCloud qualification, which has already begun.

Making all these changes has been a long and sometimes difficult process. We have learned along the way, and sometimes made mistakes. Some of our customers in particular have noticed this since the back to school period, when we carried out a major migration of our entire infrastructure. While we’ll be covering some topics in greater detail in future posts, here’s a first progress report on everything that’s been achieved and what’s gone (more or less) smoothly.

Three new datacenters

Last year we announced the first step in our plan: the installation of new servers, based on the open OCP format and AMD EPYC processors at Green Data in Nanterre. Besides setting up processes and contractual aspects, our goal here was to rethink the way we deploy and manage our servers and part of our network. A project on which we collaborated with NVIDIA/Mellanox.

As a result, none of our “Top of Rack” (ToR) switches runs an off-the-shelf operating system, with its myriad features and licenses. We install our own OS, based on Exherbo Linux and the ONIE standard, on which we deploy our own software stack. All of this is accelerated by switchdev.

Capri OCP servers used in our Nanterre datacenter

Brought into production in September 2022, this site has enabled us to sustain our growth. But it has also led us to experience some problems, and to learn how to manage and correct them. From time to time, some of our hypervisors ended up in “kernel panic” for no apparent reason, which was corrected with release 6.0 of the Linux kernel. We also had some network instabilities, not on our ToRs but on a component linking our datacenters together, which delayed the ramp-up on this site. This was only resolved before the summer, once everything had been understood and corrected.

Confident in our ability to handle these issues, we decided in December 2022 to duplicate this experience in two new DCs, to replace the two where we were historically present. The contract with our service provider was due to expire around a year later, and we were facing major disagreements over the terms of its renewal. Since then, we’ve been hard at work qualifying and ordering the equipment, selecting sites where it should be installed, and setting it up.

Several months of work led to the installation of these two new datacenters this summer, followed by their interconnection and configuration, with yet another important milestone: we now announce our own IPs there, working both with our long-standing network partner and with transit operators under direct contract.

We’ve also redesigned our entire DNS stack to make it more resilient, with faster resolution and updates. In case of unavailability (e.g. following a DDoS attack), other regions are no longer impacted regarding access to existing records. Each region now has a copy of the entire DNS configuration locally, so it can respond directly. We’ve also revised the record creation and update model, which now takes less than a second.
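To make the idea of per-region copies concrete, here is a toy model in Python. It is only a sketch of the principle described above, not our actual DNS stack: the `Region` class, the `publish` function, the region names and the addresses are all hypothetical, and the way a real system catches up a region after an outage is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Region:
    """Hypothetical region holding its own full copy of the DNS records."""
    name: str
    available: bool = True
    records: Dict[str, str] = field(default_factory=dict)

    def resolve(self, hostname: str) -> Optional[str]:
        # Answer from the local copy only: no other region is contacted.
        return self.records.get(hostname)


def publish(regions: List[Region], hostname: str, target: str) -> List[str]:
    """Write a record to every reachable region; return the names skipped."""
    skipped = []
    for region in regions:
        if region.available:
            region.records[hostname] = target
        else:
            skipped.append(region.name)
    return skipped


if __name__ == "__main__":
    a, b = Region("region-a"), Region("region-b")
    publish([a, b], "app.example.com", "203.0.113.10")

    # region-b becomes unreachable: region-a still answers existing records.
    b.available = False
    print(a.resolve("app.example.com"))
```

In this model, losing one region leaves the others able to answer for every existing record, because nothing is looked up remotely at resolution time.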

Migration time

Then came the migration. As simple as it sounds, it was a massive task: moving all our customers from our two historic sites in Paris to our three new sites.

We had to enhance our tooling to provide the necessary level of automation and customer information. But above all, we had to identify everything that had been done manually for some customers over the years.

This led to a preparation phase, which explains the emails and notifications you may have received starting this summer. Each of the customers concerned was able to move their services whenever they wanted, before a deadline by which our teams would take charge of the operation.

We worked closely with a large number of our users to support them during this period, so that their services were (almost) uninterrupted. This work was orchestrated by our Head of Operations, Cédric Biron.

While this went well overall, we have since noticed that in some cases it wasn’t enough, as these alerts sent out several times by e-mail were not always seen. Over the coming weeks, we will be looking into the reasons for this, reviewing and improving our processes.

Load balancers: significant changes for Sōzu (and a few snags)

Since October, however, we’ve noticed an unusual and growing number of instabilities on the platform, involving a critical and usually stable component: our load balancer, Sōzu. Over the past few weeks, our teams have been investigating what turned out to be not one problem but several, encountered by some customers at the same time as the migration but not necessarily linked to it. These have now all been resolved.

First of all, we took advantage of this period to make a major leap forward for Sōzu, which went from its 0.13.x branch to 0.15.x, with numerous improvements, including its ability to provide us with metrics, prepare for new features such as HTTP/2 support, and so on.

Our original intention was to prepare for some of the problems we expected to encounter with increased traffic loads or growth in encrypted traffic, but also to strengthen the security of the tool.

On this last point in particular, we had initially decided to move from 2048-bit to 4096-bit RSA keys, to provide stronger encryption with no impact on compatibility for our customers. But once this improvement was implemented, the resources consumed and the time taken to process requests exploded. By the time we found the source of the problem and returned to the initial configuration, the damage had been done.
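A rough way to see why doubling the key size is so costly: the RSA private-key operation is a modular exponentiation whose modulus and exponent both grow with the key. The Python sketch below times that operation with the built-in `pow` on random odd numbers of each size. It is an illustration under simplifying assumptions, not a measurement of Sōzu: the numbers are not real RSA keys, the helper names are hypothetical, and real TLS libraries use optimizations (such as the Chinese Remainder Theorem) that change the absolute figures.

```python
import secrets
import time


def random_odd(bits: int) -> int:
    """Return a random odd integer with exactly `bits` bits (not a real RSA modulus)."""
    return secrets.randbits(bits) | (1 << (bits - 1)) | 1


def private_op_seconds(bits: int, rounds: int = 20) -> float:
    """Average time of one modular exponentiation with a full-size exponent."""
    modulus = random_odd(bits)
    exponent = random_odd(bits)
    message = secrets.randbits(bits - 1)
    start = time.perf_counter()
    for _ in range(rounds):
        pow(message, exponent, modulus)
    return (time.perf_counter() - start) / rounds


if __name__ == "__main__":
    t2048 = private_op_seconds(2048)
    t4096 = private_op_seconds(4096)
    print(f"2048-bit: {t2048 * 1e3:.2f} ms per operation")
    print(f"4096-bit: {t4096 * 1e3:.2f} ms per operation")
    print(f"ratio: {t4096 / t2048:.1f}x")
```

The exact ratio depends on the machine and the big-integer implementation, but the cost grows much faster than the key length, and on a load balancer it is paid on every new TLS handshake.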

We have since reworked this point and plan to switch to ECDSA keys. Their support is already active in Sōzu, but will only be deployed across our services after an impact study and notification of our customers.

Other changes may have had an impact when, for example, certain headers contained characters that did not meet the standard. During this upgrade, we encountered a “double bug” which took us a long time to understand and correct, and we sometimes experienced hitches in our release processes, which may have impacted our customers more than our usual quality standards allow.

We have since strengthened our processes in this area too, both in terms of preparing for the release of new versions and informing customers of new features that could potentially impact them.

The exhaustive monitoring implemented during this period has confirmed over the last few weeks that all issues have been resolved, and that even if there are still a few improvements to be made, no further incidents have been reported.

We’ll now be able to focus on what’s next, including the API rollout of a number of new features for managing domains and access to your applications.

A new stack of logs

Those paying close attention may have noticed that other elements evolved in recent months in a more or less visible way, thanks to the work of our teams to modernize entire parts of our infrastructure.

For example, a new Pulsar cluster has been put into production, enabling us to completely overhaul our logging experience.

For some time now, this functionality has been introduced in our CLI, the Clever Tools, with logs now displayed in a sorted way, color and emoji management, and the ability to filter by date/time. Further improvements are still in progress, and a new console interface is being prepared to give you better access to your application logs.

The same will apply to access logs later on, as an upgrade of Sōzu’s workflow is in progress to handle them with high performance (using Protobuf).

Setting up this new cluster will also enable us to take Pulsar out of its “Beta” status. We’ll be detailing the plan for its general availability as a final version in the coming weeks.

New foundation for our data

Another background topic, barely visible but with real significance: the production release of our first serverless cluster based on FoundationDB.

It took over from HBase, whom we thanked at a farewell ceremony held in our Brest offices on a rainy October evening.

Until then, our metrics and billing data relied on this storage technology, which required around a hundred machines. During discussions with the SenX team, behind the Warp 10 open source project, we realized that they were facing similar problems. We then worked together, which led to Warp 10 version 3.0, enabling us to successfully build a storage backend that worked hand-in-hand with our new cluster.

FoundationDB will be at the heart of some of our new products. It is already being used for demonstrators of our future Key/Value store offers (Redis compatible) and secrets storage (Vault compatible)… among others.

These solutions are currently being tested internally and with some customers. We will gradually open the doors to those interested in testing these products during 2024, with demonstrations planned at the various shows where we will be present.

Better Clever Tools

A cloud offering is nothing without its interfaces. In addition to the continuous improvement work done by our Console team, who have just upgraded their Web Components to v12, we have re-established a team responsible for the evolution and documentation of our CLI, the Clever Tools.

A new deployment process has just been finalized and will be used for their next release, with many new features planned for early 2024. This new structure will enable us to deliver more regular enhancements, in line with customer feedback, and to better manage external contributions (issues, PR, etc.).

Fine-tuning everywhere

We also completely overhauled our website after the summer, with a new interface for our documentation, which will gradually be completely reworked, including guides and tutorials.

Our team in charge of images has also reviewed the creation process in depth, in order to simplify updating, but more importantly to have a common base for all our runtimes, enabling us to make them more flexible and offer new ones in the months to come. We have also reviewed the organization of our teams and lifecycle management, so that each runtime is now handled by a designated member of staff, who will act as a link between the way we deploy applications and the communities involved.

A growing share of our APIs is migrating to /v4, for internal use, but also to meet the needs of our customers integrating Clever Cloud into their products and interfaces. Documentation of the new endpoints is underway. This year, we also began supporting Terraform, with a view to going further in the field of Infrastructure-as-Code and the management of such tools in the coming year.

In recent months, we have also delivered preliminary support for Tailscale and a Clever Edge 4G/5G offering to some customers, started onboarding others onto our WASM-based FaaS offering, and so on.

What’s Next?

Our plans remain unchanged, and now that migration and restructuring are behind us, we’ll be able to accelerate: keep iterating to improve our platform, make progress on the delivery of products currently in development, deliver a preliminary version of our Network Groups, prepare the ground for the arrival of IaaS and Kubernetes “by Clever Cloud”, and continue to take the best possible care of our customers, our first partners.

By the way, we’re recording episode 100 of our podcast Message à caractère informatique (MACI) on January 16 at the Palace in Nantes, in public. You are cordially invited to come and chat with our teams about the future of our platform and how it can better meet your needs. We’ll also be present at a number of trade shows in the first weeks of 2024.
