Christophe Rochefolle Randrianandrasana

Experienced IT executive providing tech & organization to improve #quality & #agility of IT systems, #ChaosEngineering fan ; #LGBTQ+ #DiversityInclusion

What is Chaos Engineering?

Posted on 9 December 2017 (updated 10 April 2020) by Christophe ROCHEFOLLE

Chaos Engineering is an emerging discipline that tests the robustness of a socio-technical infrastructure in order to better preserve quality of service. It consists of experimenting on a distributed system to expose gaps and weaknesses in its applications and infrastructure. As systems grow more complex and more automated, this practice becomes increasingly important for maintaining the user experience. Practiced for seven years by pure players such as Netflix, it is structured around dedicated processes and tools.

The question posed by this discipline is: how close is your system to the precipice, and could it sink into chaos?

At any moment, something, somewhere, breaks down. It is no longer a matter of crossing your fingers so that it does not happen, but of building the resilience necessary in our systems to tolerate failures, and of ensuring that this resilience is operational and reliable in production.

And yet, we test!

We have all put in place the tests needed to ensure the quality of our applications and their compliance with operability standards, so as to reach the level of confidence required by increasingly frequent deployments.

However, one of the most critical issues in testing practices is the representativeness of non-production environments.

We all face a harsh reality: keeping these environments up to date is costly and almost impossible.

The adoption of Agile & DevOps approaches complicates matters further by making it possible to deliver more and more frequently, up to several times a day.

Besides, non-production environments are and will be less and less representative when you are in the Cloud, and your infrastructure adapts to your traffic by auto-scaling.

The increasingly advanced customization of our applications – therefore, with data-based behaviors – confronts us with customer journeys that adapt in real-time.

And it’s going to become even worse with the deployment of Artificial Intelligence.

It therefore becomes necessary to experiment in production in order to have the right level of confidence in our systems.

Yes, you read that right, we are talking about testing in production:

  • Experiment to test our systems: rather than waiting for a failure, we want to introduce one to test the resilience of the system,
  • Experiment to learn: we don’t want to generate chaos for fun, but rather to discover unknown weaknesses in our systems.

This does, however, mean experimenting in production on a system that is already stable and performing well.

It is not a game, but real life, with potentially significant human and financial impacts.

The Chaos engineer is not a mad scientist but an explorer seeking knowledge of the system under study.

What is an experiment?

In the book presented at the end of the article, the Chaos engineers at Netflix propose the following protocol:

  1. Define the question we want to ask the system: do we want to test the resilience of a component, an application, or an organization?
  2. Define the scope of the experiment: does it cover all or part of production? Does it cover the technical environment alone, or does it also include human interventions (monitoring, operations, support)?
  3. Identify precisely the metrics that will validate the experiment and, if necessary, stop it instantly in the event of a critical impact.
  4. Communicate: warn the organization that the experiment is taking place, to avoid escalation in the event of a critical incident.
  5. Perform the experiment.
  6. Analyze the results and put in place any necessary action plans.
  7. Expand the scope for the next experiment.
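
As a rough illustration, the protocol above can be sketched in Python. Everything in this sketch is hypothetical: the `Experiment` class, its field names, and the `run` function are illustrative and do not come from the book or from any Netflix tool. It assumes a single numeric metric (for example an error rate) where a higher value means a worse impact.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical description of one chaos experiment (illustrative names)."""
    question: str                      # step 1: the question asked of the system
    scope: List[str]                   # step 2: targets covered by the experiment
    read_metric: Callable[[], float]   # step 3: metric that validates the run
    abort_threshold: float             # step 3: stop at once above this value
    inject: Callable[[str], None]      # step 5: introduce the failure on a target
    rollback: Callable[[str], None]    # restore the target afterwards


def run(exp: Experiment, notify: Callable[[str], None]) -> dict:
    # Step 4: warn the organization before touching anything.
    notify(f"Chaos experiment starting: {exp.question}")
    baseline = exp.read_metric()
    observations = []
    aborted = False
    for target in exp.scope:
        exp.inject(target)             # step 5: perform the experiment
        try:
            value = exp.read_metric()  # measure while the failure is active
        finally:
            exp.rollback(target)       # always restore the target
        observations.append((target, value))
        if value > exp.abort_threshold:
            aborted = True             # critical impact: stop immediately
            break
    notify(f"Chaos experiment finished (aborted={aborted})")
    # Step 6: the caller analyzes this result; step 7: widen `scope` next time.
    return {"baseline": baseline, "observations": observations, "aborted": aborted}


if __name__ == "__main__":
    # In-memory stand-in for a real system: the error rate rises while a target is down.
    down = set()
    impact = {"cache": 0.01, "payment-api": 0.20}
    experiment = Experiment(
        question="Does checkout survive the loss of one dependency?",
        scope=["cache", "payment-api", "search"],
        read_metric=lambda: sum(impact.get(t, 0.0) for t in down),
        abort_threshold=0.05,
        inject=down.add,
        rollback=down.discard,
    )
    print(run(experiment, notify=print))
```

In a real setting, `inject`, `rollback` and `read_metric` would be backed by your own infrastructure and monitoring tooling; the point of the sketch is only that the abort condition is decided before the experiment starts and is checked after every injection.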

Running a test once will reassure you about the resilience of your system. However, with constant change, the only way to sleep well at night is to automate the experiment so that it runs continuously and tracks the evolution of the system.

Example of experimentation

  • Set up Chaos Monkey: simulate failures in a real environment and verify that the system continues to function.
    The experiment consists of regularly choosing instances at random in the production environment and deliberately taking them out of service. By repeatedly “killing” instances at random, we verify that we have correctly anticipated this type of incident by setting up an architecture redundant enough for a server failure to have no impact on the service provided.
    Introduced by Netflix in early 2011, Chaos Monkey went on to join the Simian Army, which provides other types of disruption:
    – Chaos Kong (which brings down an entire Amazon region),
    – Latency Monkey (which tests tolerance to degraded performance of an external component),
    – Security Monkey (which terminates instances that present vulnerabilities), etc.
  • Set up a Gameday: to test the resilience of the organization and train it to react in the event of incidents, Jesse Robbins, ex-“Master of Disaster” at Amazon, created the concept of Gameday, which consists of simulating failures to test the teams’ ability to react and return to a nominal situation.
  • Set up Days of Chaos: a variation of GameDay created by OUI.sncf for all IT teams and aimed at training in the detection, diagnosis, and resolution of production incidents.
  • Set up Disaster Recovery Testing on a recurring basis.
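
To make the Chaos Monkey idea above more concrete, here is a minimal Python sketch of the random-termination loop. It is not Netflix's implementation and does not use the real Chaos Monkey API: `pick_victim`, `unleash`, the `groups` mapping and the `min_survivors` safety guard are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Optional


def pick_victim(groups: Dict[str, List[str]], rng: random.Random,
                min_survivors: int = 1) -> Optional[str]:
    """Pick one random instance, skipping groups too small to lose a member.

    `min_survivors` is an assumed safety guard: a group is eligible only if
    it has more than `min_survivors` instances.
    """
    candidates = [instance
                  for members in groups.values()
                  if len(members) > min_survivors
                  for instance in members]
    return rng.choice(candidates) if candidates else None


def unleash(groups: Dict[str, List[str]],
            terminate: Callable[[str], None],
            is_healthy: Callable[[], bool],
            rng: random.Random) -> dict:
    """Terminate one random instance, then check that the service still works."""
    victim = pick_victim(groups, rng)
    if victim is not None:
        terminate(victim)
    return {"victim": victim, "healthy": is_healthy()}


if __name__ == "__main__":
    # In-memory stand-in for a production environment.
    fleet = {"web": ["web-1", "web-2", "web-3"], "db": ["db-1"]}

    def terminate(instance: str) -> None:
        for members in fleet.values():
            if instance in members:
                members.remove(instance)

    # The service is considered healthy while every group keeps one instance.
    print(unleash(fleet, terminate, lambda: all(fleet.values()), random.Random()))
```

Run on a schedule, a loop like this keeps checking that the architecture really is redundant enough for the loss of one server to go unnoticed by users.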

To learn more

I recommend this free ebook from O’Reilly: Chaos Engineering, Building Confidence in System Behavior through Experiments
