Tuesday, April 26, 2011

The Human Chaos Monkey

I have never worked on a project with an automated Chaos Monkey, but I have had the great pleasure of working with a human Chaos Monkey for many years (I'm sure you know who you are :-)

This particular Chaos Monkey insisted on:

  1. Having everything load balanced/replicated even in the testing environment (both for testing and for making sure it matched the production environment as much as possible)
  2. Making sure that the replicated services worked correctly by killing one instance, waiting for it to fail over, then killing it again to make sure it failed back.
  3. Running a short stress test even for minor changes
  4. Running long, preferably over the weekend, load tests for major infrastructure changes
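
To make the kill / fail over / kill again / fail back routine from point 2 concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the tooling we actually used: `ReplicatedService`, `failover_drill` and the instance names are all hypothetical, and "kill it again" is read here as killing whichever instance took over.

```python
class ReplicatedService:
    """Toy stand-in for a load-balanced pair (hypothetical, for illustration only)."""

    def __init__(self, instances):
        # Insertion order doubles as routing preference.
        self.alive = {name: True for name in instances}

    def kill(self, name):
        self.alive[name] = False

    def restart(self, name):
        self.alive[name] = True

    def handle(self, request):
        # Route to the first healthy instance, as a very simple balancer might.
        for name, up in self.alive.items():
            if up:
                return f"{name} handled {request}"
        raise RuntimeError("no healthy instance")


def failover_drill(service, first, second):
    """Kill one instance and check failover, then kill the other and check failback."""
    served_by = []
    service.kill(first)
    served_by.append(service.handle("ping").split()[0])  # should be `second`
    service.restart(first)
    service.kill(second)
    served_by.append(service.handle("ping").split()[0])  # should be `first`
    service.restart(second)
    return served_by
```

Calling `failover_drill(ReplicatedService(["a", "b"]), "a", "b")` records which instance answered after each kill. A real drill would do the same against live instances and would also have to wait for the balancer to notice each failure.
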

The result was a rock-solid infrastructure that had close to zero downtime apart from planned maintenance windows where changes depended on other systems. The outages we did have were due to external systems replying more slowly than required, so we implemented a killer monkey that recycled COM+ processes that were taking too long.
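
The killer monkey idea can be sketched in a few lines of Python. This is not the COM+ implementation we ran. It is a generic stand-in that uses an ordinary OS process, and `run_with_deadline` and its parameters are hypothetical names chosen for the example.

```python
import subprocess
import sys


def run_with_deadline(cmd, deadline_seconds):
    """Run cmd and kill it if it outlives the deadline.

    Returns "ok" if the process finished in time, "recycled" if it was killed.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=deadline_seconds)
        return "ok"
    except subprocess.TimeoutExpired:
        proc.kill()  # the "killer monkey" step
        proc.wait()  # reap the process so it does not linger
        return "recycled"
```

For example, `run_with_deadline([sys.executable, "-c", "import time; time.sleep(30)"], 0.5)` kills the sleeping child after half a second. A production version would presumably also restart the worker and log what it was waiting on.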

Making the process clear upfront is a great help; there is no way your code is going into production until you've passed the Chaos Monkey.
