This particular Chaos Monkey insisted on;
- Having everything load balanced/replicated even in the testing environment (both for testing and for making sure it matched the production environment as much as possible)
- Making sure that the replicated services worked correctly but killing one instance, waiting for it to fail over then kill it again to make sure it failed back.
- Running a short stress test even for minor changes
- Running long, preferably over the weekend, load tests for major infrastructure changes
Making the process clear upfront is a great help; there is no way your code is going into production until you've passed the Chaos Monkey.