Chaos Engineering (for) People

This article explores how you can apply chaos engineering principles to make your team better.

Teams as distributed systems

What’s a distributed system? Wikipedia defines it as “a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another” (https://en.wikipedia.org/wiki/Distributed_computing). If you think about it, a team of people behaves like a distributed system, but instead of computers we have individual humans doing things and passing messages to one another.

  • Full-stack Python development for the backend that receives the queries about available tickets as well as the purchase orders. This also includes packaging the software and deploying it on Linux VMs. Basic Linux administration skills are required.
  • Front-end, JavaScript-based development.
  • Design. Providing artwork to be integrated into the software by the front-end developers.
  • Integration with third-party software. Often, the airline can sell a flight operated by another airline, and the team needs to maintain integration with other airlines’ systems. What it entails varies from case to case.
Figure 1. Venn diagram of skills overlap in our example team
Figure 2. Individuals on the Venn diagram of skills’ overlap
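The skills overlap that the figures depict can also be written down as a small lookup table and checked mechanically. The following is a minimal sketch, not the article's method: the person names and who holds which skill are made up for illustration, and only the skill areas themselves come from the list above. It reports every skill held by exactly one person, which makes that person a candidate knowledge single point of failure.

```python
from collections import defaultdict

# Hypothetical skills matrix. The names and assignments are illustrative
# assumptions; only the skill areas come from the example team.
TEAM_SKILLS = {
    "person_a": {"python_backend", "linux_admin"},
    "person_b": {"python_backend", "frontend_js"},
    "person_c": {"frontend_js", "design"},
    "person_d": {"third_party_integration"},
}


def skill_holders(team_skills):
    """Invert the matrix: map each skill to the set of people who have it."""
    holders = defaultdict(set)
    for person, skills in team_skills.items():
        for skill in skills:
            holders[skill].add(person)
    return dict(holders)


def single_points_of_failure(team_skills):
    """Return {skill: person} for every skill held by exactly one person."""
    return {
        skill: next(iter(people))
        for skill, people in skill_holders(team_skills).items()
        if len(people) == 1
    }


if __name__ == "__main__":
    for skill, person in sorted(single_points_of_failure(TEAM_SKILLS).items()):
        print(f"{skill}: only {person}")
```

A table like this only records what people believe about who knows what. It can suggest where to look, but it does not replace observing how the team actually copes when someone is unavailable.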

Finding knowledge single points of failure: “Staycation”

The chaos engineering way to see what happens to a team in the absence of one member is to simulate the event and observe how the team copes. The most lightweight variant is to nominate one person and ask them not to answer any queries related to their responsibilities and, for the day, to work on something other than what they normally would. Hence the name staycation. It’s a game: should an emergency arise, it’s called off and all hands are on deck.

  • You can tell the other team members about the experiment, or not. Telling them gives them an advantage because they can proactively think about the things they won’t be able to resolve without the person on staycation. Telling them only after the fact is closer to a real-life situation, but might be seen as a distraction. You know your team, so do what you think works best.
  • Choose your timing wisely. If the team is working hard to meet a deadline, they might not enjoy playing games that eat up their time. Or, if they’re competitive, they might like it, and with more things going on, there might be more potential for knowledge-sharing issues to arise.

Misinformation and trust within the team: “Liar, liar”

In a team, information flows from one team member to another. There needs to be a certain amount of trust between members for effective cooperation and communication, but also a certain amount of distrust in order to double-check and verify things, instead of taking them at face value. After all, to err is human.

  • The liar’s acting skills are useful here. If they can keep it up for the whole day, without spilling the beans, it should have a pretty strong “wow” effect with other team members.
  • You might want to let another person on the team know about the liar, so they can observe and step in if the situation might have consequences that weren’t anticipated. At a minimum, the team leader should always know about this!

Bottlenecks in the team: “Life in the slow lane”

The next game, “life in the slow lane,” is about finding who’s a bottleneck within the team in different contexts. In a team, people share their respective expertise to propel the team forward, but everyone has a maximum throughput of what they can process. Bottlenecks form where some team members need to wait for others before they can continue with their work. In the complex network of social interactions, it’s often difficult to predict and observe these bottlenecks until they become obvious.

  • Going silent when others are asking for help looks suspicious, might make you uncomfortable, and can even be seen as rude. Responding to the queries with something along the lines of “I’ll get back to you on this, sorry I’m busy with something else right now” might help greatly.
  • Sometimes resolving the bottlenecks you find is the tricky bit. There might be policies in place, cultural norms, or other constraints to take into account, but even knowing about the potential bottlenecks can help you plan ahead.
  • Sometimes, the manager of the team is a bottleneck for some things. It might require a little more self-reflection and maturity to react to that, but it can provide invaluable insights.
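If you jot down who waited for whom during the game, a simple tally can make the bottlenecks easier to see. The sketch below is one possible way to do that, under assumptions of our own: the record format (who waited, who they waited for, hours waited), the names, and the numbers are all hypothetical and not taken from a real team.

```python
from collections import Counter

# Hypothetical observations collected during the game:
# (who waited, who they waited for, hours waited)
WAITS = [
    ("person_b", "person_a", 3.0),
    ("person_c", "person_a", 2.5),
    ("person_a", "person_d", 1.0),
    ("person_c", "person_b", 0.5),
]


def hours_blocked_on(waits):
    """Sum the hours other people spent waiting on each team member."""
    totals = Counter()
    for _waiter, blocker, hours in waits:
        totals[blocker] += hours
    return totals


def likely_bottlenecks(waits, top=1):
    """Return the `top` people that others waited on the longest."""
    return [person for person, _ in hours_blocked_on(waits).most_common(top)]


if __name__ == "__main__":
    for person, hours in hours_blocked_on(WAITS).most_common():
        print(f"{person}: others waited {hours} h")
```

Total waiting time is only one possible measure. Counting the number of distinct people blocked, or splitting the tally by type of request, might tell a different story about the same day.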

Testing your processes: “Inside job”

Your team, unless it was formed earlier today, has a set of rules for dealing with problems. These rules might be well-structured and written down, they might be tribal knowledge in the collective mind of the team, or, as is the case for most teams, they might be somewhere in between the two. Whatever they are, these “procedures” for dealing with different types of incidents should be reliable. After all, this is what you rely on in stressful times. How do you think we could test them out?

  • Pick the inside group wisely. You might want to let the stronger people on the team in on the secret, and let them “help out” by letting the other team members follow the procedures to fix the issue.
  • It might also be a good idea to send some people to training or a side project, to make sure that the issue can be solved even with some people out.
  • Double-check that the existing procedures are up to date before you break the system.
  • Take notes as you observe the team react to the situation. See what takes up their time, what part of the procedure is prone to mistakes, and who might be a single point of failure during the incident.
  • It doesn’t have to be a serious outage. It might be a moderate severity issue, which needs to be remediated before it becomes serious.
