For junior engineers and recent hires, joining the on-call rotation can be a dreaded part of onboarding to a new team. Unfamiliarity with the company's systems and processes makes them anxious about handling potential emergencies effectively and efficiently. This can be particularly challenging on teams that own large distributed systems with significant technical debt and infrequently modified services. One common strategy to address these concerns is to assign maintenance tasks on less frequently modified services, so that the entire team gains familiarity with the full scope of the team's domain. Another way to build confidence is to keep documentation regularly refreshed and readily available. Many teams also hold knowledge transfer sessions. Reverse knowledge transfer sessions, in which new on-call team members explain the service to more experienced engineers, are another common tool for building the team's competency in supporting its services. These practices can help ensure team members are comfortable with the responsibilities of being on call for a given service. 1
Beyond these basic exercises, many engineering teams have found conducting "fire drills" to be a valuable technique for preparing engineers to join the on-call rotation. In a fire drill, the team lead assigns an engineer the responsibility of temporarily degrading services in a non-production environment at a specific time. This gives the new team member(s) the opportunity to fix a realistic problem in a safe environment with team support.
This concept can also help determine the feasibility of implementing chaos engineering “experiments” to better test your system's availability and resilience. Chaos engineering is a technique of intentionally breaking systems to test their resilience and identify potential failures before they cause an outage. It involves running experiments in production environments to see how the system reacts under stress or failure conditions. Chaos engineering differs from traditional software testing in that it focuses on the behavior of the entire system rather than individual components. Major tech companies like Netflix, Amazon, and Google have adopted chaos engineering principles. Core principles of chaos engineering, such as building hypotheses, simulating real-world events, and running experiments in production, align well with the process of running a fire drill, even though the drill itself takes place in a non-production environment.
In chaos engineering, the first step is to understand your system's steady-state behavior. Then, you hypothesize the effects of various degradative actions on latency, services, platforms, or other system components. This allows you to validate that when the system is in a particular state, it responds with specific actions and performs according to expected metrics. Similarly, a fire drill involves formulating a baseline understanding of your system's steady-state behavior and then choosing a specific degradation to run for the educational benefit of the team. Using the data gathered during the drill, the team can learn how to resolve the issue and develop confidence in using logs and other tools to assess abnormal behavior and spot opportunities for service improvement. The team may also decide to build automation that runs similar service degradation events after new deployments to non-production environments, in line with chaos engineering principles.
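As a rough illustration of the steady-state and hypothesis steps, the following Python sketch compares metrics from a drill window against bounds the team agreed on beforehand. The class names, thresholds, and sample numbers are hypothetical and chosen only for illustration; in practice the observations would come from whatever monitoring tooling the team already uses.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Sequence


@dataclass(frozen=True)
class Hypothesis:
    """Bounds the team expects to hold during the drill (illustrative names)."""
    max_p95_latency_ms: float
    max_error_rate: float


@dataclass(frozen=True)
class Observation:
    """Metrics collected over one time window, e.g. exported from a dashboard."""
    latencies_ms: Sequence[float]
    error_count: int
    request_count: int


def p95(samples: Sequence[float]) -> float:
    # quantiles(..., n=20) returns 19 cut points; the last is the 95th percentile.
    return quantiles(samples, n=20)[-1]


def error_rate(obs: Observation) -> float:
    # Treat a window with no requests as having no errors.
    return obs.error_count / obs.request_count if obs.request_count else 0.0


def hypothesis_holds(hypothesis: Hypothesis, obs: Observation) -> bool:
    """Return True when the observed window stays within the hypothesized bounds."""
    return (
        p95(obs.latencies_ms) <= hypothesis.max_p95_latency_ms
        and error_rate(obs) <= hypothesis.max_error_rate
    )


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    hypothesis = Hypothesis(max_p95_latency_ms=500, max_error_rate=0.01)
    steady_state = Observation([100] * 20, error_count=0, request_count=20)
    during_drill = Observation([100] * 10 + [900] * 10, error_count=3, request_count=20)
    print("steady state within bounds:", hypothesis_holds(hypothesis, steady_state))
    print("drill window within bounds:", hypothesis_holds(hypothesis, during_drill))
```

Writing the expected bounds down before the drill gives the team something concrete to review afterwards: a violated hypothesis is either a finding about the service or a sign that the baseline was not well understood.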
A classic source of degradation that many engineering teams already have prepared is a load test. A great first fire drill can simply be running a set of load tests against an under-scaled deployment of your service(s). A team member can role-play as an external stakeholder to give the new engineer the chance to practice communication and stakeholder management as the team troubleshoots the issue. This exercise not only gives new or junior developers an opportunity to troubleshoot and familiarize themselves with logs under the mentorship of experienced engineers in a low-stress environment, it can also reveal shortcomings in the scalability of the service(s), which make useful technical debt items for the team. Modifying service configurations to simulate a key expiration or to disable access to a third-party system are other degradations that are trivial to set up, yet because they involve infrequently modified code or configuration, they may be difficult for inexperienced engineers to troubleshoot. For a more sophisticated fire drill, consider routing third-party API calls through a proxy service configured to introduce additional latency. However, in our experience, simply introducing sudden load to under-provisioned services by replaying traffic patterns harvested from logs, or intentionally pushing a malfunctioning build to a test environment, can be an excellent learning experience that builds team confidence without requiring excessive preparation.
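To make the latency-injection idea concrete, here is a minimal Python sketch. It is an in-process stand-in rather than a real network proxy: it wraps a function that calls a third-party API and delays each call by a random amount. The function names and delay values are hypothetical, and a wrapper like this should only ever be enabled in a non-production environment.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_injected_latency(
    call: Callable[..., T],
    min_delay_s: float,
    max_delay_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., T]:
    """Wrap `call` so every invocation waits a random delay before running.

    `sleep` is injectable so the wrapper can be exercised without real waiting.
    """

    def wrapped(*args, **kwargs) -> T:
        delay = random.uniform(min_delay_s, max_delay_s)
        sleep(delay)  # simulate a slow third-party dependency
        return call(*args, **kwargs)

    return wrapped


def fetch_exchange_rate(currency: str) -> float:
    # Hypothetical stand-in for a real third-party API call.
    return {"EUR": 1.1, "GBP": 1.3}.get(currency, 1.0)


if __name__ == "__main__":
    # Fire drill setup: every call now takes an extra 50 to 200 ms.
    slow_fetch = with_injected_latency(fetch_exchange_rate, 0.05, 0.2)
    started = time.perf_counter()
    rate = slow_fetch("EUR")
    elapsed_ms = (time.perf_counter() - started) * 1000
    print(f"rate={rate}, call took {elapsed_ms:.0f} ms")
```

A dedicated proxy service achieves the same effect without touching application code, which keeps the drill closer to a real incident; the wrapper above is simply the smallest way to try the idea.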
Having a team member play the role of an external stakeholder, and practicing designating a person to handle communications during a major outage, can help the team feel prepared to handle a high-profile service impairment professionally and effectively. Communicating updates to stakeholders can be an intimidating aspect of on-call duties for inexperienced engineers, and even for senior developers without a background in operations work or experience supporting customer-facing services.
If your team supports customers directly, a team member can also join the fire drill call to role-play a customer and create a more realistic practice on-call experience. Once the fire drill scenario is complete, the team should conduct a post-mortem following DevOps and SRE best practices to review and collect feedback, identify processes in need of improvement, and reflect on the experience. 2
A fire drill can reveal opportunities for the team to improve its documentation, logging, or logging dashboards. A fire drill can also involve members of the team role-playing as stakeholders on external teams, allowing new or experienced engineers to practice coordinating the remediation of an issue with those external resources. It is important for the fire drill to include a degree of surprise so that engineers put themselves into the mindset of a real incident. However, the team should be notified beforehand so they are aware of the impending exercise, understand the rules of engagement, and know what to do when stuck. Senior engineers should be directed to stay passive, nudging less experienced engineers when asked but allowing them to lead, so as to ensure they gain experience and confidence in independently resolving issues.
To summarize, a fire drill is a great exercise for engineering teams that support high-availability applications and have an influx of new or inexperienced engineers. It is also a way to share institutional knowledge in a fun and engaging manner that paves the way for the implementation of chaos engineering practices. 3
Resources
1 https://sre.google/workbook/incident-response/
2 https://sre.google/sre-book/postmortem-culture/
3 https://principlesofchaos.org
Author Bios
Zach Phillips-Gary is a Senior Site Reliability Engineer at Jahnel Group, Inc., a custom software development firm based in Schenectady, NY. At Jahnel Group, we're passionate about building amazing software that drives businesses forward. We're not just a company - we're a community of rockstar developers who love what we do. From the moment you walk through our door, you'll feel like part of the family. To learn more about Jahnel Group's services, visit jahnelgroup.com or contact Jon Keller at jkeller@jahnelgroup.com