Handling Incidents: My Trials By Fire

Incidents, postmortems, and sevs: all names for software failure, downtime, or data loss. Neither college nor internships provide experience with such incidents. My very first incident came two weeks into my first job, with no preparation or training. From there, each incident felt like a dark mark on my coding ability, and over time I accumulated mental load and guilt.

Startups Move Fast and Break Things Fast

Working at a startup is incredible for career growth. I am given tasks across the stack and work with phenomenal engineers. A major consequence of being in a startup, though, is that the codebase is not optimized for developer experience.

I've picked up buggy and incomplete features, owning the resulting incidents. Other times, I've been assigned work that touches fragile features where small changes cause cascading bugs. Perhaps for the founding engineers and other experienced developers these are obvious issues, but they aren't for a new hire.

I thought my team saw me as the engineer who broke production with dumb bugs. Over time I developed impostor syndrome and lost sleep. I became paranoid about my code. There was little time for testing and even less to figure out where bugs might hide.

This fear turned into anxiety. I was constantly monitoring the on-call and sev channels in fear of incidents. If I was going to take down a service or cause a customer-facing bug, I wanted to know as soon as possible. Then I started hating my job, all within my first year.

Beyond the Shame, Moving Forward to Growth

It wasn't until a senior engineer and tech lead joined my team that I learned the incidents weren't just my fault. Most of the incidents were the result of accumulated technical debt and untested code finally surfacing. Having experienced engineers struggle alongside me helped me stop blaming myself for the incidents. We all suffered collectively while working toward fixing them, never blaming ourselves for stepping on hidden rakes. The codebase was fragile in many areas and would continue to be so unless we took the time to clean it up.

Roughly a month ago, I went on leave to handle a medical problem. On the first day of leave, some latent bugs I had introduced finally surfaced, causing a wide range of incidents. My teammates were able to handle most of these bugs and bear the brunt of the incidents while I watched from home.

At the same time, I was reading through the Google SRE book and had landed on Chapter 15: Postmortem Culture. This chapter covers how Google writes and handles postmortems, its written record of what we would call a sev or incident.

Two major ideas from the chapter reinforced my view on incidents:

Blameless Postmortem Culture
Strengthening the System

Blameless Postmortem Culture

Blameless postmortem cultures are meant to encourage people to report incidents and write postmortems more frequently. To make the system robust and prevent future incidents, there must be people willing to discuss them. When someone feels at fault or fears being penalized for bringing an incident to light, fewer people will speak up, and the underlying problems will remain.

More often than not, the root cause isn't limited to landing bugs. It includes poor testing culture, unnecessarily rushed timelines, and increasingly complex systems. Some of these can be addressed by improving individual engineers, but certainly a good engineering organization should build safeguards to prevent any one engineer from causing large incidents.

Learning that these safeguards did not exist when I was working on my projects helped me stop carrying the shame of my incidents. Instead of seeing these incidents as failures on my part, I can choose to see them as opportunities to make the system resilient and robust, closing the holes I fell into. Unfortunately, sometimes someone needs to fall into a hole for others to notice it. What matters going forward is patching these holes and preventing the next engineer from falling into them.

Strengthening the System

Handling sevs gave me deep familiarity with our stack. I learned to quickly read logs and reproduce bugs. These skills are invaluable and let me remediate new incidents and bugs sooner. I also developed a nose for fragile parts of the system, the most noticeable being validations that live only in the backend or only in the frontend, complex functions that try to do several different things at once, and low test coverage.
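
To make the validation problem concrete, here is a minimal sketch in Python (chosen for brevity). It assumes a hypothetical signup form; the names SignupForm, validate_signup, and handle_signup_request, along with the rules themselves, are illustrative and not taken from any real codebase. The idea is that the rules live in one small, testable function that every entry point calls, rather than in only one layer.

```python
import re
from dataclasses import dataclass

# Hypothetical rules for a signup form; the field names and limits are
# illustrative assumptions, not taken from any real codebase.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
MAX_DISPLAY_NAME_LENGTH = 40


@dataclass(frozen=True)
class SignupForm:
    email: str
    display_name: str


def validate_signup(form: SignupForm) -> list[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    if not EMAIL_PATTERN.match(form.email):
        errors.append("email is not a valid address")
    name = form.display_name.strip()
    if not name:
        errors.append("display name is required")
    elif len(name) > MAX_DISPLAY_NAME_LENGTH:
        errors.append("display name is too long")
    return errors


def handle_signup_request(payload: dict) -> tuple[int, dict]:
    """Hypothetical API handler: rejects bad input even if the UI was bypassed."""
    form = SignupForm(
        email=str(payload.get("email", "")),
        display_name=str(payload.get("display_name", "")),
    )
    errors = validate_signup(form)
    if errors:
        return 400, {"errors": errors}
    return 201, {"email": form.email, "display_name": form.display_name.strip()}
```

A form-rendering layer could call the same validate_signup to show errors before submitting, so the two sides cannot drift apart. Because the function is small and does one thing, it is also cheap to cover with tests.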

From here, the team lead and I have worked together to develop multiple plans to improve the fragile parts of our system. Along the way, I've learned that I enjoy product infrastructure work! I've also owned other incidents unrelated to my work and helped with follow-ups. In improving the current system and becoming more familiar with incidents, I've cleared my mind of the shame. Now I can manage the pressure and leave behind (most of) the guilt.

Every incident is a point of growth. I am lucky to work somewhere where we have the ability to make the system stronger. For that reason, I want to continue to improve the codebase both for myself and for my teammates.

The Google SRE book is a wonderful introduction to the world of Site Reliability Engineering. While I am not particularly interested in becoming an SRE, many SRE practices apply to engineers at startups. I highly recommend reading it online for free or purchasing a physical copy.