It is 2:17 AM on a Tuesday. You are asleep. Then your phone buzzes — once, twice, then a third time in rapid succession. You pick it up. The first message is from an automated monitor. The second is from an on-call engineer. The third is from your VP, and it just says: "Call me."
Something has broken. Not a small thing. Something big enough that your VP is awake and typing on their phone at 2 AM. Payments are failing, or the user database is corrupted, or the entire platform has been down for 40 minutes and a major customer is threatening to cancel. Whatever it is, it is real, it is urgent, and every minute that passes is costing the company money, trust, or both.
This chapter is about that moment. What do you do first? How do you lead a team under that kind of pressure? How do you communicate to leadership without adding to the chaos? When does an incident stop being a firefight and start being a project — and how do you manage that transition without dropping the ball on either?
Most engineers learn crisis execution the hard way: by being thrown into one with no framework, surviving by instinct, and walking away shaken. This chapter gives you that framework before the crisis arrives, so that when it does, you recognize the situation and know exactly which protocol to run.
A crisis is not a different kind of work. It is the same work — coordination, communication, decision-making — done faster, with less information, and with a hundred people watching.
The Moment Everything Breaks: What Actually Happens to Your Brain
Before we talk about process, we need to talk about biology. When a crisis hits, your brain does something predictable and deeply unhelpful: it floods with adrenaline, narrows your attention to the most visible symptom, and pulls you toward action — any action — because action feels like control.
This is called the stress response, and it evolved to help you run from predators. It is spectacularly bad for diagnosing distributed systems failures.
The stress response causes three specific thinking errors that kill incident response:
Error 1: Anchoring on the First Diagnosis
Someone says "looks like the database is slow" in the first five minutes, and from that point forward, the entire team investigates the database — even when evidence starts pointing elsewhere. The first hypothesis, spoken aloud, becomes the shared assumption. Everyone builds on it. Nobody challenges it. Thirty minutes later, you realize the database is fine and the real problem is a memory leak in a service you never looked at.
This is anchoring. It happens in every crisis, and it is invisible while it is happening because the first hypothesis always sounds plausible — that is why someone said it.
Error 2: Action Bias
When something is broken and people are watching, the instinct is to do something — restart a service, roll back a deployment, increase capacity. Doing nothing feels irresponsible even when doing nothing is exactly right. The result is a team taking actions they haven't fully thought through, which changes the system's state in ways that make the original problem harder to diagnose.
Every untested action during a crisis is a variable you just added to an already-confusing equation.
Error 3: Everyone Becomes a Responder
Senior engineers flood the incident channel. Everyone has a theory. Everyone is running queries, restarting pods, making suggestions. The signal-to-noise ratio drops toward zero. The person who is supposed to be leading the diagnosis now spends half their cognitive energy parsing Slack messages instead of thinking.
This is the most common failure mode in technically strong teams. They are full of smart people who all want to help, and that collective desire to help creates a coordination disaster.
The Real Danger
The smartest person in the room is usually the most dangerous during a crisis. They have the most confident hypotheses, the most direct access to production systems, and the most difficulty sitting still while others work. If that person does not have a formal role, they will fill every vacuum — and fill it with noise.
The antidote to all three errors is the same thing: structure. Specifically, a clear role separation and a clear decision protocol that you put in place before the chaos starts, not after.
The Incident Commander: One Person Owns the Room
The most important structural decision you can make at the start of any serious incident is this: one person is the Incident Commander (IC), and everyone else is in a supporting role. Not two people. Not the most senior person automatically. One person, explicitly designated, whose job is fundamentally different from everyone else's.
This idea comes from emergency response — firefighters, air traffic controllers, hospital trauma teams. These fields figured out decades ago that when stakes are high and time is short, diffused leadership is more dangerous than centralized leadership, even imperfect centralized leadership.
What the IC Does (and Doesn't Do)
Here is the most counterintuitive part: the Incident Commander does not investigate the incident. They do not look at dashboards, write queries, or try to find the root cause. That is the investigator's job.
The IC does one thing: they manage the process. They track what is being investigated, what conclusions have been reached, what actions have been taken, and what the current best theory is. They assign work to specific people. They make the call when a decision point is reached. They communicate with stakeholders. They keep the response from collapsing into chaos.
The IC owns:

- The timeline — what happened and when, in the order it was discovered. This is the single most valuable artifact of any incident response. The IC keeps it updated in real time.
- Work assignment — who is investigating what, right now. At any moment, the IC should be able to say: "Alice is on the database angle, Bob is looking at the service logs, Carlos is checking the deployment history."
- The decision to act — when someone proposes a remediation, the IC decides whether to proceed. Not the on-call engineer, not the most senior person, not whoever shouts loudest. The IC.
- External communication — updates to leadership, customers, and other teams. Nobody else sends status updates unless the IC delegates this explicitly to a Comms Lead.

The IC does not own:

- The technical investigation itself. The IC stays out of the weeds so they can maintain a view of the whole system. The moment the IC starts running queries, they stop managing the response.
- Personal theories about root cause. The IC synthesizes others' findings. They do not advocate for their own hypothesis — this would unconsciously bias the investigation toward confirming it.
Who Should Be the IC?
Not necessarily the most senior engineer. The IC role requires a specific skill set: the ability to maintain a calm, structured view of a situation while everyone around you is anxious and scattered. Some senior engineers are excellent at this. Others, especially those who built the system and have strong technical opinions about it, are better deployed as investigators.
The best IC is the person who can most easily resist the urge to jump into the technical work. That sounds like a low bar. In practice, it is one of the hardest things to do in a crisis.
A Pattern You Will Recognize
A team has a major outage. The most senior engineer — call him David — is the de facto IC. For the first ten minutes, David runs the incident well: assigns people to investigate different areas, keeps a timeline, stays calm.
Then someone mentions the database. David built that database three years ago. He knows its failure modes better than anyone. He immediately opens a terminal and starts looking at query plans.
For the next 25 minutes, David is an investigator. Nobody is coordinating the other investigators. Two people start working on the same thing without knowing it. A third person finds something important and posts it in Slack, but nobody acknowledges it because David — the person who would normally synthesize findings — is staring at query plans.
The incident takes 45 minutes longer to resolve than it should have, because the team lost its coordinator the moment a familiar problem appeared.
The rule is simple: if you become the IC, you give up your terminal. You give up your opinions about root cause. You become the person who manages the room, not the person who solves the puzzle. If you cannot do that — if the urge to investigate is too strong — hand the IC role to someone else and become an investigator. Both roles are valuable. Mixing them is what causes disasters.
The Three-Track Response: Running Parallel Threads Without Losing Them
The rookie mistake in incident response is treating it as a sequential process: first diagnose, then decide on a fix, then implement the fix, then verify. This feels organized. It is also very slow, because in a real incident, you rarely have clean handoffs. You are getting new information continuously, your diagnosis is being revised in real time, and you often have multiple possible remediations you want to evaluate simultaneously.
Expert incident commanders run three tracks in parallel from the moment the severity of the incident is clear.
Track 1: Diagnosis
This is what most people think of as "the incident response." It is the technical investigation — looking at metrics, logs, traces, recent deployments, code changes, and anything else that might explain the symptoms.
The key discipline in diagnosis is structured skepticism. The IC should be constantly asking: "What would prove the current theory wrong?" Not to be difficult, but because the fastest path to the correct diagnosis is the one that eliminates wrong answers quickly rather than building elaborate cases for plausible-sounding ones.
Good investigators run down their hypothesis until they find evidence that either confirms it or rules it out, then report back. The IC synthesizes across all active investigations and updates the shared current best theory accordingly.
Track 2: Mitigation
Even before you know the root cause, you should be asking: is there anything we can do right now to reduce the impact? Not fix the problem — mitigate it. Reduce the blast radius. Protect the highest-priority users. Shed load from the affected service.
Mitigation and diagnosis are different problems and can often be pursued in parallel. You do not need to know why the database is slow to decide to route read traffic to a replica. You do not need to know the root cause of a memory leak to decide to restart the service every 30 minutes as a temporary measure.
The Mitigation Principle
Mitigation reduces user pain while your diagnosis continues. It is not a substitute for fixing the root cause — it is the thing that buys you time to find and fix the root cause without the entire company watching every minute tick by. Always separate the question "what can we do right now to reduce impact?" from the question "what do we need to fix permanently?"
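To make the split concrete, here is a minimal Python sketch of the read-replica mitigation mentioned above, assuming a hypothetical environment-variable switch and placeholder connection strings invented for illustration. Flipping the switch reduces load on the struggling primary; it does not fix whatever is slowing the primary down.

```python
import os

# Placeholder connection strings; in a real system these come from your config.
PRIMARY_DSN = os.environ.get("PRIMARY_DSN", "postgres://primary.internal/app")
REPLICA_DSN = os.environ.get("REPLICA_DSN", "postgres://replica.internal/app")


def reads_to_replica() -> bool:
    """Mitigation switch the IC can ask someone to flip; off by default."""
    return os.environ.get("INCIDENT_READS_TO_REPLICA", "false").lower() == "true"


def read_dsn() -> str:
    """Choose where read-only queries go while the diagnosis track continues."""
    return REPLICA_DSN if reads_to_replica() else PRIMARY_DSN
```

The point is the separation: the switch answers "what reduces impact right now," and it can be turned on and off independently of the permanent fix.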
Track 3: Communication
While diagnosis and mitigation are running, someone must be managing the information flow. This is the track that most technical teams neglect, and it is the one that often determines whether leadership loses confidence in the team.
In a significant incident, multiple people outside the immediate response team need information:
- Your direct manager needs to know the severity, your current understanding, and your plan so they can shield you from distractions and brief their manager.
- VP-level leadership needs a brief, factual status at regular intervals — what is affected, what you know, what you are doing, when you expect to know more.
- Customer-facing teams (support, sales, account management) need information in plain language so they can respond to customer inquiries without making things worse by speculating.
- Other engineering teams whose systems depend on yours may need to know they should expect degraded service and plan accordingly.
If communication is not owned explicitly, it either does not happen (leadership gets anxious and starts asking questions that interrupt your investigators) or it happens inconsistently (three different people send three different status updates with three different levels of severity, and now leadership does not know which one to believe).
The First 15 Minutes: A Precise Protocol
The first 15 minutes of a serious incident are the most chaotic and the most important. What you do — and specifically what you set up — in those 15 minutes determines how the next several hours go. Here is an explicit protocol, not as a rigid script, but as a mental checklist that prevents the most common failures.
0–2 minutes: Establish severity and designate the IC. Someone — it might be you, it might be the on-call — makes the call: is this a P0 (total service outage), a P1 (major functionality broken), or a P2 (degraded)? Severity determines how many people you pull in and how fast you communicate up. Then say explicitly who the IC is: "I'm taking IC" or "Sarah, you have IC." This five-second act prevents the next 30 minutes of implicit confusion.

2–5 minutes: Open a dedicated coordination channel. Create a Slack channel or a dedicated incident thread. Everything goes there. Not in DMs, not in random channels, not in the general engineering channel. One place. Anyone who joins the response later can scroll up and get full context. The IC posts the first status message immediately: timestamp, what is known, severity, who is working it.

5–8 minutes: Start the timeline document. A shared doc, open to everyone in the response. The IC or a dedicated scribe starts recording: what the first alert was, what the first symptom was, what changed recently (deployments, config changes, traffic changes), and what the current state of the system is. This document will save you during the post-mortem. More importantly, keeping it forces the IC to synthesize everything, which immediately surfaces gaps in the current understanding.

8–12 minutes: Assign specific investigation roles. Not "everyone look at what might be causing this." Instead: "Alice, you're on service metrics and recent deployments. Bob, you're on database query performance and connection counts. Carlos, you're on dependent services — check whether anything upstream changed." Explicit assignments eliminate the duplicated effort and the uncovered ground that result when everyone investigates whatever they feel like investigating.

12–15 minutes: Send the first external status update. To your manager, and to any customer-facing teams. This first message does not need to include root cause — you do not know it yet. It needs to include: what users are experiencing, when you first detected it, who is working on it, and when you will send the next update. The regularity of updates matters more than the depth of information in any single update.
Notice what is not in the first 15 minutes: implementing fixes. You will be tempted. The adrenaline will push you toward action. Resist. The cost of acting on a wrong diagnosis is high — you change the system's state, the problem either persists (now harder to diagnose) or gets worse, and you have consumed the team's execution budget on a dead end. Fifteen minutes of disciplined diagnosis almost always pays for itself in time saved on remediation.
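As a concrete illustration of the first two steps, here is a minimal Python sketch, assuming a hypothetical incident-channel webhook URL and field names of my own choosing, that declares severity, names the IC, and posts the first status message in one place.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum

import requests  # assumes the `requests` package is installed

INCIDENT_WEBHOOK = "https://hooks.example.com/incident-channel"  # placeholder URL


class Severity(Enum):
    P0 = "total service outage"
    P1 = "major functionality broken"
    P2 = "degraded"


@dataclass
class FirstStatus:
    severity: Severity
    incident_commander: str
    known_impact: str
    responders: list[str]

    def render(self) -> str:
        now = datetime.now(timezone.utc).strftime("%H:%M UTC")
        return (
            f"[{now}] {self.severity.name} declared ({self.severity.value}). "
            f"IC: {self.incident_commander}. "
            f"Known impact: {self.known_impact}. "
            f"Working it: {', '.join(self.responders)}."
        )


def post_first_status(status: FirstStatus) -> None:
    # One message, one channel; anyone joining later scrolls up from here.
    requests.post(INCIDENT_WEBHOOK, json={"text": status.render()}, timeout=5)
```

The exact fields matter less than the habit: timestamp, severity, IC, known impact, and who is working it, posted before any investigation output.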
Communication During a Crisis: What to Say, When, and to Whom
Communication in a crisis is a skill that looks easy until you try it. The difficulty is not finding the right words — it is knowing what level of detail to give to which audience, how to convey confidence without making promises you cannot keep, and how to send updates at a pace that keeps people informed without becoming the thing that interrupts your team every ten minutes.
The Audience Stack
Think of your communication during a crisis as three distinct audiences, each with different needs and different tolerance for technical depth.
| Audience | What They Need | Update Frequency | Right Level of Detail |
| --- | --- | --- | --- |
| Response team | Current best theory, work assignments, decisions made, timeline | Continuous (the incident channel) | Full technical depth — this is your team, they need everything |
| Your manager / engineering leadership | Severity, current state, the team's plan, an honest estimate of resolution time | Every 20–30 min for P0, every 45 min for P1 | Technical summary — they understand systems but don't need query-level detail |
| VP / executive layer | Business impact, whether it is getting better or worse, and one clear next action | Every 30–60 min, only if they ask or it's a major P0 | Non-technical plain language — "users cannot check out," not "the payment service is returning 503s" |
| Customer-facing teams | What users are experiencing, status, honest ETA, a scripted customer response | At each major status change | Plain language only — no technical speculation, nothing that could become a public statement without review |
The Anatomy of a Good Status Update
A good status update during a crisis has five elements, and they should appear in this order:
1. What is affected right now. Not what broke originally, but the current user-facing impact. "Checkout is down," or "Checkout is degraded — 40% of users are affected," or "Checkout is restored but we are still monitoring."

2. What you know about why. One sentence, clearly marked as preliminary if it is not confirmed. "We believe the issue is a memory leak in the order service introduced in the 3pm deployment." Or, honestly: "Root cause is still under investigation." Do not speculate in writing to a wide audience.

3. What you are doing about it. "We have rolled back the 3pm deployment and are monitoring." Or: "We are rolling back now, expect resolution in 10 minutes." Or: "We are still investigating and have not yet identified a remediation path."

4. When you will update next. A specific time, not "soon." "Next update in 20 minutes or sooner if status changes." This one commitment does more for leadership confidence than almost anything else. It means they do not need to ask you — they know an update is coming.

5. Who to contact with questions. One person. Not "the team." Not "anyone in this channel." One specific name. This is usually the IC or the comms lead, and it exists to prevent your investigators from being interrupted by questions from people outside the response.
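A minimal sketch of the same five elements as a reusable template, in Python, with field names and example values that are my own rather than anything prescribed:

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    """The five elements, in the order the reader should encounter them."""
    current_impact: str        # 1. what is affected right now
    current_theory: str        # 2. what you know about why
    actions_in_flight: str     # 3. what you are doing about it
    next_update_minutes: int   # 4. when you will update next
    contact: str               # 5. one name, not "the team"
    theory_confirmed: bool = False

    def render(self) -> str:
        theory = self.current_theory if self.theory_confirmed else f"(preliminary) {self.current_theory}"
        return "\n".join([
            f"Impact: {self.current_impact}",
            f"Why: {theory}",
            f"Doing: {self.actions_in_flight}",
            f"Next update: in {self.next_update_minutes} minutes, or sooner if status changes",
            f"Questions: {self.contact}",
        ])


# Example: an update a VP can read in ten seconds.
print(StatusUpdate(
    current_impact="Checkout is degraded; roughly 40% of users are affected",
    current_theory="memory leak in the order service from the 3pm deployment",
    actions_in_flight="rolling back the 3pm deployment, monitoring for recovery",
    next_update_minutes=20,
    contact="Sarah (IC)",
).render())
```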
The Trap of Over-Communication
There is a version of over-communication that feels responsible and is actually harmful. It happens when the IC sends highly detailed technical updates to non-technical stakeholders in an attempt to show that the team is working hard and on top of things. The non-technical stakeholder reads a wall of technical text, cannot parse it, and concludes that the team is confused and panicking.
More words do not equal more confidence. Simpler, shorter, more frequent updates almost always build more trust than long, detailed, irregular ones. If a VP needs to know more, they will ask. Your job is to keep them informed enough that they are not anxious, not to give them a dissertation on distributed systems debugging.
What to Do When You Do Not Know the Answer
The hardest communication moment in a crisis is when leadership asks "when will this be fixed?" and you genuinely do not know. The wrong answer is to give a number you made up to end the conversation. That number will be quoted back to you, it will almost certainly be wrong, and it will erode your credibility.
The right answer is honest and structured: "We do not have a root cause yet, so we cannot give a reliable ETA. What I can tell you is: if our current theory is correct and we can roll back the deployment, we expect resolution within 15 minutes. If the rollback does not help, we are looking at a longer investigation. I will update you in 20 minutes with either a resolution or a clearer picture of the timeline."
This is not hedging. It is accurate communication under genuine uncertainty. Most leaders, when they understand that you are being honest about what you know and don't know, appreciate it more than false confidence.
When the Incident Becomes a Project
Some incidents resolve in 30 minutes. A bad deployment, a rollback, and everything is back to normal. But some incidents are more serious. The root cause turns out to be architectural. The fix requires a database migration, or a significant code refactor, or a coordinated change across multiple services. The incident is "mitigated" — users can limp along — but the real problem is not fixed, and fixing it will take days or weeks.
This is the moment when the incident becomes a project. And it is one of the trickiest transitions in engineering leadership, because the tools and habits of incident response — urgency, centralized command, parallel fast execution — are almost the opposite of the tools and habits of project execution — careful planning, distributed ownership, structured milestones.
Recognizing the Transition Point
The transition from incident to project happens when:
- The system is mitigated but not fixed, and the fix requires more than a few hours of engineering work
- The root cause requires a structural change — a new component, a migration, a refactor — not a quick patch
- You need multiple engineers across multiple teams working in a coordinated way over multiple days
- The work cannot be managed by the on-call rotation alone
The classic mistake is to continue treating project work as incident work after this transition. The team stays in "war room mode" — high urgency, everyone pulled in, constant status updates, no planning — for days. People burn out. Progress is slow because there is no actual plan, just a shared anxiety about an unresolved problem. The work gets done eventually, but it costs two or three times what it should have.
Making the Transition Explicit
The IC's last job in an incident is to declare the transition: "We have mitigated the immediate impact. The root cause fix is now a tracked project. Here is what the project looks like, here is who owns it, and here is how we will communicate progress."
This declaration does several things. It tells the team that the war-room mode is over and they can return to their normal work. It tells leadership what the path to full resolution looks like and when they can expect it. It frees the person who led the incident response — usually not the right person to own a multi-week project — to hand off to whoever will actually drive it to completion.
The Handoff Contract
When you hand off from incident to project, three things must be in the handoff document: (1) the current state — exactly what is mitigated, what is not, and what the residual user impact is; (2) the known root cause and the proposed fix; (3) the risks of the proposed fix — what could go wrong during the repair that would require another incident response. Without these three things, the project owner is starting blind.
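A sketch of the handoff contract as a checklist object, in Python; the field names are mine, and the only point is that an empty section blocks the handoff.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentHandoff:
    """The three things the project owner needs before they take over."""
    current_state: str          # what is mitigated, what is not, residual user impact
    root_cause_and_fix: str     # the known root cause and the proposed fix
    repair_risks: list[str]     # what could go wrong during the repair itself
    open_questions: list[str] = field(default_factory=list)

    def ready_to_hand_off(self) -> bool:
        # Any of the three sections missing means the project owner starts blind.
        return bool(self.current_state.strip()
                    and self.root_cause_and_fix.strip()
                    and self.repair_risks)
```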
Managing a Team Under Pressure: The Human Side
Everything we have discussed so far is process. Process is necessary but not sufficient. The other dimension of crisis leadership is managing people — specifically, managing people who are stressed, sleep-deprived, and under scrutiny in ways that their normal work does not produce.
The Pressure Cooker Effect
Under pressure, people's best traits and worst traits both get amplified. Your most careful engineer becomes either more carefully methodical (good) or more paralyzed by uncertainty (bad). Your most action-oriented engineer becomes either faster and more decisive (good) or more reckless and less thorough (bad). You will need to read each person on your team individually during a crisis and adjust how you interact with them accordingly.
Some specific people problems that come up in crises and how to handle them:
The Brilliant Hypothesis Machine
This person generates a new theory about root cause every three minutes. Each theory sounds plausible. None of them are followed through before the next one appears. They are not being unhelpful on purpose — they are just thinking out loud at high speed. The fix is to direct them: "Pick your best theory and spend the next 10 minutes looking for evidence that disproves it. Report back." Give them a specific task that channels the energy rather than trying to quiet it.
The Frozen Expert
This is the person who knows the affected system better than anyone else and is paralyzed by that knowledge. They can see every possible failure mode simultaneously and cannot commit to investigating any of them because they are all plausible. The fix is to start them somewhere specific: "Start with the last deployment. Was anything different about it? Just tell me that one thing." Start small and narrow. Once they are moving, they usually find their footing.
The Status Broadcaster
This person posts every finding, every query result, every half-formed thought into the incident channel, creating a flood of information that overwhelms the channel. The fix is gentle but direct: "Keep your working notes in a separate thread or document and only post to the main channel when you have something conclusive or a clear question for the group." Most people do not realize they are doing this.
The Executive Who Joins the Channel
When a VP or Director joins the incident channel, every engineer in the channel immediately starts performing for that audience instead of focusing on the problem. You will see messages become longer and more polished, stated certainty go up, and the actual pace of investigation slow down — all bad things. The best response, handled by the IC with tact, is to direct the executive toward a separate status thread: "I'll keep you updated in the exec thread every 15 minutes. This channel is for technical coordination — it will be noisy and confusing." Most executives are relieved not to have to parse technical Slack.
Protecting Your Team's Energy
Long incidents — anything over two or three hours — require active energy management. People who have been in crisis mode for four hours are not as sharp as they were in the first hour. They make mistakes. They miss things. But they will not tell you this, because stepping back feels like abandoning the team.
The IC's job includes forcing rotation. After 90 to 120 minutes, each investigator should hand off to someone fresh if possible, document their current findings clearly, and take 20 minutes away from the incident. The 20 minutes lost in the handoff is more than recovered in the sharper thinking of the fresh investigator.
For very long incidents — ones that run through the night or across multiple days in mitigation mode — you need an explicit incident rotation schedule. Who is covering which hours? What is the handoff protocol? What does the night-shift responder need in writing before the day-shift responder goes to sleep? Treating this informally, as people have always done, leads to coverage gaps, missed escalations, and engineers who are running on four hours of sleep and making structural-level decisions about production systems.
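For long incidents, even the rotation schedule deserves to be explicit rather than implied. A minimal sketch, with made-up names and shift boundaries, that catches coverage gaps before they become a page at 3 AM with nobody awake:

```python
from datetime import time

# Hypothetical overnight rotation for an incident still in mitigation mode.
# Each entry: (shift start, shift end, responder).
ROTATION = [
    (time(22, 0), time(2, 0), "alice"),
    (time(2, 0), time(6, 0), "bob"),
    (time(6, 0), time(10, 0), "carlos"),
]


def coverage_gaps(rotation):
    """Return boundaries where one shift ends and the next does not begin."""
    return [
        (end, start)
        for (_, end, _), (start, _, _) in zip(rotation, rotation[1:])
        if end != start
    ]


assert coverage_gaps(ROTATION) == []
```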
The Post-Mortem: Not a Blame Session, Not a Formality
The post-mortem is the final act of the incident, and it is the one most frequently done badly. It is done badly in two very different ways, and interestingly, both failures come from the same root cause: discomfort.
The first failure is the blame session. Someone caused the outage — they deployed the bad code, or they missed the warning signs, or they made the wrong call during the response. The post-mortem becomes, explicitly or implicitly, a process of identifying and punishing that person. The lesson "learned" is "this specific person made this specific mistake." Nothing structural changes because the problem has been attributed to human error and the human has been identified.
The second failure is the formality. Everyone knows blame sessions are bad, so the post-mortem swings to the opposite extreme: nothing is anyone's fault, everything is a "systemic issue," and the document concludes with vague action items that nobody owns and nobody ever follows up on. The lesson "learned" is nothing at all. The same incident will happen again in nine months.
A post-mortem that does not produce specific, owned, time-bound action items is not a post-mortem. It is a ritual that makes the team feel like they did something without actually doing anything.
The Right Frame: Systems Thinking Applied to the Incident
The core discipline of a good post-mortem is asking "why" at every level until you reach something structural — something about the system, the process, the tooling, or the environment — that can be changed. The goal is not to find the person who made a mistake. The goal is to find the conditions that made that mistake easy to make, and then change those conditions.
A classic example: an engineer deployed code that caused an outage because the deployment did not have proper canary staging. The blame-session version: "the engineer should have been more careful." The structural version: "we do not have canary deployments that catch this class of bug." The structural finding leads to an action item that prevents the next ten engineers from making the same mistake. The blame-session finding leads to one slightly more nervous engineer.
The Five Elements of a Good Post-Mortem Document
1. Timeline. The complete chronological record: when the problem started, when it was detected, every action taken and its result, when it was mitigated, when it was resolved. This is built from the timeline document kept during the incident. It is objective and factual — no interpretation, no judgment, just what happened and when.

2. Impact. What did users experience? For how long? How many users were affected? Was there data loss, data corruption, or financial impact? Being specific here — "22% of checkout requests failed between 2:17 and 3:41 AM, affecting approximately 4,300 unique users" — makes the severity real and sets the right context for the seriousness of the action items.

3. Root Cause Analysis. What actually caused the incident? Not the proximate cause (the thing that broke last), but the causal chain going back to where something structural enabled the failure. Use the five whys or a fishbone diagram if needed, but keep going until you get to something that can be changed. "A code bug" is not a root cause. "Code was deployed without adequate integration test coverage for this class of mutation" is a root cause.

4. What Went Well. This section is not diplomatic fluff — it is genuinely important. Understanding what worked means you can preserve and strengthen those things. Did the on-call rotation catch the issue quickly? Did the runbook accurately describe the mitigation steps? Did someone's clear-headed decision save 30 minutes of investigation? Name it specifically, explain why it worked, and consider making it a standard practice.

5. Action Items. Each action item has three fields: what exactly will be done, who owns it by name (not "the team"), and by when. No exceptions. An action item without an owner is not an action item — it is a wish. Review these action items in the next team meeting and track them in your project management system, not just in the post-mortem document.
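A sketch of the action-item rule as code, in Python, with field names of my own choosing; the useful part is the check that refuses unowned or undated items.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    what: str                    # exactly what will be done
    owner: Optional[str] = None  # a person's name, never "the team"
    due: Optional[date] = None   # a real date, not "soon"

    def is_real(self) -> bool:
        """An action item without an owner and a date is a wish."""
        return (bool(self.what.strip())
                and self.owner is not None
                and self.owner.strip().lower() != "the team"
                and self.due is not None)
```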
The Post-Mortem as a Project Restart
There is a subtler purpose to the post-mortem that most teams miss. Beyond preventing the same incident from happening again, the post-mortem is the moment when the incident becomes part of the project's history in a structured way. The root cause analysis often reveals that the incident was a symptom of something larger — a piece of technical debt, an architectural weakness, a process gap that has been accumulating for months.
When you find this, you have a choice. You can document it as an action item and hope someone gets to it eventually. Or you can treat it as the start of a new project: scope it, estimate it, give it an owner, and put it in your roadmap with the appropriate priority.
The best post-mortems produce not just a list of tactical fixes but a clear-eyed assessment of whether the team is sitting on a structural risk that deserves a real project. That is a harder conversation — it involves resources, prioritization, and making the case to leadership that something they cannot directly see is worth significant engineering investment. But it is the conversation that prevents the next crisis.
Running a Parallel Project During an Active Crisis
There is one more scenario worth addressing explicitly: you are running an ongoing project — months of work, multiple teams, real milestones — and then a crisis hits and you are pulled in. Not just pulled in as an observer, but as a technical lead whose judgment and attention are genuinely needed in the incident.
This happens. It is not a planning failure. Crises do not check your project calendar before they occur. But it creates a specific execution problem: you now have two things that both require leadership, and you cannot give 100% to either.
The First Thing to Do: Triage Both
The moment you realize you are being pulled into a crisis, your first question is not "how do I manage the incident?" It is: "What will break on my project if I am not available for the next four hours?"
Most projects have a small number of time-sensitive decision points at any given time. If none of them are in the next four hours, you can safely give most of your attention to the crisis and check in on the project periodically. If there is a critical decision point — a cross-team meeting you cannot miss, a deployment gate that needs your approval, a blocker that only you can unblock — you need to either resolve it before you go deep into the crisis or delegate it to someone on your team with explicit authority to make the call.
Delegating the Project During a Crisis
"I need to focus on the incident for the next few hours. Alice has authority to make decisions on the project while I'm unavailable. Alice, here's the context you need: the three things most likely to need a decision are X, Y, and Z. Here's where I stood on each. Check with me before doing anything irreversible if at all possible."
This takes ten minutes to set up and prevents your project from stalling or going sideways while you are in crisis mode. The alternative — leaving no one in charge and hoping nothing comes up — almost always results in either a stall or a decision made with incomplete information that you have to revisit later.
Coming Back to the Project After the Crisis
After the incident is resolved or mitigated and you return to the project, the worst thing you can do is assume everything continued normally in your absence. Hold a 15-minute re-entry conversation: what decisions were made, what context was accumulated, what changed. This prevents you from re-litigating decisions that have already been made and from operating on a mental model of the project that is several hours out of date.
The Long-Term Lesson: Building a Team That Does Not Need Heroics
There is a version of crisis response that organizations celebrate and reward that is quietly dangerous: the heroics model. In the heroics model, an incident is resolved by one or two extremely capable people who stay up all night, pull off something impressive, and save the day. Leadership praises them. The team admires them. The incident is added to the mythology of the company.
The problem is structural. Heroics mask the underlying dysfunction. If your system requires heroics to recover from failures, the failure mode is not the incident — the failure mode is the system that makes heroics necessary. Insufficient runbooks. Insufficient observability. Insufficient redundancy. Insufficient automation. A single point of knowledge — one engineer who understands how this thing actually works.
The Heroics Trap
Every time a hero saves the day, two things happen. First, the structural problem that required heroics goes unfixed because "the system works — look, we recovered." Second, the hero gets reinforced for the behavior, making them less likely to invest in the operational improvements that would make their heroics unnecessary. The team that runs cleanly on a boring rotation is more reliable than the team with one legend who fixes everything at 3 AM.
The best crisis leaders I have seen over a long career are working constantly to make themselves unnecessary. They write runbooks so that the junior on-call can handle what used to require a senior. They invest in observability so the alert fires before users notice rather than after. They design mitigation strategies — circuit breakers, graceful degradation, automated rollback — so that the system partially heals itself before anyone is paged.
This investment is hard to prioritize. It never appears on a roadmap. It does not have a deadline or a PM pushing it. It pays off in incidents that do not happen, decisions that turn out to be easy because the information was available, and nights you sleep through because a monitor caught something early and a runbook walked the on-call through a resolution without escalating. The absence of crisis is invisible. It is also the measure of a truly well-run team.
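Of the self-healing mechanisms mentioned above, the circuit breaker is the easiest to show in a few lines. This is a generic textbook sketch, not any particular library's implementation: after enough consecutive failures it stops calling the dependency for a cool-down period and fails fast, which is exactly the kind of quiet mitigation that prevents a 3 AM page.

```python
import time


class CircuitBreaker:
    """After too many consecutive failures, stop calling the dependency
    for a cool-down period and fail fast instead of piling on."""

    def __init__(self, max_failures: int = 5, reset_after_seconds: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: allow one probe call through (half-open).
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```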
· · ·
Putting It Together: The Mindset Shift
Here is the thing about working in a crisis that takes most engineers years to fully internalize: the goal is not to fix the problem as fast as possible. The goal is to manage the situation — the technical investigation, the team's focus, the stakeholder communications, and the energy of everyone involved — in a way that produces the best outcome over the whole arc of the event, including the post-mortem and the follow-up project.
A team that resolves a P0 in 45 minutes through a clear process, with a full timeline and a set of concrete action items, is more valuable than a team that resolves the same P0 in 30 minutes through heroics, with no documentation and the same structural problem waiting to surface again in six weeks.
The IC mindset — staying above the weeds, managing the room, communicating clearly, keeping the response structured — is not a natural way for most engineers to operate in high-stress situations. It has to be practiced, ideally before a real crisis, through game days and rehearsed incident simulations. The first time you try to run an incident with a proper IC structure should not be during your company's worst outage of the year.
Build the habit. Practice the protocol. And remember that the fire drill everyone finds slightly awkward and performative is what makes the real fire something the team handles with calm precision rather than improvised chaos.
The Key Principle of This Chapter
A crisis does not punish bad engineers. It punishes teams that have no structure for running parallel work under pressure. Give one person the room. Run three tracks at once. Communicate by audience, not by volume. And treat every incident as a rehearsal for building the system that makes the next one smaller.
The Most Common Mistakes
- The IC becomes an investigator the moment a familiar problem appears, and the coordination collapses.
- Communication goes to everyone or to no one — either extreme prevents leadership from forming an accurate picture.
- The incident is mitigated but never formally transitioned to a project, so the root cause fix drifts forever on someone's to-do list.
- The post-mortem produces action items with no owners and no dates, and nothing changes.
- Heroics get celebrated instead of the structural investments that would make heroics unnecessary.
Three Questions for Your Next Crisis
- Before you start investigating: who is the IC, and have they explicitly agreed to the role?
- At the 15-minute mark: have you sent the first external status update with a specific time for the next one?
- After resolution: does your post-mortem have at least one action item that addresses something structural — not just the symptom?