Part VIII — Shipping

Chapter 30

The Launch Plan

Launch is not a date on a calendar. It is a state machine — a sequence of conditions that must be true before you are allowed to move forward. The team that treats it as a date fails; the team that treats it as a checklist of states ships safely.

What You'll Learn in This Chapter

This chapter takes you from "the code is done" to "users are using it safely." That gap is where most launches go wrong.

  • Why launch is a state machine, not a date
  • The three things you must decide before writing the plan
  • The full anatomy of a launch plan: three phases
  • The launch checklist that actually gets checked
  • Big bang vs. gradual vs. feature-flagged vs. dark launch
  • How to run a war room and make the go/no-go call
  • Rollback: planning for the thing you hope doesn't happen
  • The five failure modes that kill launches at the finish line

Launch Is Not a Date

There is a version of "launch day" that every engineer has lived through. You have been building for months. The date is circled on the calendar. Leadership has been told. Everyone is excited. The date arrives. And then you discover, on that morning, that your database connection pool is sized for test traffic, not production traffic; that one of the three dependent teams finished their part the night before and it has never been tested in combination with yours; and that nobody wrote the runbook for the most likely failure mode.

This is not bad luck. This is what happens when a team treats launch as a date rather than a state.

A date is a point in time. A state is a set of conditions that are either true or false. "February 14th" is a date. "All rollback procedures have been tested, the on-call engineer knows how to execute them, and load tests have passed at 2x expected traffic" is a state.

The difference matters enormously. A date can arrive whether you are ready or not. A state cannot. You cannot lie to a state. You can lie to a date — you can tell yourself the team will "figure it out on the day" — but you cannot lie to a checklist that says "load test passed at 2x traffic: yes/no."

The best launch plans are collections of states. They say: here is what must be true in order to proceed to the next phase. Here is what we will check. Here is who approves it. Here is what happens if any of it is not true. The date is a target, not a commitment. The states are the commitment.

"A launch date is a wish. A launch checklist is a contract. You can blow through a wish. You cannot skip a contract without everyone knowing."

The State Machine Model

Think of your launch as a series of states that your system and your organization must move through. Each state has an entry condition — what must be true before you enter it — and an exit condition — what must be true before you leave it.

The Launch State Machine

  Development Complete
      (code written, reviewed, merged)
        ↓ exit: all acceptance tests pass; no P0 bugs open
  Pre-Launch Readiness
      (checklist items verified)
        ↓ exit: all checklist gates passed; go/no-go approved
  Launch Window
      (rollout executing, war room active)
        ↓ exit: traffic stable at full rollout; no critical alerts
  Stabilization Period
      (monitoring closely, ready to roll back)
        ↓ exit: 72 hours stable; error rates nominal; on-call comfortable
  Launched
      (feature flags cleaned up, runbooks archived)

Notice that every transition has a condition. You don't get to move from "Pre-Launch Readiness" to "Launch Window" just because the calendar says so. You get to move when the checklist says so. This is the fundamental shift in how you think about launching.

At any state, you can also transition backwards — or sideways to a "blocked" state. If something fails during the launch window, you don't just push through because the date is today. You pause, assess, and either fix it fast or roll back. The state machine makes this explicit and removes the social pressure to proceed unsafely.
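
If it helps to make the model concrete, here is a minimal sketch of the same state machine in code. The state names and exit conditions come from the diagram above; the check functions are hypothetical stubs standing in for real queries against your test runner, bug tracker, and dashboards.

Sketch — The State Machine in Code (hypothetical checks)

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class LaunchState:
    name: str
    exit_conditions: List[Callable[[], bool]]  # all must be true to advance

# Hypothetical stubs: in practice each one queries a test runner,
# a bug tracker, a metrics dashboard, or records a human sign-off.
def acceptance_tests_pass() -> bool: return True
def no_p0_bugs_open() -> bool: return True
def checklist_gates_passed() -> bool: return True
def go_no_go_approved() -> bool: return False  # meeting not held yet

STATES = [
    LaunchState("Development Complete", [acceptance_tests_pass, no_p0_bugs_open]),
    LaunchState("Pre-Launch Readiness", [checklist_gates_passed, go_no_go_approved]),
    # ... Launch Window, Stabilization Period, Launched
]

def may_advance(state: LaunchState) -> bool:
    # You move forward when the conditions say so, never because the calendar does.
    return all(check() for check in state.exit_conditions)

for state in STATES:
    if not may_advance(state):
        print(f"Blocked in state: {state.name}")  # here: Pre-Launch Readiness
        break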

Before You Write the Plan

Most engineers jump straight to writing the launch checklist. That's a mistake. Before you write a single item, you need to answer three questions. The answers to these questions determine the entire shape of your plan — which rollout strategy you use, how long your stabilization window is, who needs to be in the war room, and how detailed your rollback procedure needs to be.

The Three Questions

Question 1: What does failure look like, and how quickly will we know?

Some failures are loud. Your service crashes and alerts fire immediately. Some failures are quiet. A subtle data corruption bug that only affects 0.1% of users. A regression in a rarely-used code path. A metric that drifts the wrong direction over 48 hours.

Before you plan the launch, map out the top five ways this launch can fail. For each one, ask: how long does it take for this failure to become visible? How visible is it — loud alarm or slow drift? This tells you how long your stabilization window needs to be, and it tells you which metrics you need to watch in real time during the launch itself.

If your failures are loud (service down, error rate spikes), you might feel confident doing a fast rollout. If your failures are quiet (user-visible bugs that take days of usage to surface), you need a slow rollout with a long stabilization window.

Question 2: What is the blast radius?

The blast radius is the answer to: if this launch goes badly, how many users are affected, and how badly? A launch that affects 100% of users with potential data loss has a much larger blast radius than a launch that affects 10% of users with a slightly degraded UI.

Blast radius determines your rollout strategy. High blast radius means you start with a tiny slice of traffic — 1% or less — and you instrument everything before you roll out further. Low blast radius means you can be more aggressive.

Blast radius also determines whether you need external communications — do you need to notify users in advance? Do you need customer support briefed? Does legal need to review anything? The larger the blast radius, the more of the organization needs to be prepared.

Question 3: Can we roll back, and how long does it take?

This is the question most teams skip. They assume rollback is possible without ever checking. This assumption has caused many very bad days.

Rollback is not always possible. If your launch includes a database schema migration that removes a column, and your new code has been running against that schema for an hour, rolling back means the old code will fail because the column it expects is gone. You cannot simply revert the binary. You have to run a different migration first.

Before you finalize your launch plan, you must answer: what does rollback actually mean for this specific launch? What are the exact steps? How long do they take? Who can execute them at 2am? Have we tested them?

If rollback is difficult or impossible, your launch plan must be significantly more conservative. Slow rollout, longer stabilization, more checkpoints. The difficulty of rollback is inversely proportional to how fast you should be willing to proceed.

Common Mistake

Treating rollback as a philosophical concept rather than a tested procedure

"We can always roll back" is one of the most dangerous sentences in engineering. The truth is: you can roll back if and only if you have a written procedure, someone has practiced it, and it has been tested in a staging environment that is similar enough to production to be meaningful. If any of those three conditions is missing, you don't actually have rollback — you have optimism.

Anatomy of a Launch Plan

A launch plan is a document — not long, but specific. It covers three phases: pre-launch, the launch window, and stabilization. Each phase has a set of tasks, owners, and exit criteria. Let's walk through each one.

Phase 1: Pre-Launch — Getting Ready to Launch

This phase covers everything from "code is done" to "we are ready to begin rollout." Its job is to eliminate surprises. Every surprise you can eliminate here is a crisis you avoid during the launch window, when stress is high and time is short.

  • Complete the launch checklist. Every item verified, every owner signed off. See the checklist section below for the full breakdown.
  • Run the load test. Not a load test you ran three weeks ago when the feature was 70% done. A load test against the final code, at 2x your expected peak traffic, run long enough that any memory leak or slow degradation shows up.
  • Test the rollback procedure. Actually execute the rollback in staging. Time it. Make sure the right person knows how to do it. Write down every command, every step, every gotcha. Don't leave this to improvisation.
  • Brief the on-call engineer. The person holding the pager during launch needs to understand what changed, what the expected failure modes are, and exactly what to do if each one occurs. This is not optional. You cannot launch and leave an on-call engineer in the dark.
  • Confirm monitoring is in place. Every metric you identified as a failure signal must have a dashboard. Every dashboard must have an alert. Every alert must have an owner. Test the alerts — actually trigger them in staging and verify they fire where they should.
  • Communicate the plan. Your team, your stakeholders, and anyone whose work is adjacent to this launch should know: here is what we're launching, here is when, here is who to contact if something seems wrong. No surprises for anyone who might be affected.
  • Run the go/no-go meeting. 24-48 hours before launch, gather the relevant people and run through the checklist together. The goal is not to approve the date — the goal is to confirm the states. If any state is not green, you don't launch on that date. You fix the state.

Phase 2: Launch Window — Executing the Rollout

This is the phase where you actually move traffic to the new system. The war room is active. Everyone is watching. Every decision is deliberate. Nothing happens by accident.

  • Start small. Regardless of your rollout strategy, you start with the smallest reasonable slice of traffic. Watch it for a defined period — not "a while," but a specific amount of time. 15 minutes, 30 minutes, 1 hour. Only when all signals are green do you proceed to the next increment.
  • Document every action. Someone writes down a timestamped log of every action taken during the launch. "10:14am — increased rollout to 5%. 10:21am — error rate stable at 0.2%. 10:27am — increased to 10%." This log becomes invaluable if something goes wrong and you need to reconstruct what happened and when.
  • Assign a single decision-maker. The war room can have many people watching, but only one person has the authority to say "proceed" or "stop." That person is designated in advance. Everyone else can raise concerns, but the decision is not made by committee.
  • Define the abort criteria before you begin. "We will roll back if error rate exceeds X% for more than Y minutes." Fill in X and Y before you start (see the sketch after this list). Do not define the abort criteria in the middle of the rollout, when you are stressed and the team is arguing about whether the numbers are bad enough to justify rolling back. That argument happens before the launch, not during it.
  • Do not launch on a Friday. This one deserves its own rule. Do not launch significant changes on a Friday afternoon. If something goes wrong, you will have two days of degraded service while your best engineers are unavailable. Launch on Tuesday or Wednesday, when you have maximum runway before the weekend.
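
To make "fill in X and Y before you start" concrete, here is a minimal sketch of abort criteria as data, plus a check the war room can run against live readings. The metric names and thresholds are placeholders, not recommendations.

Sketch — Abort Criteria Defined Before the Rollout (placeholder numbers)

from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class AbortCriterion:
    metric: str
    threshold: float        # "exceeds X"
    sustained_minutes: int  # "for more than Y minutes"

# Agreed on in the go/no-go meeting; never negotiated mid-rollout.
ABORT_CRITERIA = [
    AbortCriterion("error_rate_pct", threshold=1.0, sustained_minutes=5),
    AbortCriterion("p99_latency_ms", threshold=800.0, sustained_minutes=10),
]

def should_abort(readings: Dict[str, List[float]]) -> bool:
    """readings: one sample per minute per metric, oldest first."""
    for c in ABORT_CRITERIA:
        recent = readings.get(c.metric, [])[-c.sustained_minutes:]
        if len(recent) == c.sustained_minutes and all(s > c.threshold for s in recent):
            return True  # exceeded for the full window: roll back
    return False

# Error rate above 1.0% for five straight minutes. The decision is already made.
print(should_abort({"error_rate_pct": [0.4, 1.2, 1.3, 1.5, 1.4, 1.6]}))  # True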

Phase 3: Stabilization — Making Sure It Sticks

Rollout is complete. Traffic is fully on the new system. The war room has disbanded. This is not the end — this is where you watch carefully, because some failure modes only reveal themselves at scale over time.

  • Keep the rollback capability alive. Even after a successful rollout, you should maintain the ability to roll back for at least 24-72 hours. Do not clean up feature flags, do not drop database columns, do not deprecate the old API endpoints the moment the rollout completes. The stabilization window is the window during which you keep your escape hatch open.
  • Monitor the slow signals. During the rollout you were watching fast signals — error rates, latency, crash rates. Now you watch the slow signals — user behavior metrics, funnel conversion rates, support ticket volume, database growth rates. These signals tell you about problems that don't show up in your infrastructure dashboards.
  • Define an end to the stabilization period. The stabilization period ends when you have declared the launch healthy. That declaration should be explicit — it's not just time passing. It's someone with authority saying "I have reviewed the signals from the past 72 hours, everything looks normal, we are declaring this stable." That is when you clean up the feature flags and close the launch war room channel.
  • Write the post-launch summary. What went well, what surprised you, what you'd do differently. This summary is not for blame — it is for the next launch. If you don't write it, you will make the same mistakes again.

The Launch Checklist

The launch checklist is the most important artifact in your launch plan. It is the thing that stands between you and shipping something that is going to hurt your users or wake up your on-call team at 3am.

A good checklist has a few properties. First, it is specific — not "monitoring is set up" but "P95 latency alert configured on the payments service, fires to #payments-oncall, threshold 800ms." Second, every item has an owner — a name, not a team. Third, every item is binary — it is either done or it is not done. "Mostly done" and "good enough" are not states on a checklist.
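
Those three properties map directly onto a tiny data structure, if you want to keep the checklist machine-checkable. A minimal sketch, with a hypothetical item and a hypothetical owner name:

Sketch — Checklist Items as Data (hypothetical item)

from dataclasses import dataclass
from typing import List

@dataclass
class ChecklistItem:
    description: str    # specific, not "monitoring is set up"
    owner: str          # a name, not a team
    done: bool = False  # binary: "mostly done" is not a state

ITEMS = [
    ChecklistItem(
        description=("P95 latency alert configured on the payments service, "
                     "fires to #payments-oncall, threshold 800ms"),
        owner="Priya",  # hypothetical name
    ),
]

def ready_to_launch(items: List[ChecklistItem]) -> bool:
    blockers = [i for i in items if not i.done]
    for item in blockers:
        print(f"NOT DONE (owner: {item.owner}): {item.description}")
    return not blockers

print(ready_to_launch(ITEMS))  # False until every item is actually verified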

Here is a template. Your specific project will add items. The mistake is removing items without a very good reason.

Engineering Readiness

Code Quality
  • All feature code reviewed and approved by at least one other engineer
  • No P0 or P1 bugs open against this feature
  • All automated tests passing on the launch branch
  • Integration tests passing in staging with production-like data
  • Security review completed (if applicable — auth, data handling, PII)

Performance & Scale
  • Load test executed at 2x expected peak. P99 latency within budget.
  • Database query performance validated at expected data volumes
  • Connection pool sizes, thread pool sizes, queue depths set for production load (not test load)
  • Rate limits and circuit breakers configured and tested
  • Caching behavior confirmed (no cold-start stampede risk)

Data & Storage
  • Database migrations tested in staging. Forward migration and rollback migration both executed successfully.
  • No missing database indexes on hot query paths
  • Data backup confirmed current before launch begins
  • Storage growth projections reviewed — no risk of filling disk within 30 days

Rollback
  • Rollback procedure written and reviewed by at least one person who did not write it
  • Rollback procedure tested in staging and timed
  • On-call engineer can explain the rollback procedure without reading the document
  • Rollback abort criteria defined: if [metric] exceeds [threshold] for [duration], we roll back

Operational Readiness

Monitoring & Alerting
  • Dashboard created for the new feature. Link shared in launch plan doc.
  • Error rate alert configured. Threshold defined. Owner assigned. Alert tested.
  • Latency alert configured. P95 and P99 thresholds defined. Alert tested.
  • Business metric alerts configured (conversion rate, order volume, whatever matters)
  • Alerts routing to the right channel and to the right on-call rotation
  • Log verbosity during launch window confirmed — enough to debug, not so much it floods the system

On-Call Readiness
  • Runbook written for the top 3 failure modes. Runbook reviewed by on-call engineer.
  • On-call engineer briefed on: what changed, what to watch, what to do if each failure mode occurs
  • On-call engineer is not the same person who will be running the launch (avoid single point of exhaustion)
  • Escalation path defined: who to call if on-call engineer needs backup

Deployment
  • Deployment procedure documented and reviewed
  • Deployment tested in staging — exact same steps as production
  • Feature flag system confirmed working (if using flags)
  • Rollout percentages and schedule defined
  • Launch window confirmed: not a Friday, not during known high-traffic events

Stakeholder Readiness

Internal
  • Customer support team briefed: what's changing, what questions they may receive, who to escalate to
  • Sales / account management briefed (if feature affects customers they manage)
  • Legal / compliance sign-off obtained (if feature handles PII, financial data, or regulated content)
  • Leadership informed of launch timing and what success looks like

External (if user-facing)
  • User-facing changelog or release notes prepared
  • Help documentation updated before launch (not after)
  • Any required user notifications sent with appropriate lead time
  • In-app guidance or onboarding flows for the new feature reviewed

Rollout Strategies

How you actually deploy matters as much as whether you're ready. There are four main rollout strategies, each suited to different situations. Choosing the right one is a judgment call based on the blast radius, the reversibility of the change, and the confidence you have in your testing.

Big Bang: Everybody Gets It at Once

You flip the switch and 100% of users are on the new system immediately. This sounds scary but is sometimes the right call — particularly for infrastructure changes that can't be partially deployed, or for changes so small and well-tested that the blast radius is genuinely low.

The advantage of big bang is simplicity. There is none of the complexity of running two versions simultaneously, and no concerns about what happens at the boundary between old and new. You ship it, and it's done.

The disadvantage is that if something goes wrong, it is wrong for everyone immediately. This means your monitoring needs to be extremely good — you need to detect a problem within minutes, not hours — and your rollback needs to be fast and tested.

When to Use Big Bang

Big bang makes sense when: (1) the change is truly atomic — you cannot have half the system on the old version and half on the new; (2) the failure mode is loud and fast — you'll know immediately if something is wrong; (3) the rollback is fast — under 10 minutes to revert to the previous state; and (4) you've done extensive testing and your confidence is genuinely high.

Gradual / Percentage Rollout: Turning the Dial Slowly

You start with a small percentage of users on the new version — say, 1% — and gradually increase that percentage as you build confidence. 1% → 5% → 10% → 25% → 50% → 100%, with a monitoring checkpoint at each step.

This is the most common strategy for significant launches at mature companies. It contains the blast radius at every step. If something goes wrong at 5%, you've affected 5% of users, not all of them. You have time to diagnose before it spreads.

The gradual rollout works best when users are interchangeable — when it doesn't matter which users see the new version. If you have strong user-level consistency requirements (where the same user must always see the same version), percentage rollout can create confusing experiences and you need to be thoughtful about how you implement it.
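
One common way to get that consistency, sketched below, is to bucket each user by a stable hash of their ID instead of a per-request coin flip: the same user always lands in the same bucket, and turning up the dial only ever adds users. The salt string is a hypothetical launch-specific value.

Sketch — Deterministic Percentage Bucketing

import hashlib

def rollout_bucket(user_id: str, salt: str = "checkout-launch") -> float:
    """Map a user to a stable point in [0, 100). Same user, same bucket, every request."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000 * 100  # first 32 bits, scaled to [0, 100)

def in_rollout(user_id: str, percent: float) -> bool:
    return rollout_bucket(user_id) < percent

# A user admitted at 5% is still admitted at 10%, 25%, and 50%: the dial
# only adds users; nobody flips back and forth between versions.
print(in_rollout("user-42", 5.0))  # stable answer for this user, every time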

Gradual Rollout Timeline — Example
Day 1 — 08:00  ▶ Deploy to 1%
               Watch for 30 min. Error rate: 0.18% (baseline). Latency P99: 240ms.

Day 1 — 08:35  ▶ Increase to 5%
               Watch for 1 hour. Error rate: 0.20%. No alerts fired.

Day 1 — 09:45  ▶ Increase to 10%
               Watch for 2 hours. Error rate: 0.35%  ← slightly elevated
               Investigate → traced to a noisy log in one service, not a real error.

Day 1 — 12:00  ▶ Increase to 25%
               Watch for remainder of day. All signals nominal.

Day 2 — 09:00  ▶ Increase to 50%
               Watch for 4 hours. No issues.

Day 2 — 14:00  ▶ Increase to 100%
               ✓ Launch complete. Stabilization period begins.

Notice what this timeline makes possible. On Day 1 at 09:45, the team saw a slightly elevated error rate. Because the rollout is gradual, they have time to investigate — it was a noisy log entry, not a real problem. In a big bang deployment, that same observation would have been terrifying. In a gradual rollout, it was a 15-minute investigation during a low-stakes moment.

Feature-Flagged: Decoupling Deploy from Launch

This is the most powerful and flexible strategy. You deploy the code to 100% of servers — but the feature is off for everyone. The code is in production but dormant, controlled by a configuration flag. Then you turn the flag on for specific users — maybe internal employees first, then a beta group, then wider audiences — all without any code deployment.

Feature flags give you something that no other strategy provides: the ability to deploy code and launch features as two completely separate events. The engineer who finishes the code at 10pm on Thursday doesn't have to worry about the launch — they just merge the code with the flag defaulted to off. The launch happens whenever the team is actually ready, on its own schedule.
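
A minimal sketch of what the flag check looks like in application code, assuming a hypothetical in-process flag store rather than any particular vendor's SDK:

Sketch — A Feature Flag Check (hypothetical flag store)

# Real systems fetch this from a flag service and refresh it live,
# which is what makes launching possible without a deployment.
FLAGS = {
    "new_checkout": {
        "enabled": False,               # deployed everywhere, dormant
        "allow_users": {"employee-7"},  # targeted early access
    },
}

def flag_on(name: str, user_id: str) -> bool:
    flag = FLAGS.get(name)
    if flag is None:
        return False  # unknown flag: fail safe to the old path
    return flag["enabled"] or user_id in flag["allow_users"]

def checkout(user_id: str) -> str:
    if flag_on("new_checkout", user_id):
        return "new checkout flow"  # launched for this user
    return "old checkout flow"      # stays live until the flag is cleaned up

print(checkout("employee-7"))  # new checkout flow
print(checkout("user-42"))     # old checkout flow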

Advantages

  • Decouples deployment from launch
  • Easy to target specific users (employees, beta testers, specific geographies)
  • Instant rollback — flip the flag off
  • Enables A/B testing
  • No risk of half-deployed code creating inconsistency

Costs

  • Requires a feature flag system (real infrastructure investment)
  • Flags accumulate as technical debt if not cleaned up
  • Code complexity from flag branches
  • Harder to reason about system behavior when flags are nested
  • Testing matrix grows — old path and new path both need coverage

The most important discipline with feature flags is cleaning them up. A flag that was temporary for a launch and is still sitting in the code a year later is technical debt. The old code path behind the disabled flag is a maintenance burden — it has to be kept working even though nobody uses it. Teams that use feature flags heavily need a strict policy: when a launch is declared stable, the flag is scheduled for removal within a defined timeframe.

Dark Launch: Testing Under Real Load Without User Impact

A dark launch is when you send real production traffic to the new system but don't show the results to users. The request goes to both the old system and the new system. The old system's response is what the user sees. The new system's response is logged, compared, and analyzed — but never shown.

This technique is enormously powerful for high-risk changes where you need to know "will this new system behave correctly under real traffic?" before you commit to showing it to users. The new system gets exercised at full production scale and load, but there is zero user impact if it misbehaves.
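
A minimal sketch of the pattern, assuming hypothetical engine callables. The critical constraint, especially for payments, is that the shadow call must run in a mode with no side effects: it computes a result but charges nobody.

Sketch — Shadowing Traffic to a New System

import logging
import threading

log = logging.getLogger("dark_launch")

def handle(request, old_engine, new_engine):
    """Serve the old engine's answer; exercise the new engine in the shadow."""
    old_result = old_engine(request)  # the only result users ever see

    def shadow():
        try:
            # Must be side-effect free: compute the answer, touch nothing real.
            new_result = new_engine(request)
            if new_result != old_result:
                log.warning("mismatch for %r: old=%r new=%r",
                            request, old_result, new_result)
        except Exception:
            log.exception("new engine failed for %r", request)
        # Either way, the user was never affected.

    threading.Thread(target=shadow, daemon=True).start()
    return old_result

# Usage with stand-in engines:
print(handle({"amount": 1999},
             old_engine=lambda r: "approved",
             new_engine=lambda r: "approved"))

The mismatch log is the whole product of a dark launch: each entry is either a bug in the new system or a case where the new system is more correct, exactly as in the payment example that follows.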

Real Example — Payment Processing Migration

A team migrating their payment processing engine to a new provider used a dark launch for three weeks before the real launch. Every payment request was processed by both the old engine and the new engine. The old engine's result was what actually went through. The new engine's result was logged and compared automatically.

The comparison found 47 cases where the new engine returned a different result than the old one. 43 of them were bugs in the new engine. 4 were cases where the new engine was actually more correct. All 43 bugs were fixed before a single user saw the new engine's results. The real launch — when the new engine's results were used — had a near-zero issue rate.

Dark launches are not free. Running dual systems at full load costs money and engineering effort. But for launches where the cost of getting it wrong is very high — financial transactions, medical data, critical user-facing calculations — the cost of a dark launch is trivial compared to the cost of a bad live launch.

The Launch War Room

A war room is just a place where the relevant people are gathered — physically or virtually — for the duration of the launch window. The name sounds dramatic, and sometimes the experience is. But mostly it is a quiet room where people watch dashboards and respond quickly when something goes wrong.

Not every launch needs a war room. A small change with a narrow blast radius and a well-tested rollback can be launched by one engineer with a checklist. But for significant changes — new systems, major features, high-volume migrations — a war room is the right call. It exists to reduce the time between "something is wrong" and "we've done something about it" to the minimum possible.

Who Needs to Be There

The war room should be small. Every person you add is another person who might have an opinion, ask a question at the wrong moment, or muddy the decision-making. The right people are:

  • The launch commander. One person with the authority to make calls. Proceed, pause, roll back: every decision flows through them. They can consult others, but the decision is theirs alone. This role should be designated explicitly in the launch plan, before launch day.
  • The engineer who knows the system most deeply. Not necessarily the most senior person — the person who actually knows where to look when something goes wrong.
  • The on-call engineer. They need to see the launch happen so they understand what has changed by the time they're holding the pager alone.
  • A representative from each dependent team. If your launch depends on three other services, you want one person from each of those teams who can answer questions quickly and fix things on their end if needed.

Everyone else gets the live dashboard link and an agreement not to interrupt the war room unless they see something the war room doesn't.

The Go / No-Go Call

Before you begin the rollout, you make the go/no-go call. This is a formal moment — not casual, not implied. The launch commander goes through the checklist, confirms every item is green, and asks each team representative: "Are you go or no-go?" Each person must answer explicitly.

This might feel ceremonial. It is ceremonial — intentionally so. The ceremony forces people to be explicit about their readiness. It creates a moment where someone who has a nagging concern can say "no-go" and explain why, rather than staying quiet while the launch proceeds and the concern turns into an incident.

The Story Behind This Practice

NASA developed formal go/no-go calls for the Apollo program after the Apollo 1 fire, which killed three astronauts. The investigation found that multiple engineers had concerns before the test but didn't raise them. The formal go/no-go call was designed to create a moment where silence was not an option — you had to say something. Every person in the chain had to explicitly declare readiness or declare a problem. The practice has saved lives in spaceflight. The same principle saves systems in software — it creates a moment of accountability that is too easy to avoid in informal settings.

If anyone says no-go, you have two options: fix the issue and repeat the go/no-go, or delay the launch to the next planned window. You do not pressure people to change their vote. A no-go from one person who has a real concern is worth more than a smooth launch that hides a problem waiting to explode.

Rollback: The Plan You Hope to Never Use

Everyone acknowledges that rollback matters. Very few teams have actually tested their rollback. Even fewer have timed it, written it as a step-by-step procedure, and confirmed that the on-call engineer can execute it at 2am under pressure.

Here is the thing about rollback: the only moment you will ever need it is a moment when everything is going wrong simultaneously. Your error rate is spiking. Alerts are firing. Users are complaining. Leadership is asking for updates. In that moment, you cannot improvise a rollback. You execute the procedure you wrote and tested when things were calm.

A rollback plan answers these specific questions:

  1. What exactly gets rolled back? Just the binary? The binary plus the database migration? The feature flag plus the binary? Be precise.
  2. What are the exact steps? Command by command, action by action. Not "revert the deployment" but the actual commands to run, the actual dashboards to watch, the actual confirmation to look for that says rollback is complete. (A sketch of what such a procedure can look like follows this list.)
  3. How long does it take? Time it in staging. Know your SLA for rollback. If rollback takes 45 minutes, you need to start it 45 minutes earlier than you might have otherwise.
  4. Are there any data implications? If the new system wrote any data during its time live, does rolling back the code invalidate that data? Is there a cleanup step? Does the old code know how to handle data the new code wrote?
  5. Who can authorize a rollback? The launch commander can authorize a rollback during the war room. But what about at 3am, when the launch is over and the war room has disbanded? Who has the authority and the knowledge to initiate a rollback at that point?
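
Taken together, the answers tend to converge on the same artifact: a rollback procedure so explicit it can be executed as a script. A minimal sketch is below; the flagctl and deployctl commands are hypothetical stand-ins for whatever your deploy tooling actually is.

Sketch — A Rollback Procedure as an Executable Runbook (hypothetical commands)

import subprocess
import sys
from datetime import datetime, timezone

def step(description: str, command: list) -> None:
    """Run one rollback step with a timestamped log; stop hard on failure."""
    print(f"{datetime.now(timezone.utc):%H:%M:%S}Z  {description}")
    if subprocess.run(command).returncode != 0:
        sys.exit(f"ROLLBACK STEP FAILED: {description}. Escalate; do not improvise.")

# Hypothetical commands; yours come from the procedure you tested in staging.
step("Flip the feature flag off",
     ["flagctl", "disable", "new_checkout"])
step("Revert the service binary",
     ["deployctl", "rollback", "checkout-service", "--to", "v142"])
step("Confirm the old version is serving",
     ["deployctl", "verify", "checkout-service", "--version", "v142"])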

The Database Migration Problem

Database migrations deserve special attention in rollback planning because they are often not reversible in the simple sense. If you add a new column and the new code starts writing to it, rolling back the code doesn't remove the column or the data in it. The old code simply ignores the column, which is fine. But if you remove a column that the old code expects, rolling back the binary will immediately start throwing database errors.

The safe pattern is: always write migrations that are compatible with both old and new code during the transition period. Add columns before you deploy code that uses them. Remove columns only after the new code is fully deployed and stable. Never write a migration that assumes the code has already been fully rolled out. The migration and the code should be independently deployable and independently reversible.
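
A minimal sketch of that ordering, with a hypothetical orders table and a run_sql stand-in for your migration runner:

Sketch — The Expand/Contract Migration Pattern (hypothetical schema)

def run_sql(statement: str) -> None:
    print("would run:", statement)  # stand-in for a real migration runner

# Step 1, expand: add the new column BEFORE deploying code that writes it.
# Old code ignores the extra column, so rolling back the binary stays safe.
run_sql("ALTER TABLE orders ADD COLUMN tax_cents INTEGER")

# Step 2: deploy the new code, complete the rollout, finish stabilization.
# During this whole window, old and new code both work against this schema.

# Step 3, contract: drop the old column ONLY after the launch is declared
# stable and the rollback window has closed. Old code could no longer run
# after this point, which is exactly why this step waits until the very end.
run_sql("ALTER TABLE orders DROP COLUMN tax_amount")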

Common Failure Modes

There are five ways launches fail that are common enough to deserve explicit attention. Most of them are not technical failures. They are process failures — things that happen when the launch plan is incomplete or not followed.

1. The Checklist Nobody Actually Checked

The checklist exists. It lives in a document somewhere. On launch day, someone says "we went through the checklist" — but what they mean is "we looked at the checklist and assumed everything was probably fine." Nobody actually verified each item. Nobody signed their name next to a completed item. The checklist was theater, not practice.

Fix: assign owners to checklist items. Each item must have a person whose name is on it, who has verified the item themselves, and who is accountable if that item turns out to have been unchecked. The go/no-go meeting is the moment where each owner says "I checked mine, it's good." No names, no launch.

2. The Rollback That Wasn't

The launch goes wrong and the team decides to roll back. They attempt to execute the rollback procedure. It doesn't work — because the procedure was written for a slightly different version of the system, or because a step was missed, or because a dependency that didn't exist when the procedure was written now exists. Now the team is improvising under pressure, making the situation worse.

Fix: test the rollback in staging, with the exact code that will be in production, at least once before the launch. Time it. Put a "rollback tested on [date]: yes/no" item on the checklist. If the rollback hasn't been tested, the launch doesn't go.

3. The Missing Monitoring

The launch completes successfully. Everything seems fine. Three hours later, a customer reports a critical bug. The team looks at their dashboards. The dashboards don't show the bug — because nobody set up a metric for the specific thing that broke. By the time the bug is found, it has affected thousands of users for hours.

Fix: before the launch, explicitly list the top five ways this launch can fail, and confirm that there is a metric and an alert for each one. For every item on that list, you should be able to say "we would know within 10 minutes if this happened."

4. The Launch at 4pm on a Friday

The launch was supposed to happen on Thursday. It slipped. "We're basically ready" is the assessment. Someone says "let's just do it Friday afternoon, it'll be fine." The launch happens. A problem surfaces Friday at 6pm. The best engineers are all offline by 7pm. The problem festers over the weekend. By Monday, the damage has run for two days instead of two hours.

Fix: this is a policy. No significant launches on Fridays or before holidays. Non-negotiable. The pressure to ship is always present; the policy is what prevents it from overriding judgment. If a launch slips past Thursday, it goes to the following Tuesday.

5. Launch and Abandon

The launch succeeds. The war room disbands. The team that built the feature disperses — some go on vacation, some move to the next project, some simply stop paying attention. The feature is in production but nobody is watching it. A slow failure starts developing — not an alarm-triggering spike, but a gradual drift in the wrong direction. It's discovered three weeks later by a user, not by engineering.

Fix: define the stabilization period before the launch. Assign an owner for that period. The owner is responsible for reviewing the slow signals every day until the period ends and the launch is declared stable. The end of the stabilization period is a declaration, not an assumption.

The Principle in One Sentence

Every launch decision you make under pressure in the war room was actually made before the launch — when you wrote the rollback procedure, when you tested the alerts, when you ran the go/no-go meeting. By launch day, you are not making decisions; you are executing decisions already made.

The Most Common Mistake

Treating the launch checklist as something you create the week before launch, when you are already behind. The checklist must be built throughout the project — items added as you discover what "done" actually means for each component. A checklist written in a hurry the week before launch is a checklist full of things that look checked but aren't.

Three Questions for Your Next Launch Review
  • If something goes wrong 6 hours after the launch, after the war room has disbanded, what is the on-call engineer's exact playbook?
  • What is the slowest failure mode — the one that takes 48 hours to become visible — and do we have a metric that will surface it?
  • If we have to roll back at 2am, what are the exact steps, and has the person executing those steps ever done them before?

Key Takeaways from Chapter 30

  • 01 Launch is a state machine, not a date. Every state has entry conditions. You cannot move to the next phase until those conditions are satisfied — regardless of what the calendar says.
  • 02 Before writing the plan, answer three questions: what does failure look like and how fast will you know, what is the blast radius, and can you actually roll back — and how long does it take?
  • 03 Choose your rollout strategy based on blast radius and reversibility. Feature flags are the most flexible. Dark launches are the safest for high-stakes changes. Big bang is only appropriate when the change is atomic and rollback is fast.
  • 04 The war room has a single decision-maker. The go/no-go call is explicit and formal. Abort criteria are defined before the rollout begins, not during it.
  • 05 Rollback is only real if it has been tested, timed, and written as a step-by-step procedure that someone can execute under pressure at 2am without improvising.
  • 06 The stabilization period is as important as the launch window. Some failures are slow. Keep the rollback capability alive until you have explicitly declared the launch stable after reviewing the slow signals.
  • 07 Never launch on a Friday. This is a policy, not a preference. The cost of enforcing it is one day of delay. The cost of violating it is a ruined weekend for your entire team and a bad two days for your users.