Part VII — Advanced Execution Patterns

Chapter 26

The Big Refactor / Migration

Migrations are the most misunderstood category of engineering project. The code isn't the hard part. The hard part is moving a live system from one state to another while it keeps running — without anyone noticing.

What This Chapter Covers

Most migrations fail not because engineers wrote bad code, but because nobody had a clear strategy for the transition. This chapter gives you that strategy.

  • Why migrations are fundamentally harder than new features
  • The strangler fig and when to use it
  • Expand-contract, branch by abstraction, parallel run
  • The four phases of dual-write and every trap inside them
  • How to migrate data without downtime
  • Making invisible progress visible to stakeholders
  • How to build a rollback strategy that actually works
  • The cutover and the cleanup nobody finishes

Why Migrations Are Different From Every Other Project

Imagine you are asked to replace the engine in a car while the car is doing 70 miles per hour down a highway. The passengers can't feel any change. The car can't slow down. And if anything goes wrong, you need to be able to put the original engine back — instantly.

That is a migration.

A new feature project starts from nothing. You write code, test it, ship it. If it breaks, you roll it back. Easy. A migration project starts from something that already exists, is already in production, is already being used by real people, and cannot simply be turned off while you work. You have to move the system from its current state to a new state without the system ever stopping.

This is why migrations feel so much harder than building new things. It isn't your imagination. They are harder. Not because the technical work is more complex, but because you are doing the technical work while simultaneously managing a live system that you cannot break.

The category of "migration" is wide. It includes:

  • Moving from one database to another (MySQL to Postgres, Postgres to Cassandra)
  • Replacing a monolith with microservices, or microservices with a modular monolith
  • Rewriting a service in a new language or new framework
  • Moving from one API version to another while both live in production
  • Changing a core data model — renaming columns, splitting tables, changing data types
  • Moving from one infrastructure platform to another (on-prem to cloud, one cloud to another)
  • Replacing a third-party dependency (a payment provider, an auth system, a search engine)

These all look different on the surface. Under the surface, they all have the same structure: there is an old thing and a new thing, and you need to move from the old thing to the new thing, one careful step at a time, while everything keeps working.

The Invisible Problem

Here is something that will trip you up if you're not ready for it: for most of a migration's lifespan, the project looks like it isn't making any progress. Users see the same UI. The business sees the same metrics. Other engineers see the same behavior. From the outside, nothing has changed. But internally, you and your team have been working for weeks or months.

This invisibility creates two specific dangers.

The first danger is that leadership loses confidence. If you can't show progress, stakeholders start to wonder whether the project is real. They ask questions like "when are we going to see results?" and "is this even working?" You can lose support and resources before the work is done.

The second danger is that the team loses morale. Engineers like to ship things. They like to see their work in production. A migration that lives in a shadow state for a long time — technically in production but not "done" — is demoralizing. Engineers start to feel like they're running on a treadmill.

Both of these dangers are manageable if you anticipate them. We'll come back to how in the section on communicating progress. For now, just recognize that invisibility is not a side effect of migration work — it is a core property of it. Plan for it from day one.

Real Pattern

The "Two Years and Nothing to Show" Migration

A team spends 18 months migrating from a monolith to microservices. For the first 15 months, users see nothing different. Internal progress is real — services are being carved out, contracts are being defined, traffic is being tested. But because the team never built a way to show that progress externally, leadership cancels the project at month 18. The new VP looks at the cost, sees no user-visible output, and decides to stop. The engineers who did the work leave the company. The monolith survives another five years.

The project didn't fail because the approach was wrong. It failed because nobody managed the story of the project alongside the technical work.

Why Migrations Actually Fail

Most migrations are killed not by technical failure but by one of these five forces. Knowing which one is acting on your project is the first step to surviving it.

Force 1: Scope That Keeps Growing

You start with "let's migrate the user service." Then you discover the user service has undocumented dependencies on the billing service. Then you realize the billing service uses a column in the user table that you were going to drop. Now you're migrating two services. Then you find another dependency. This is called dependency drag, and it is the most common reason migrations fail. The scope was never as small as it looked.

Force 2: The Old System Doesn't Stop Changing

While you are building the new system, the old system keeps getting new features. Product needs new functionality. Engineers make small improvements. Bug fixes go in. Every change to the old system is a change you might also need to make to the new system — or a change that makes the old and new systems diverge further. By the time you're ready to cut over, the new system is already out of date.

Force 3: No Clear Definition of Done

You said you were going to "migrate to the new database." But what does that mean exactly? Does it mean 100% of traffic goes to the new database? 99%? Does it mean the old database is fully decommissioned? Does it mean all the application code that talked to the old database is deleted? These are very different endpoints, and if you don't pick one at the start, people will pick different ones as the project goes on. This causes endless confusion about whether the migration is "done."

Force 4: Parallel Running Gets Permanent

Many migrations have a phase where old and new run side by side. This is intentional and temporary. But "temporary" has a way of becoming permanent. The team finishes the migration work, declares the project done, and moves on to other things — while the old system quietly keeps running, consuming resources, confusing new engineers, and accruing maintenance cost. Years later, people say things like "we finished that migration, right?" and nobody is sure.

Force 5: Rollback Becomes Impossible

As a migration progresses, the cost of rolling back increases. Early in a migration, you can stop at any point and nothing bad happens. But after months of work, the old system may have degraded — maybe nobody has been maintaining it, maybe dependent systems have been updated to expect the new behavior. By the time a serious problem is discovered, going back to the old system isn't possible anymore. This is terrifying when something goes wrong, and it happens more often than people plan for.

The Strangler Fig Pattern

The strangler fig is a tree that starts its life as a vine wrapping around another tree. Over years, the vine grows larger and the original tree slowly shrinks inside it. Eventually the original tree is completely replaced — from the outside, you can't even tell the old tree was there.

Martin Fowler named this software migration pattern after that tree in 2004, and it remains the single most useful mental model for migrating a live system.

The core idea is this: you never attack the old system directly. You build the new system around it, gradually move functionality over piece by piece, and only decommission parts of the old system after the new system has fully taken over that piece.

"Never rewrite the old system from scratch. Grow the new system around it. Replace it piece by piece until the old system disappears."

How It Works in Practice

Let's make this concrete. Say you have a monolithic e-commerce application — one big codebase that handles product catalog, user accounts, cart, checkout, orders, and notifications. You want to move to a service-based architecture. Here is how the strangler fig applies:

Step 1 — Put a facade in front of the monolith. This is a proxy, a gateway, or an API layer that sits between the outside world and your monolith. At the start, it does nothing but pass requests through to the monolith unchanged. But now you have a seam. You can intercept traffic.

Initial State — Facade Installed
Client → Facade (pass-through) → Monolith (handles everything)

Step 2 — Build the first new service. Pick the smallest, most isolated piece of the monolith. "Smallest" and "most isolated" are key. You don't start with the hardest thing. You start with the thing you can finish without touching everything else. Notifications is often a good first candidate — it's usually loosely coupled and has clear input/output behavior.

Step 3 — Route one endpoint to the new service. Update the facade so that a specific route — say, /notifications — goes to your new service instead of the monolith. Everything else still goes to the monolith. Users notice nothing.

After First Service Extracted
Client → Facade → /notifications → Notifications Service (new)
Client → Facade → /everything-else → Monolith (shrinking)
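
To make the routing concrete, here is a minimal sketch of the facade's routing rule, written in Python purely for illustration. The route table, backend URLs, and function name are hypothetical — the real facade might be an API gateway, a reverse proxy config, or a service mesh rule — but the shape is the same: match a prefix, send it to the new service, fall through to the monolith for everything else.

  # Illustrative sketch of the facade's routing decision (hypothetical URLs and names).
  ROUTES = {
      "/notifications": "http://notifications-service.internal",  # first extracted service
  }
  MONOLITH = "http://monolith.internal"  # default backend for everything not yet extracted

  def pick_backend(path: str) -> str:
      """Return the backend that should handle this request path."""
      for prefix, backend in ROUTES.items():
          if path.startswith(prefix):
              return backend
      return MONOLITH  # unchanged behavior for all unmigrated traffic

  print(pick_backend("/notifications/123"))  # routed to the new service
  print(pick_backend("/checkout"))           # still handled by the monolith

Each later extraction is one more entry in the route table — which is also why rollback stays cheap: removing the entry sends that traffic straight back to the monolith.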

Step 4 — Repeat. Each new service you extract makes the monolith smaller. The facade's routing table grows. Over time, the monolith handles less and less. You keep going until the monolith handles nothing.

Step 5 — Delete the monolith. Once all traffic routes to new services, the monolith is just sitting there doing nothing. You delete it. The strangler fig is complete.

The Three Most Common Strangler Mistakes

Mistake 1: Not putting in the facade first. Some teams try to do the strangler fig without a facade. They figure they can just update callers one at a time. This doesn't work at scale. You have no control point. You can't gradually shift traffic. You can't roll back quickly. The facade is not optional — it is the mechanism that makes the whole pattern work.

Mistake 2: Starting with the hardest service. Engineers love to tackle the interesting problem first. "Let's start with checkout because that's the most critical." Checkout is almost always the most deeply coupled part of any e-commerce monolith. It touches everything. Starting there means you won't have a working new service for a very long time. You won't be able to show progress. Start with something small that you can finish in weeks, not months. The confidence that comes from successfully extracting the first service is worth more than the complexity you temporarily avoid.

Mistake 3: Leaving the monolith running after extraction. Once a piece of the monolith is fully replaced, delete the old code from the monolith. Every week you don't, someone writes new code that depends on the old path. The monolith grows back like a weed. "We'll clean it up later" are the five most expensive words in software engineering.

Three Other Patterns You Need to Know

The strangler fig is the most famous migration pattern, but it doesn't fit every situation. Here are three others and when to reach for them.

Expand-Contract (for Schema and API Changes)

This pattern solves one of the most common migration problems: you need to change something that other code depends on, and you can't update all the dependents at the same time.

The classic example: you have a database column called full_name and you want to split it into first_name and last_name. But you have 40 services that read from full_name. You can't update all 40 services at once. And you can't drop full_name until all 40 are updated.

Expand-contract solves this in three steps:

Phase 1

Expand — Add Without Removing

Add the new columns (first_name, last_name) alongside the old column (full_name). Write code that writes to all three columns on every write. Read operations still use the old column. At this point, the system is backward compatible. Nothing breaks. Old services still work exactly as they did.
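
As a concrete illustration, here is a minimal sketch of the expand phase using Python's built-in sqlite3; the users table follows the example above, and the save_name helper is hypothetical. The only point is that the write path populates all three columns while every existing reader is left untouched.

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")

  # Expand: add the new columns alongside the old one. Nothing is removed yet.
  conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
  conn.execute("ALTER TABLE users ADD COLUMN last_name TEXT")

  def save_name(user_id: int, first: str, last: str) -> None:
      """Write all three columns so old readers and new readers both keep working."""
      conn.execute(
          "INSERT INTO users (id, full_name, first_name, last_name) VALUES (?, ?, ?, ?)",
          (user_id, f"{first} {last}", first, last),
      )

  save_name(1, "Ada", "Lovelace")
  print(conn.execute("SELECT full_name, first_name, last_name FROM users").fetchone())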

Phase 2

Migrate — Move All Readers to the New Columns

One by one, update the 40 services to read from first_name and last_name. This can take weeks. You can deploy services independently. The old column is still being written to, so old services that haven't migrated yet continue to work. There is no deadline pressure — old and new coexist.

Phase 3

Contract — Remove the Old

Once all services are using the new columns and nobody reads full_name anymore, drop the old column. Stop writing to it. The migration is complete. The system is smaller and cleaner than when you started.

This pattern works equally well for API changes. Add the new field, migrate all callers, remove the old field. The rule is always the same: expand first, contract last, never both at the same time.

Branch by Abstraction (for Internal Code Changes)

Sometimes you need to replace a piece of code that is called from many places inside the same codebase — a logging library, an HTTP client, a data access layer. You can't replace all the call sites at once. And you don't want to maintain a separate branch in version control because branches diverge and create merge nightmares.

Branch by abstraction keeps everything in one branch of your code:

  1. Create an abstraction. Write an interface or abstract class that defines the contract of the thing you're replacing. Make all existing callers use this interface instead of the concrete implementation directly.
  2. Add the new implementation behind the interface. Both the old and new implementations now satisfy the same interface. Neither is visible to callers.
  3. Gradually switch callers to the new implementation. You can switch one caller at a time, deploy after each switch, test, and roll back if needed. The old implementation remains available as a fallback.
  4. Delete the old implementation. When all callers use the new implementation, delete the old one and, if desired, delete the abstraction layer too.

The beauty of this pattern is that you never have a broken state in your main branch. The old implementation is always there as a safety net until you deliberately remove it.
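
Here is a minimal Python sketch of the pattern, assuming the thing being replaced is an HTTP client; the class names and the use_new toggle are illustrative, not prescribed.

  from abc import ABC, abstractmethod

  class HttpClient(ABC):
      """Step 1: the abstraction that every caller is moved onto first."""
      @abstractmethod
      def get(self, url: str) -> str: ...

  class LegacyHttpClient(HttpClient):
      """The old implementation -- stays available as a fallback until deliberately deleted."""
      def get(self, url: str) -> str:
          return f"legacy response from {url}"

  class NewHttpClient(HttpClient):
      """Step 2: the new implementation, added behind the same interface."""
      def get(self, url: str) -> str:
          return f"new response from {url}"

  def make_client(use_new: bool) -> HttpClient:
      """Step 3: callers switch one at a time by flipping this choice, deploying, verifying."""
      return NewHttpClient() if use_new else LegacyHttpClient()

  client = make_client(use_new=False)  # this particular caller has not switched yet
  print(client.get("https://example.com"))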

The Parallel Run (for Behavior Verification)

The parallel run is a technique, not a full migration pattern. You run the old system and the new system simultaneously in production, compare their outputs, and only cut over once you've confirmed the new system produces correct results.

This is how Google rewrites critical systems. It's how payment processors replace fraud detection models. Any time correctness is more important than cost, the parallel run is how you build confidence before cutting over.

Parallel Run Architecture
Incoming Request
    ↓            ↓
Old System   New System
(response used) (response checked)
    ↓            ↓
    Comparison & Logging
           ↓
User sees old system's response until confidence is established
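
A minimal sketch of the comparison step, in Python, assuming both systems can be invoked as functions and that mismatches are simply logged; the function names and the toy calculation are placeholders for the real systems.

  import logging

  logging.basicConfig(level=logging.INFO)
  log = logging.getLogger("parallel-run")

  def old_system(request: dict) -> dict:
      return {"total": request["amount"] * 1.2}            # stand-in for the existing system

  def new_system(request: dict) -> dict:
      return {"total": round(request["amount"] * 1.2, 2)}  # stand-in for the rewrite

  def handle(request: dict) -> dict:
      """Serve the old system's answer; run the new system on the side and log any diff."""
      old_result = old_system(request)
      try:
          new_result = new_system(request)
          if new_result != old_result:
              log.warning("mismatch for %s: old=%s new=%s", request, old_result, new_result)
      except Exception:
          log.exception("new system failed for %s", request)  # never affects the user
      return old_result  # the user only ever sees the old system's response

  print(handle({"amount": 10}))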

The parallel run has real costs. You're doing every operation twice. Compute costs double. You need infrastructure to do the comparison and store the diffs. And you need someone to actually look at the diffs and fix the discrepancies.

Despite the cost, when you absolutely cannot afford a wrong result in production, the parallel run is the only honest way to build confidence. "We tested it in staging" is not the same as "we ran it in production against real data and the results matched."

The Dual-Write Era

Almost every database migration goes through a period where the application writes to both the old and new databases simultaneously. This period is called the dual-write era, and it is where most database migrations fall apart.

Let's say you're moving from MySQL to Cassandra. You can't just point your application at Cassandra and hope for the best. You need to:

  • Get all existing data from MySQL into Cassandra (the initial data copy)
  • Keep Cassandra up to date with writes that happen during and after the copy
  • Verify that Cassandra's data is correct before trusting it for reads
  • Gradually shift reads from MySQL to Cassandra
  • Only stop writing to MySQL once you're fully confident in Cassandra
  • Only decommission MySQL once you've proven you don't need it

This sounds straightforward. It isn't. Here are the four phases, and what actually happens in each.

The Four Phases of a Database Migration

Phase 1

Write to Old Only, Copy Data to New in Background

Weeks 1–N

Your application is still 100% on the old database. In the background, you run a migration job that copies data from old to new. This job must be written carefully — it should be resumable (if it crashes, it can pick up where it left off), it should be rate-limited (it should not overload production), and it should be idempotent (running it twice should produce the same result as running it once).

The hard question in this phase: how do you handle writes that happen while the copy is running? If a row is being copied and someone updates it, which version does the new database get? You need a way to handle this — either by using a change data capture (CDC) stream that captures every write to the old database and replays it to the new one, or by doing the copy in a specific order that ensures you catch up to live writes before you cut over.
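
Here is a minimal sketch of such a copy job, using sqlite3 purely as a stand-in for the real databases; the table layout, file names, and checkpoint table are assumptions. It shows the three properties in miniature: resumable (a checkpoint records the last copied primary key), rate-limited (a pause between batches), and idempotent (re-copying a row overwrites it with the same value).

  import sqlite3
  import time

  BATCH = 1000          # rows per batch -- small batches keep transactions and locks short
  PAUSE_SECONDS = 0.2   # crude rate limit so the copy never overloads production

  old_db = sqlite3.connect("old.db")  # stands in for MySQL
  new_db = sqlite3.connect("new.db")  # stands in for Cassandra
  old_db.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, full_name TEXT)")
  new_db.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, full_name TEXT)")
  new_db.execute("CREATE TABLE IF NOT EXISTS copy_checkpoint (last_id INTEGER)")

  def last_checkpoint() -> int:
      row = new_db.execute("SELECT last_id FROM copy_checkpoint").fetchone()
      return row[0] if row else 0

  def run_copy() -> None:
      cursor = last_checkpoint()  # resumable: a restart picks up where the last run stopped
      while True:
          rows = old_db.execute(
              "SELECT id, full_name FROM users WHERE id > ? ORDER BY id LIMIT ?",
              (cursor, BATCH),
          ).fetchall()
          if not rows:
              break  # caught up; live writes are handled by CDC or the dual-write phase
          # Idempotent: INSERT OR REPLACE means re-copying a row is harmless.
          new_db.executemany("INSERT OR REPLACE INTO users (id, full_name) VALUES (?, ?)", rows)
          cursor = rows[-1][0]
          new_db.execute("DELETE FROM copy_checkpoint")
          new_db.execute("INSERT INTO copy_checkpoint (last_id) VALUES (?)", (cursor,))
          new_db.commit()
          time.sleep(PAUSE_SECONDS)  # rate limit between batches

  run_copy()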

Phase 2

Write to Both, Read from Old

The Transition Phase

Now you update your application to write to both databases on every write operation. Reads still come from the old database only — users are unaffected. But the new database is now getting live writes. This is how you close the gap between what was copied and what is happening now.

This phase also lets you start verifying the new database. You can run background jobs that sample rows from both databases and compare them. You can run specific queries against both and check that results match. You're building confidence without taking any risk.
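
A minimal sketch of such a sampling job, continuing the old.db/new.db stand-ins from the copy sketch above; the table and column names are illustrative. It reads the same rows from both databases, off the serving path, and counts mismatches.

  import random
  import sqlite3

  old_db = sqlite3.connect("old.db")
  new_db = sqlite3.connect("new.db")

  def spot_check(sample_size: int = 100) -> int:
      """Compare a random sample of rows between old and new; return the mismatch count."""
      ids = [r[0] for r in old_db.execute("SELECT id FROM users").fetchall()]
      mismatches = 0
      for row_id in random.sample(ids, min(sample_size, len(ids))):
          old_row = old_db.execute("SELECT full_name FROM users WHERE id = ?", (row_id,)).fetchone()
          new_row = new_db.execute("SELECT full_name FROM users WHERE id = ?", (row_id,)).fetchone()
          if old_row != new_row:
              mismatches += 1
      return mismatches

  print("mismatched rows in sample:", spot_check())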

Phase 3

Write to Both, Read from New (for Some Traffic)

The Confidence Phase

You start shifting read traffic to the new database. Not all at once — maybe 1% first, then 5%, then 25%, then 50%, then 100%. At each percentage, you watch for errors, latency increases, and data discrepancies. You give each level of traffic time to prove itself before increasing further. If anything looks wrong, you dial reads back to the old database immediately.

This is the most important phase. The slowness of this phase is what prevents disasters. Rushing through it is how migrations end careers.
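
A minimal sketch of the read-routing switch, in Python; the flag and the two read helpers are placeholders. In practice the percentage would come from a live configuration or feature-flag system so it can be dialed up — or straight back to zero — without a deploy.

  import random

  READ_FROM_NEW_PERCENT = 5   # dialed up gradually: 1 -> 5 -> 25 -> 50 -> 100,
                              # and dialed back to 0 the moment anything looks wrong

  def read_from_old_db(user_id: int) -> dict:
      return {"id": user_id, "source": "old"}   # placeholder for the real query

  def read_from_new_db(user_id: int) -> dict:
      return {"id": user_id, "source": "new"}   # placeholder for the real query

  def read_user(user_id: int) -> dict:
      """Send a small, configurable slice of read traffic to the new database."""
      if random.uniform(0, 100) < READ_FROM_NEW_PERCENT:
          return read_from_new_db(user_id)
      return read_from_old_db(user_id)

  print(read_user(42))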

Phase 4

Write to New Only, Read from New

The Decommission Phase

100% of reads come from the new database. You stop writing to the old database. But you don't delete it yet. You keep it running in a read-only state for some period — typically the length of your longest backup retention window, or whatever duration makes you comfortable that you won't need to roll back. Then you delete it.

The Four Traps Inside Dual-Write

Dual-write sounds clean in theory. In practice, it has four traps that catch engineers who haven't done it before.

Trap 1: Partial writes. When your application writes to two databases, those writes are not atomic. If the write to the old database succeeds and the write to the new database fails, the databases are now inconsistent. You need a strategy for this: do you retry? Log the failure and reconcile later? Fail the entire request? The answer depends on your system's requirements, but you need to decide in advance, not in the middle of an incident.

Trap 2: Write order non-determinism. If two writes happen close together in time, they might arrive at the old database in a different order than they arrive at the new database — especially if you're using asynchronous replication. This can leave the new database in a state that the old database never had. For databases with strong ordering requirements, this is fatal.

Trap 3: Read-then-write race conditions. Some operations read a value, modify it, and write it back. If the read comes from the old database and the write goes to both databases, but the new database already had a more recent version of that value (maybe from a CDC stream), you'll overwrite the newer value with an older one. This kind of bug is extremely subtle and hard to detect.

Trap 4: Forgetting to remove the dual-write code. After the migration is done, the code that writes to both databases should be deleted. But it often isn't, because the team has moved on, the migration is "done," and nobody wants to touch working code. Six months later, you're paying to maintain a decommissioned database because your application is still writing to it.

Critical Rule

Never Write to Two Databases from Two Code Paths

The dual-write code should be in exactly one place. A single function, a single layer, a single service. If you scatter the dual-write logic across your codebase, you will miss some call sites. Some writes will only go to one database. Your databases will diverge. You won't notice until phase 3 when reads from the new database start returning wrong results.
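
A minimal sketch of what "exactly one place" might look like, in Python; the function names and the reconciliation queue are hypothetical. The important part is that this is the only function in the codebase that knows two databases exist, and that the partial-write decision — log and reconcile, don't fail the request — was made here, once, in advance.

  import logging

  log = logging.getLogger("dual-write")

  def write_to_old_db(order: dict) -> None: ...            # placeholder for the real persistence call
  def write_to_new_db(order: dict) -> None: ...            # placeholder for the real persistence call
  def record_for_reconciliation(order: dict) -> None: ...  # placeholder: queue the row for later repair

  def save_order(order: dict) -> None:
      """The single code path through which every order write passes during the migration."""
      write_to_old_db(order)         # the old database is still the source of truth
      try:
          write_to_new_db(order)     # best effort: the old write has already succeeded
      except Exception:
          # Partial write -- decided in advance: log it and reconcile later, don't fail the request.
          log.exception("new-db write failed for order %s; queued for reconciliation", order.get("id"))
          record_for_reconciliation(order)

  save_order({"id": 1, "total": 42})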

Data Migration: The Part People Underestimate

Every migration that involves a data model change requires a data migration — a process that transforms existing stored data from the old format to the new format. People consistently underestimate how hard this is. Here is why.

The data is dirty. Production databases contain data that was written over years by code that may no longer exist, engineers who may have left, and requirements that may have changed. There are null values where you don't expect them, invalid formats, duplicates, and inconsistencies that nobody knows about because no code currently tries to use that data. The moment you try to transform that data, you discover all of it.

The table is large. You might be transforming a billion rows. The transformation can't happen in a single database transaction — it has to run in batches, over days or weeks, while production is running. You need to build a migration job that is safe, resumable, and observable.

The transformation has edge cases. You write the transformation code, you test it on 1% of rows, and it works. Then you run it on the other 99% and it fails on some case you didn't see in your sample. Now you have a partially migrated dataset and a transformation job that crashed at row 847,234,921. This is normal. What's not normal is having no plan for it.

Here is how to run a data migration without catastrophe (a sketch of such a job follows the list):

  1. Add a migration status column to the table. This column can have values like not_migrated, in_progress, migrated, and failed. Your migration job sets this column as it processes each row. Now you can always tell exactly how far you are, which rows failed, and resume from where you left off.
  2. Process in small batches. 1,000 rows at a time is a reasonable default. Small batches mean small transactions. Small transactions mean small locks. Small locks mean production traffic is not significantly impacted.
  3. Log everything that fails. When a row fails transformation, log it to a separate table with the error and the row's ID. Don't crash the job. Handle the failure, log it, and move on. After the job completes, you have a list of every row that needs manual review.
  4. Run a verification pass after the migration. After all rows are processed, run a separate verification job that reads the migrated data and confirms it looks correct. This is separate from the migration job. Its only job is to check, not to fix. If it finds problems, you fix them separately and run verification again.
  5. Keep the migration status column forever. Don't drop it after the migration. It's cheap to store and valuable for debugging anything that goes wrong in the future.
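
Here is a minimal sketch of a job built around those steps, using sqlite3 and a full_name split purely for illustration; the table, the status values, and the failures table are assumptions. The shape — claim a batch, transform each row, record success or failure per row, pause, repeat — is the part that transfers.

  import sqlite3
  import time

  BATCH = 1000
  db = sqlite3.connect("app.db")  # stands in for the production database
  db.executescript("""
      CREATE TABLE IF NOT EXISTS users (
          id INTEGER PRIMARY KEY,
          full_name TEXT,
          first_name TEXT,
          last_name TEXT,
          migration_status TEXT DEFAULT 'not_migrated'   -- step 1: the status column
      );
      CREATE TABLE IF NOT EXISTS migration_failures (row_id INTEGER, error TEXT);
  """)

  def transform(row: tuple) -> tuple:
      """The actual transformation -- here, splitting full_name. Raises on dirty data."""
      row_id, full_name = row
      first, last = full_name.strip().split(" ", 1)   # single-word names raise: logged, not fatal
      return row_id, first, last

  def migrate_batch() -> int:
      rows = db.execute(
          "SELECT id, full_name FROM users WHERE migration_status = 'not_migrated' LIMIT ?",
          (BATCH,),
      ).fetchall()
      for row in rows:
          try:
              row_id, first, last = transform(row)
              db.execute(
                  "UPDATE users SET first_name = ?, last_name = ?, migration_status = 'migrated' "
                  "WHERE id = ?",
                  (first, last, row_id),
              )
          except Exception as exc:
              # Step 3: log the failure and keep going -- never crash the whole job.
              db.execute("UPDATE users SET migration_status = 'failed' WHERE id = ?", (row[0],))
              db.execute("INSERT INTO migration_failures (row_id, error) VALUES (?, ?)", (row[0], str(exc)))
      db.commit()
      return len(rows)

  while migrate_batch() == BATCH:
      time.sleep(0.1)   # step 2: small batches with a pause so production traffic is not impacted

  # Step 4's verification pass would be a separate, read-only job run after this one completes.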

Communicating Progress on Invisible Work

This is the section that separates engineers who survive large migrations from those who get their projects cancelled mid-flight.

A migration lives in a peculiar state: it is enormously valuable to the business (you're reducing technical debt, improving reliability, enabling future features) but that value is invisible until the migration is done. Users don't see it. Metrics don't show it. The CEO can't demo it. Leadership has to take it on faith — and faith without evidence runs out after three or four quarters.

Your job is to make the invisible visible. Not through spin or theater, but through real metrics that honestly represent where you are.

Metrics That Show Real Progress

For every migration, there is a natural unit of "how much is done." Find that unit and track it weekly, publicly.

Migration Type             | The Natural Progress Unit                        | What to Track
Monolith → Services        | % of endpoints routed to new services            | Endpoint count, traffic % by service
Database migration         | % of rows migrated, % of read traffic on new DB  | Row count, traffic %, error rate delta
API version migration      | % of callers on new API version                  | Caller count by version, deprecated call volume
Language/framework rewrite | % of code paths covered in new implementation    | Test coverage, % of routes handled
Data model change          | % of rows in new schema                          | Migrated row count, services using new schema
Cloud migration            | % of workloads on new infrastructure             | Service count, cost by platform

Once you have your unit, put the number on a dashboard. Update it every week. Include it in every status update. Make it so that anyone in the company can look up exactly how done the migration is at any moment, without asking you.

The dashboard serves a secondary purpose: it creates mild social pressure on teams that own pieces you're waiting on. When the dashboard says "the checkout team is the last one blocking 100% migration," the checkout team starts to feel that. This is not manipulation — it's accountability made visible.

Stakeholder Updates That Work

Your status updates for a migration should follow a specific structure. The goal of each update is to answer three questions that your stakeholders are always silently asking: Are we still making progress? Is anything threatening the timeline? What do you need from me?

Update Structure

The Migration Status Update Format

Progress number first: "We are at 67% complete (up from 58% last week)." Lead with the number. If there's no number, there's no update.

What moved: One sentence on what caused the progress. "The search service cutover completed, adding 9% traffic to the new database." This shows that progress is intentional and explained, not random.

What's next: The next specific milestone. "Next week we plan to cut over the notifications service, which will bring us to ~75%." This shows that you have a plan and can predict near-term progress.

Risks: One sentence if anything is a concern. "The payments team has a competing deadline next week — we may slip the payments cutover by one week." Surface risks early. Nobody is shocked by a risk that was called out three weeks in advance. Everyone is shocked by a miss that wasn't.

Ask: One clear ask if you need anything. "If we can get one extra engineer from Platform for two weeks, we can accelerate the remaining data migration by about three weeks." Make it specific and time-bounded.

Keep the update short. One page. If you can't say where you are in one page, you haven't thought about it hard enough.

One more thing: don't only update when things go well. The fastest way to lose credibility with leadership is to go silent when the migration hits a snag, then surface the problem at the last possible moment. Bad news delivered early is a problem you can solve together. Bad news delivered late is a crisis you absorb alone.

Building a Rollback Strategy That Actually Works

Here is the hard truth about rollback: the earlier in a migration you are, the easier rollback is. The later you are, the harder — and eventually impossible — it becomes.

At the very start of a migration, rollback is trivial. You haven't done anything yet. You just stop. At the very end, rollback might mean undoing months of work and accepting significant data loss. Between those two extremes, there are specific points where the rollback cost increases dramatically. These are called points of no return.

The single most important thing you can do for rollback planning is to explicitly identify these points before the migration starts, and decide in advance what you will do at each one.

The most common points of no return:

  • After you stop maintaining the old system. If you stop fixing bugs and adding features to the old system while the migration is in progress, rolling back means returning to a system that is now further behind than when you left it. This is often fine for a short migration. For a long one, it's a serious problem.
  • After you delete data from the old system. If your migration involves deleting rows, columns, or tables from the old system, that data may not be recoverable without backups. The moment data is deleted, rollback requires a restore — which is slow, imprecise, and may lose recent writes.
  • After you tell users the old thing is gone. If you've communicated externally that an old API endpoint or feature is retired, rolling back creates confusion and trust damage.
  • After dependent systems are updated. If 20 other services have been updated to only talk to the new system, rolling back means rolling back all 20 of those services too. This may not be practical.

For each point of no return, document:

  1. What would trigger a rollback at this point (what goes wrong)
  2. What rollback would look like technically (step by step)
  3. How long rollback would take
  4. What data loss, if any, would occur
  5. Who makes the decision to roll back

This documentation isn't just for the catastrophe scenario. It also changes how confidently you move through the migration. When you know exactly what rollback looks like at each step, you're not afraid to proceed. Confidence in your rollback plan is what lets you take the next step forward.

Critical Mistake

Don't Assume "We Can Always Roll Back"

The most dangerous sentence in a migration is "we can always roll back." It is usually said by someone who hasn't actually thought through what rollback would require at that specific point in the migration. At some point in every migration, the cost of rolling back exceeds the cost of pushing forward through a problem. If you haven't identified when that point is, you will discover it at the worst possible moment.

The Cutover

The cutover is the moment you flip the switch — when traffic moves from the old system to the new one for the last time, and there is no going back. After months or years of careful dual-running, this moment is both exciting and terrifying.

The way you avoid the terrifying part is by making the cutover as boring as possible. By the time you do the final cutover, it should feel like nothing. Because in a well-executed migration, the cutover isn't a special event — it's the last step in a long series of steps you've already done many times in smaller form.

Here is what makes a cutover boring (in the good way):

You've already partially cut over. If you've been running the strangler fig properly, by the time of the final cutover you're not moving 100% of traffic — you're moving the last 10%. The first 90% was already cut over in earlier, smaller increments. The final step is the least scary step.

You have a rollback mechanism ready. Before you flip the switch, you have a single configuration change that will flip it back. It's tested. It works. You've run it in staging. The only difference between the cutover and the rollback is which direction the flag points.

You have monitors running. You're watching error rate, latency p50/p95/p99, business metrics (orders placed, payments processed, whatever the system is responsible for), and any canary metrics specific to the migration. You have alerts set. You have humans watching dashboards. Nobody is at dinner or asleep.

You have a communication plan. You know exactly who to notify when the cutover starts, who to notify if something goes wrong, and who to notify when it's done. You have a pre-written post-mortem structure ready in case you need it. You have a pre-written announcement ready for when it succeeds.

You have a decision timeline. Before you start, you agree: if we see problem X within 30 minutes, we roll back. If everything looks good for 2 hours, we declare success. You don't make these decisions in the heat of the moment while alarms are going off.

Cutover Checklist
  • Rollback mechanism tested and ready
  • All critical monitors and alerts verified
  • Decision criteria pre-defined ("if X, we roll back")
  • Communication plan ready (start, problem, success)
  • Right people on standby (not just the team — ops, support, leadership)
  • Cutover is not on a Friday afternoon
  • Post-cutover monitoring period defined (not "declare victory and go home")

One rule that experienced engineers never violate: don't cut over on a Friday. If something goes wrong, you want the full team available for the full week to fix it. A Friday cutover that goes wrong means a weekend incident, exhausted engineers, and a Monday with a half-working migration and a demoralized team.

The Cleanup Nobody Does

The cutover is not the end of the migration. It is the beginning of the end. After you cut over, there is a cleanup phase that most teams skip, and skipping it is how migrations become permanent.

The cleanup phase includes:

Deleting the old code. All the code that implemented the old system, all the adapter code that made old and new work together, all the migration infrastructure. Every line you don't delete is a line someone will need to understand in six months. Delete it while you still remember what it did.

Decommissioning the old infrastructure. Stop the old servers. Drop the old database. Cancel the old cloud resources. This isn't just about cost — it's about clarity. As long as the old infrastructure exists, people will wonder if it's still in use. "Is the MySQL database still getting writes?" shouldn't be a question that requires investigation.

Updating documentation. Runbooks, architecture diagrams, onboarding docs, API documentation. Everything that referred to the old system needs to be updated or deleted. Documentation that describes a system that no longer exists is worse than no documentation — it actively misleads people.

Removing migration status tracking. The dashboards, the migration status columns in the database, the progress reports. The migration is done. These artifacts should go away so nobody confuses them with active work.

Running the retrospective. Every migration teaches you something. What went faster than expected? What went slower? What would you do differently? What patterns should be standard practice for the next migration? Capture this while it's fresh. In six months, most of it will be gone.

"The migration isn't done when traffic moves. The migration is done when there's nothing left to remind you the old system existed."

The reason cleanup gets skipped is that it is unrewarding. The exciting work is done. The team wants to move on to the next project. The pressure from leadership is gone because the migration "shipped." But the cost of skipping cleanup compounds. Old code confuses new engineers. Old infrastructure generates bills. Old documentation creates incidents. The five days of cleanup you skip today become fifty days of confusion over the next three years.

Schedule the cleanup sprint before the cutover. Put it on the roadmap. Assign owners. Treat it like a deliverable, not an afterthought. The migration isn't done until the cleanup is done.

The Central Principle of Migration Execution

Every migration has two parallel projects running simultaneously: the technical work of building and moving to the new system, and the organizational work of keeping the project alive long enough to finish it. Engineers who master migrations understand that the second project is just as important as the first — and they plan for both from day one.

Chapter Summary

What You Now Know

  • Migrations fail for five predictable reasons — scope drag, a changing old system, no definition of done, permanent parallel running, and impossible rollback
  • The strangler fig wraps a new system around the old one, replaces it piece by piece, and only deletes the old piece after the new piece is proven
  • Expand-contract solves schema and API changes: add first, migrate readers, then remove the old
  • Branch by abstraction keeps the full migration inside one codebase without long-lived branches
  • The parallel run is the only way to build real confidence before cutting over critical systems
  • Dual-write has four phases and four traps — all of them are predictable and avoidable
  • Data migration needs status columns, batch processing, failure logging, and a verification pass
  • Progress dashboards showing a clear percentage-done metric are the single best tool for keeping leadership support through a long migration
  • Rollback plans must be written before you start and updated at each point of no return
  • The cutover is made boring by doing many small cutovers before the final one
  • Cleanup is not optional — schedule it as a deliverable before the cutover

What to Do Next Week

  • If you have a migration in flight: add a progress percentage to your weekly status update this week, before anything else
  • Write down the three points of no return in your current migration and what rollback looks like at each one
  • If you're using dual-write: confirm that all dual-write code lives in one place and one place only
  • If you're at phase 4 of any migration: schedule the cleanup sprint now, before you've moved on mentally
  • For any migration not yet started: write a one-paragraph definition of "done" that you and your stakeholders can sign off on before you write the first line of migration code

The Principle in One Sentence

A migration is not a one-time event — it is a transition that must be safe to stop, safe to reverse, and visible to everyone watching, at every point along the way.

The Most Common Mistake

Treating the cutover as the finish line. The cutover is the halfway point. The cleanup — deleting old code, decommissioning old infrastructure, updating documentation, running the retrospective — is the other half. Teams that skip cleanup spend the next three years paying for it in confusion, cost, and on-call incidents.