Soft migrations get framed as the safer alternative to big-bang cutovers. Run both paths, move traffic gradually, compare outputs, then shut the old thing down. On paper, that looks like disciplined engineering. In production, the real work is not copying data or replaying requests. It is managing the moments when an entity is no longer fully old, not yet fully new, and still expected to behave correctly under retries, race conditions, partial failures, and human intervention. That is where migrations stop being operational and start becoming semantic. If you have ever watched a harmless dual-write plan turn into weeks of reconciliation scripts, feature flag sprawl, and incident reviews, you already know the pattern. The hard part is not moving state. The hard part is preserving meaning while state changes shape.
1. Dual writes create split-brain business logic
The first trap appears when teams treat dual writes as a transport problem instead of a correctness problem. Writing to the old store and the new store in the same request path feels like progress, but the minute those systems enforce different constraints, indexing behavior, or transactional boundaries, you now have two interpretations of truth. One path may accept a partial update that the other rejects. One may serialize fields differently. One may apply derived state synchronously, while the other defers it through an event pipeline.
This is why soft migrations often fail long before cutover. You are not just duplicating persistence, you are temporarily duplicating domain behavior. A payment platform migrating from Postgres-backed ledgers to an event-sourced journal can survive lag. It cannot survive two services disagreeing on whether a refund has reached a terminal state. Senior engineers usually solve this by narrowing the write authority, not by adding more retries. Choose one system to own transition logic, make the other a projection or sink, and define exactly which fields may diverge during the migration window.
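The idea of narrowed write authority can be made concrete. Here is a minimal sketch, using the refund example above, in which the legacy ledger owns every state transition and the new journal is a pure projection. All names here (RefundState, LedgerAuthority, JournalSink) are hypothetical illustrations, not a real API:

```python
from enum import Enum

class RefundState(Enum):
    REQUESTED = "requested"
    SETTLED = "settled"   # terminal
    FAILED = "failed"     # terminal

TERMINAL = {RefundState.SETTLED, RefundState.FAILED}
LEGAL = {RefundState.REQUESTED: {RefundState.SETTLED, RefundState.FAILED}}

class JournalSink:
    """Projection only: records what the authority decided, enforces nothing."""
    def __init__(self):
        self.events = []

    def project(self, refund_id, state):
        self.events.append((refund_id, state.value))

class LedgerAuthority:
    """The legacy ledger owns every transition during the migration window."""
    def __init__(self, sink):
        self.state = {}
        self.sink = sink  # the new journal mirrors decisions, it never makes them

    def transition(self, refund_id, new_state):
        current = self.state.get(refund_id, RefundState.REQUESTED)
        if current in TERMINAL:
            return current  # terminal states never move, even on replays
        if new_state not in LEGAL[current]:
            raise ValueError(f"illegal transition {current} -> {new_state}")
        self.state[refund_id] = new_state
        self.sink.project(refund_id, new_state)
        return new_state
```

The point of the shape is that only one class contains an `if` about legality. The sink can lag, retry, or buffer, but it can never disagree about whether a refund is terminal.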
2. Transitional states leak into places you did not model
Most migration plans document the start state and the desired end state. Very few model the messy middle with the same rigor. That middle is where entities are partially backfilled, reads are source-dependent, and background jobs operate on mixed generations of data. Once that happens, every consumer that touches the entity needs to understand transitional semantics, even if nobody intended that coupling.
You see it in mundane places. A cache warmer assumes all records have a canonical ID from the new system. A fraud rule still keys off a legacy status enum. A downstream analytics job silently drops records because the new schema allows nullability the old schema never exposed. None of these is an infrastructure failure. They are failures to define an explicit state machine for the migration itself.
The fix is unfashionably simple. Name the migration states as first-class domain states, not hidden implementation details. “Legacy,” “mirrored,” “new-write,” “new-read,” and “fully-cutover” may feel verbose, but they create a shared contract across services, jobs, dashboards, and support tooling. If you cannot describe what operations are legal in each state, your system does not actually support a soft migration.
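That contract can live in code rather than in a wiki. A minimal sketch, assuming the five states named above and a coarse operation vocabulary of old/new reads and writes:

```python
from enum import Enum

class MigrationState(Enum):
    LEGACY = "legacy"
    MIRRORED = "mirrored"
    NEW_WRITE = "new-write"
    NEW_READ = "new-read"
    FULLY_CUTOVER = "fully-cutover"

# The shared contract: which operations are legal in each migration state.
LEGAL_OPS = {
    MigrationState.LEGACY:        {"write_old", "read_old"},
    MigrationState.MIRRORED:      {"write_old", "write_new", "read_old"},
    MigrationState.NEW_WRITE:     {"write_new", "read_old", "read_new"},
    MigrationState.NEW_READ:      {"write_new", "read_new"},
    MigrationState.FULLY_CUTOVER: {"write_new", "read_new"},
}

def assert_legal(state: MigrationState, op: str) -> None:
    """Fail loudly when a code path performs an operation its state forbids."""
    if op not in LEGAL_OPS[state]:
        raise RuntimeError(f"{op} is not legal while the entity is {state.value}")
```

Calling `assert_legal` at every read and write boundary turns an implicit assumption into an enforced invariant: a job that writes to the legacy store after cutover now fails immediately instead of silently forking history.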
3. Read paths drift faster than write paths
Teams obsess over write correctness because corruption is visible and scary. In practice, read-path inconsistency causes just as much damage, and it is harder to detect. During a soft migration, different endpoints, batch jobs, internal tools, and support workflows often read from different sources depending on performance, rollout stage, or convenience. That means the same object can present different realities to different parts of the organization.
Consider a common pattern in a service extraction effort. The customer profile page reads from the new service behind a feature flag, but account recovery still reads from the monolith because the legacy path owns audit history. Customer support sees one state, the user sees another, and both are technically “correct” from their source of truth. This is how trust erodes, internally first and externally later.
A useful discipline here is read-path inventory. Before migration, enumerate every material consumer of the entity, including cron jobs, exports, ops dashboards, and manual admin tools. Then classify them:
- latency-sensitive reads
- consistency-sensitive reads
- historical or audit reads
- bulk or analytical reads
That exercise usually reveals that your migration is not one cutover, but four. It also forces a better decision about which read paths must switch together to preserve business meaning.
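An inventory like this is small enough to keep in code next to the migration plan. The following sketch, with hypothetical path names, groups read paths by the four classes above so each class can be cut over as a unit:

```python
from dataclasses import dataclass
from enum import Enum

class ReadClass(Enum):
    LATENCY = "latency-sensitive"
    CONSISTENCY = "consistency-sensitive"
    AUDIT = "historical/audit"
    BULK = "bulk/analytical"

@dataclass(frozen=True)
class ReadPath:
    name: str
    read_class: ReadClass
    source: str  # which system serves this path today

INVENTORY = [
    ReadPath("profile_page", ReadClass.LATENCY, "new-service"),
    ReadPath("account_recovery", ReadClass.AUDIT, "monolith"),
    ReadPath("support_console", ReadClass.CONSISTENCY, "monolith"),
    ReadPath("nightly_export", ReadClass.BULK, "monolith"),
]

def cutover_groups(inventory):
    """Each class of read path should switch sources together, not piecemeal."""
    groups = {}
    for path in inventory:
        groups.setdefault(path.read_class, []).append(path.name)
    return groups
```

The payoff is the grouping itself: the profile page and the support console serving different sources, as in the example above, becomes visible in a five-line report instead of a support escalation.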
4. Idempotency breaks when identity rules change
A soft migration often changes more than storage. It changes identifiers, sequencing rules, deduplication keys, or event boundaries. That is where previously safe retry behavior becomes dangerous. An operation that was idempotent in the legacy system may become duplicate-producing in the new one because the identity model shifted under it.
This shows up in order systems, billing flows, and provisioning pipelines. A legacy path might dedupe on (customer_id, external_ref), while the new service dedupes on a generated operation UUID. Replay the same message after a timeout, and both systems may accept it for different reasons. You do not notice until reconciliation starts finding double shipments, duplicated invoices, or orphaned compensating actions.
Kafka-based outbox migrations are especially vulnerable here. Teams correctly adopt the outbox pattern to prevent lost events, but they still treat the emitted event as the idempotency boundary rather than the business operation itself. The better pattern is to carry a stable, domain-scoped operation key across both old and new paths, then test retries against network failures, consumer restarts, and delayed replays. If your migration plan does not include deliberate duplicate injection, you are assuming correctness rather than proving it.
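A sketch of that pattern, under the assumption that the domain-scoped key is derived from the business operation's own identifiers rather than from any transport-level ID. The Consumer class and the key format are illustrative:

```python
import hashlib

def operation_key(customer_id: str, external_ref: str) -> str:
    """Stable key derived from the business operation, shared by old and new paths."""
    raw = f"refund:{customer_id}:{external_ref}"
    return hashlib.sha256(raw.encode()).hexdigest()

class Consumer:
    """Dedupes on the operation key, so replays are no-ops on either path."""
    def __init__(self):
        self.seen = set()
        self.effects = []

    def handle(self, message):
        key = operation_key(message["customer_id"], message["external_ref"])
        if key in self.seen:
            return  # duplicate delivery: the operation already happened
        self.seen.add(key)
        self.effects.append(message)
```

Deliberate duplicate injection then becomes a one-line test: deliver every message twice in the harness and assert that side effects happened exactly once. If that test cannot be written, the idempotency boundary has not actually been defined.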
5. Backfills and live traffic interfere in non-obvious ways
Backfills look boring compared with request-path engineering, but they are where many soft migrations go unstable. The system is now processing historic state reconstruction and live mutations at the same time. Without careful sequencing, the backfill can overwrite fresher data, reintroduce deleted attributes, or produce event storms that swamp downstream consumers.
A classic failure mode happens when teams use “last write wins” timestamps copied from the source system without considering clock skew, batching delay, or semantic ordering. A record updated at 10:03 in the live path arrives before a backfill chunk generated from a 9:58 snapshot, but the backfill writer stamps it later because it processed later. From the new system’s perspective, history just beat reality.
The practical answer is not “pause traffic,” because that defeats the purpose of a soft migration. You need merge rules that distinguish snapshot hydration from authoritative mutation. Many teams succeed with version vectors, per-field authority, or append-only mutation logs that let projections rebuild safely. This is why event sourcing and CDC pipelines are attractive during migrations. Not because they are trendy, but because they make sequencing explicit. Even then, you still need rate limits, replay isolation, and observability around staleness windows, otherwise a backfill simply becomes a distributed denial-of-service attack you launched against yourself.
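The per-field authority idea can be sketched in a few lines. The assumption here is that both live mutations and backfill chunks carry a version from the source system (a monotonic per-field counter or log offset), so writes are ordered by origin, not by when a writer happened to process them:

```python
from dataclasses import dataclass, field

LIVE, BACKFILL = "live", "backfill"

@dataclass
class Record:
    values: dict = field(default_factory=dict)
    versions: dict = field(default_factory=dict)  # per-field source version

def apply(record: Record, field_name: str, value, source: str, source_version: int):
    """Merge rule: snapshot hydration may fill gaps, but it can never
    overwrite a field that a newer authoritative mutation already set."""
    current = record.versions.get(field_name)
    if source == BACKFILL and current is not None and source_version <= current:
        return  # the snapshot predates a live write: history loses to reality
    record.values[field_name] = value
    record.versions[field_name] = source_version
```

With this rule, the 9:58 snapshot chunk arriving after the 10:03 live update is simply dropped for the fields the live path already owns, while still hydrating fields the live path never touched.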
6. Rollbacks are easy for code, hard for semantics
One dangerous myth about soft migrations is that they are inherently reversible. Traffic shifting is reversible. Feature flags are reversible. Semantics often are not. Once the new system has emitted side effects, changed downstream expectations, or accepted state the old system cannot represent, a “rollback” is no longer a return to safety. It is a second migration under incident pressure.
This matters most when the new system introduces richer workflows. Maybe the new entitlement service supports overlapping grants while the old one only supports a single active plan. Maybe the new identity provider allows linked accounts the old schema cannot encode. You can route reads back to the legacy system, but what exactly will it read? Which shape of truth wins? Who gets paged when support now has to explain why users lost a state that existed five minutes ago?
Experienced teams define rollback envelopes up front. Rollback may be safe only before new-only states are created. After that point, the mitigation strategy becomes forward-fix, quarantine, or shadow-disable, not true reversal. That distinction changes incident playbooks, approval gates, and executive expectations. It also makes the migration plan more honest, which is usually the difference between controlled degradation and chaos theater.
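A rollback envelope can be checked mechanically rather than argued about mid-incident. This sketch assumes each entity reports the set of states it currently holds, and that the team has enumerated which states the legacy schema cannot represent; the state names are hypothetical:

```python
# States the legacy system has no way to encode (from the examples above).
NEW_ONLY_STATES = {"overlapping_grants", "linked_accounts"}

def rollback_envelope(entities):
    """Partition entities by whether the legacy schema can still represent them.
    True rollback is only safe while the second list is empty; after that,
    the playbook is forward-fix, quarantine, or shadow-disable."""
    reversible, forward_fix_only = [], []
    for entity in entities:
        if entity["states"] & NEW_ONLY_STATES:
            forward_fix_only.append(entity["id"])
        else:
            reversible.append(entity["id"])
    return reversible, forward_fix_only
```

Running this as a gate before any traffic shift back makes the envelope an approval criterion instead of a hope.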
7. Observability measures transport, while correctness fails in meaning
Most migration dashboards are too shallow. They track replication lag, error rates, throughput, and maybe record-count parity. Those metrics matter, but they mostly tell you whether bytes are moving. They do not tell you whether the business meaning of state survives the move. That gap is why migrations can look green right up to the moment finance, support, or customers discover that key workflows no longer compose.
You need semantic observability. Compare not just rows, but invariants. Can an order move from authorized to captured without inventory reservation? Can an account be both suspended and eligible for renewal? Does every emitted “user deleted” event correspond to downstream tombstoning within the expected SLA? Those are state transition assertions, not infrastructure counters.
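Those assertions can run as code against both systems during the migration. A minimal sketch, expressing the two example invariants above as named predicates over a record (field names are illustrative):

```python
# Behavioral invariants: each is (name, predicate that must hold).
INVARIANTS = [
    ("captured order has an inventory reservation",
     lambda r: r.get("status") != "captured" or r.get("inventory_reserved", False)),
    ("suspended account is not renewal-eligible",
     lambda r: r.get("status") != "suspended" or not r.get("renewal_eligible", False)),
]

def violations(record: dict) -> list:
    """Return the names of every invariant this record breaks."""
    return [name for name, holds in INVARIANTS if not holds(record)]
```

Sampling records from the old and new stores through the same `violations` check, and alerting on any nonzero count, catches the class of failure that row-count parity dashboards can never see.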
A strong migration review usually includes three layers of verification:
- transport health
- structural parity
- behavioral invariants
That final layer is where mature teams separate themselves. Google’s SRE discipline popularized error budgets because availability alone is not the whole service contract. The same logic applies here. A migration that preserves uptime while violating state invariants is still failing. The best engineers instrument the state machine, not just the pipes around it.
Soft migrations are valuable because they expose risk gradually instead of all at once. But that only works when you model the migration as a series of controlled state transitions, not a background data move with better PR. If you make transitional semantics explicit, constrain write authority, and observe invariants rather than just infrastructure, you can keep the “soft” in soft migration from becoming a misleading label. The systems that survive these changes are rarely the simplest. They are the ones that treat the messy middle as the real design problem.
