
Security & Reliability

June 19, 2025

5 mins read

Building Trust During System Migration: How We Reached for 99.9% Data Accuracy

by Tariq Albajjali


Introduction

In this article, I share my experience contributing to the migration of a critical system from a live monolith to a new microservice, focusing on the challenges of ensuring 99.9% data accuracy before full cutover. I cover how we handled eventual consistency, validated data across two systems, and built trust through rigorous engineering practices.

Migrating critical systems isn’t just about launching new services.

It’s about migrating trust. Not long ago, I was part of a team working on a major system migration: moving important business logic from a live, production monolith to a brand-new microservice with its own Postgres database. The old monolith stayed live, serving 100% of public traffic.

The new system had to prove itself — from zero — before anyone could rely on it. Our mission was clear: Prove 99.9% data accuracy before any cutover.

When we started, our accuracy was effectively 0%. This is a look back at how we approached that challenge: the strategies, the technical lessons, and how real trust is built — slowly, carefully, and with metrics.


Setting the Stage: Monolith to Microservice

The migration involved:

• Old System: A mature monolith handling millions of transactions.
• New System: An event-driven microservice architecture backed by Postgres.
• Kafka: Streaming live updates and historical backfills to the new system.
• Grafana + Prometheus: Instrumentation and monitoring for everything.
• Traffic: Still fully routed to the monolith during migration.

This wasn’t a lift-and-shift. It was a rebuild — while the airplane was flying.
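
To make the setup concrete, here is a minimal sketch of what that ingestion path can look like: a Kafka consumer applying both live updates and backfilled history to the new service’s Postgres database. The topic, table, and column names are assumptions for illustration, not our actual schema.

```python
# Minimal sketch of the ingestion path: Kafka events upserted into Postgres.
# Topic, table, and column names are illustrative, not the real schema.
import json

import psycopg2
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders.events",                      # assumed topic carrying live updates and backfill
    bootstrap_servers=["localhost:9092"],
    group_id="orders-migration-consumer",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    enable_auto_commit=False,
)

conn = psycopg2.connect("dbname=orders_service user=app")

UPSERT_SQL = """
    INSERT INTO orders (id, status, amount, updated_at)
    VALUES (%(id)s, %(status)s, %(amount)s, %(updated_at)s)
    ON CONFLICT (id) DO UPDATE
    SET status = EXCLUDED.status,
        amount = EXCLUDED.amount,
        updated_at = EXCLUDED.updated_at
    WHERE orders.updated_at <= EXCLUDED.updated_at  -- skip stale, out-of-order events
"""

for message in consumer:
    event = message.value                 # assumed shape: {"id": ..., "status": ..., "amount": ..., "updated_at": ...}
    with conn.cursor() as cur:
        cur.execute(UPSERT_SQL, event)
    conn.commit()
    consumer.commit()                     # advance the Kafka offset only after the row is durable
```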


Why Accuracy Started at Zero

Syncing two complex systems asynchronously means you start from a simple engineering truth:

Nothing is accurate until you can prove it.

At the beginning:
• No full validation pipeline.
• No accuracy metrics.
• No live comparison in place.

Without verification, assuming correctness would have been reckless.

We treated data accuracy as seriously as uptime — building it into our operational SLOs.


How We Measured Data Accuracy

We immediately built the core metrics:

• records_processed_total
• records_mismatch_total
• Accuracy Rate = (processed - mismatches) / processed

These were exposed through Prometheus and visualized in Grafana dashboards:

• Accuracy rate trends.
• Mismatch breakdowns.
• Kafka consumer lags.
• Error rates across different domains.

We also split comparisons into:

• Fresh comparisons (new events, possibly still settling).
• Aged comparisons (data old enough to have converged).

This distinction kept our metrics honest and alerts meaningful.
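
As a rough sketch of those counters (assuming Python and the prometheus_client library; the label names and per-domain split are my own illustration), the instrumentation can be as small as this, with the accuracy rate derived from the two counters in Grafana:

```python
# Sketch of the core accuracy counters; label names are illustrative assumptions.
from prometheus_client import Counter, start_http_server

RECORDS_PROCESSED = Counter(
    "records_processed_total",
    "Records compared between the monolith and the microservice",
    ["domain", "freshness"],              # freshness: "fresh" or "aged"
)
RECORDS_MISMATCH = Counter(
    "records_mismatch_total",
    "Records whose compared fields differed between the two systems",
    ["domain", "freshness"],
)

def record_comparison(domain: str, freshness: str, matched: bool) -> None:
    """Count one comparison result; the accuracy rate is derived from the two counters."""
    RECORDS_PROCESSED.labels(domain=domain, freshness=freshness).inc()
    if not matched:
        RECORDS_MISMATCH.labels(domain=domain, freshness=freshness).inc()

start_http_server(8000)                   # expose /metrics for Prometheus to scrape

# Accuracy rate as a Grafana/PromQL expression, e.g. over aged comparisons only:
#   (sum(records_processed_total{freshness="aged"})
#    - sum(records_mismatch_total{freshness="aged"}))
#   / sum(records_processed_total{freshness="aged"})
```

Labelling each comparison as fresh or aged at the moment it is counted is what lets dashboards and alerts treat the two populations separately.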


Challenge #1: Latency and Eventual Consistency

Because the new system was event-driven, eventual consistency was part of the design.

• The monolith would update immediately.
• The microservice might lag by milliseconds or seconds (depending on event ingestion and processing).

If we compared data too early, we would catch mismatches that weren’t real.

How We Mitigated It:
• Introduced buffer time before comparisons (e.g., 30–60 seconds).
• Retried mismatches before considering them final.
• Split fresh vs. aged metrics so stakeholders saw the full context.

Accepting and designing around eventual consistency prevented endless debugging loops and kept trust in the migration process intact.
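
In sketch form, the buffering and retrying looked roughly like the function below; the exact delays, retry schedule, and the two fetch callables are illustrative, not our production values.

```python
# Sketch of a delayed, retried comparison to absorb eventual consistency.
# The delay/retry values and the fetch callables are illustrative assumptions.
import time

BUFFER_SECONDS = 60             # wait before the first comparison
RETRY_DELAYS = [10, 30, 60]     # re-check before declaring a real mismatch

def compare_with_buffer(entity_id, fetch_from_monolith, fetch_from_microservice):
    """Return True if both systems agree, retrying so in-flight events can settle."""
    time.sleep(BUFFER_SECONDS)
    for delay in [0] + RETRY_DELAYS:
        time.sleep(delay)
        if fetch_from_monolith(entity_id) == fetch_from_microservice(entity_id):
            return True         # converged: count as a match
    return False                # still diverged after retries: a real mismatch
```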


Challenge #2: Validating Data Across Two Systems

Validation wasn’t a one-off — it was a continuous, multi-layered process.

We used two main approaches:

Shadow Reads
• Added parallel queries to both the monolith and microservice during critical transactions.
• Compared key data fields silently (without affecting user-facing behavior).
• Logged any differences for investigation.

This gave us live, production-grade feedback on data divergence.
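
Conceptually, a shadow read looked something like the sketch below: the user-facing response still comes from the monolith, while the comparison runs off the hot path. The client objects, field names, and threading approach are assumptions for illustration.

```python
# Sketch of a shadow read: serve the user from the monolith, compare the
# microservice's answer silently off the hot path. Names are illustrative.
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("shadow_reads")
executor = ThreadPoolExecutor(max_workers=4)

COMPARED_FIELDS = ("status", "amount", "currency")   # assumed key fields

def get_order(order_id, monolith_client, microservice_client):
    """User-facing read stays on the monolith; the comparison never blocks it."""
    monolith_order = monolith_client.get_order(order_id)

    def shadow_compare():
        try:
            micro_order = microservice_client.get_order(order_id)
            diffs = {
                field: (monolith_order[field], micro_order[field])
                for field in COMPARED_FIELDS
                if monolith_order[field] != micro_order[field]
            }
            if diffs:
                log.warning("shadow mismatch order_id=%s diffs=%s", order_id, diffs)
        except Exception:
            log.exception("shadow read failed for order_id=%s", order_id)

    executor.submit(shadow_compare)       # fire-and-forget; failures only log
    return monolith_order
```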

Batch Reconciliation
• Ran periodic batch jobs comparing large datasets.
• Focused on high-value entities and critical tables.
• Highlighted drift that wasn’t visible in real-time traffic.

Shadow reads caught immediate issues; batch scans caught structural ones.
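
A reconciliation pass can be as simple as the sketch below, which assumes (purely for illustration) that both sides can be queried as Postgres over the same key fields; the connection strings, table, and columns are made up.

```python
# Sketch of a batch reconciliation pass over one high-value table.
# Connection strings, table, and column names are illustrative assumptions.
import psycopg2

QUERY = """
    SELECT id, status, amount
    FROM orders
    WHERE updated_at < now() - interval '1 hour'   -- only aged rows that should have converged
    ORDER BY id
"""

def fetch_rows(dsn):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        return {row[0]: row[1:] for row in cur.fetchall()}

monolith_rows = fetch_rows("dbname=monolith_db user=reconciler")
service_rows = fetch_rows("dbname=orders_service user=reconciler")

missing = monolith_rows.keys() - service_rows.keys()
drifted = {
    record_id
    for record_id, values in monolith_rows.items()
    if record_id in service_rows and service_rows[record_id] != values
}

print(f"checked={len(monolith_rows)} missing={len(missing)} drifted={len(drifted)}")
```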


Challenge #3: Handling False Positives

Not every mismatch was a real bug. Common false positives included:

• Timestamp format differences (UTC vs local time).
• Optional fields behaving differently.
• Legacy fields populated differently.

Without careful handling, these harmless differences would swamp our metrics and slow real progress.

What Worked:
• Accuracy workshops: Engineers, QA, and data stakeholders sat together, reviewed mismatch samples, and classified real bugs vs. benign differences.
• Refined comparison logic: Adjusted matching rules and tolerated known differences (a sketch follows below).

These workshops helped reduce noise and allowed us to focus on true blockers to 99.9% accuracy.
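
In practice, the refined comparison logic meant normalising both records before diffing them. A simplified sketch, with illustrative fields and rules rather than our full rule set:

```python
# Sketch of normalising records before comparison to suppress known-benign
# differences. Field names and rules are illustrative, not the full rule set.
from datetime import datetime, timezone

IGNORED_FIELDS = {"legacy_ref"}           # legacy fields known to be populated differently

def normalise(record: dict) -> dict:
    clean = {}
    for field, value in record.items():
        if field in IGNORED_FIELDS:
            continue
        if value in ("", None):
            continue                      # treat empty optional fields as absent
        if isinstance(value, datetime):
            # Compare timestamps in UTC so timezone/format differences don't count.
            value = value.astimezone(timezone.utc).replace(microsecond=0)
        clean[field] = value
    return clean

def records_match(monolith_record: dict, service_record: dict) -> bool:
    return normalise(monolith_record) == normalise(service_record)
```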


Lessons We Took Away

• You can’t eliminate latency in event-driven systems. You design around it.
• Blind trust kills migrations. Metrics, visibility, and honesty build trust.
• Validation needs to happen at multiple layers. Shadow reads + batch reconciliations together made our system robust.
• Transparency speeds up trust-building. We shared dashboards and accuracy numbers — even when they weren’t great — with leadership and engineers alike.


Conclusion

Trust Was Built, Not Assumed

At the beginning, our accuracy was zero. There were no shortcuts — only careful validation, transparency, and iteration.

By the time we moved forward, the new system had earned trust — with measurable proof behind it, not just good intentions.

In any critical migration, that’s the real success: Trust isn’t migrated automatically. It’s rebuilt — record by record, validation by validation, and through relentless honesty in every metric.
