Summary: Migrating Data Ingestion Systems at Meta Scale (Meta Engineering, May 12, 2026)

Generated by Codex with GPT-5

What happened

Meta’s engineering blog published “Migrating Data Ingestion Systems at Meta Scale,” a May 12, 2026 post about replacing the data-ingestion architecture that moves social graph data from one of the world’s largest MySQL deployments into Meta’s data warehouse.

The post is interesting because it treats migration as a production-systems problem rather than a one-time cutover. Meta’s ingestion system incrementally scrapes several petabytes of social graph data from MySQL every day and feeds analytics, reporting, machine learning training data, and downstream product workflows. The legacy architecture relied heavily on customer-owned pipelines: workable when the system was smaller, but increasingly unstable as scale grew and data-landing deadlines tightened. The new architecture moves that responsibility into a simpler self-managed warehouse service. Building the new path was only part of the difficulty; the harder part was moving 100% of the existing workload without corrupting data, increasing latency, overrunning capacity, or leaving consumers to discover defects.

Meta’s migration lifecycle is the core mechanism. Each job had to pass three main promotion criteria before advancing: no data-quality mismatch between old and new systems, no landing-latency regression, and no resource-utilization regression. For critical tables, service owners added stricter criteria. That makes the migration less like a binary deployment and more like an evidence pipeline where every job produces signals about correctness, freshness, and cost before it is trusted.
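The three promotion criteria can be read as a gate that returns the reasons a job is not yet ready to advance. The following is a minimal sketch, not Meta's implementation: the `JobSignals` fields, the CPU-hours resource measure, and the optional slack parameters are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSignals:
    """Hypothetical per-job signals; field names are illustrative only."""
    mismatched_partitions: int   # partitions where old and new output differ
    old_landing_minutes: float   # landing latency of the old-system job
    new_landing_minutes: float   # landing latency of the new-system job
    old_cpu_hours: float         # resource use of the old-system job
    new_cpu_hours: float         # resource use of the new-system job


def promotion_blockers(s: JobSignals,
                       latency_slack: float = 0.0,
                       resource_slack: float = 0.0) -> list[str]:
    """Return the criteria that block promotion; an empty list means promote.

    The slack values (fractions, e.g. 0.1 for 10%) are an assumption that
    lets stricter or looser thresholds be set per table.
    """
    blockers = []
    if s.mismatched_partitions > 0:
        blockers.append("data_quality_mismatch")
    if s.new_landing_minutes > s.old_landing_minutes * (1 + latency_slack):
        blockers.append("landing_latency_regression")
    if s.new_cpu_hours > s.old_cpu_hours * (1 + resource_slack):
        blockers.append("resource_regression")
    return blockers
```

Returning the list of failed criteria, rather than a single boolean, keeps the reason for a blocked promotion visible to whoever investigates it.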

The first stage was a shadow phase. Meta ran the new-system job in pre-production against the same source as the production job, but wrote to a shadow table. That exposed the new system to real production data and behavior while keeping its output away from consumers. Engineers compared row counts and checksums between the production table and the shadow table, investigated mismatches, fixed causes in pre-production, and verified that the fixes resolved the discrepancy. They also measured compute and storage needs before allowing the job into production.
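A row-count and checksum comparison of this kind might look like the sketch below. The post does not say how Meta computes its checksums; summing per-row SHA-256 digests is an assumption chosen here so that the result does not depend on the order in which rows land.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of row tuples.

    Each row is hashed with SHA-256 and the digests are summed modulo 2**256,
    so the checksum is independent of row order but still changes when a row
    is altered, added, or duplicated.
    """
    count = 0
    total = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        total = (total + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, total


def tables_match(production_rows, shadow_rows):
    """True when both row count and checksum agree."""
    return table_fingerprint(production_rows) == table_fingerprint(shadow_rows)
```

Using `repr(row)` as the serialization only works for simple Python values; a real comparator would need a canonical encoding that both systems agree on.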

The second stage, the reverse shadow phase, is the most practical design choice in the post. Once the new and old jobs were both reliable in production, Meta swapped the write targets: the new job wrote to the production table, while the old job wrote to the shadow table. That gave the new system real production responsibility while preserving an always-on comparison path. It also made rollback fast because the old job still existed and was still producing comparable output. Only after continued monitoring showed no discrepancies did Meta remove the old-system shadow job and let the new job fully own delivery.
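The target swap can be expressed as a small mapping from lifecycle phase to write destinations. This is an illustrative sketch: the phase names and the `__shadow` table suffix are assumptions, not names from the post.

```python
from enum import Enum


class Phase(Enum):
    SHADOW = "shadow"                  # new job rehearses against a shadow table
    REVERSE_SHADOW = "reverse_shadow"  # new job serves production, old job shadows
    COMPLETE = "complete"              # old job removed


def write_targets(phase: Phase, table: str) -> dict[str, str]:
    """Map each system ("old", "new") to the table it writes in this phase."""
    shadow = f"{table}__shadow"  # hypothetical naming convention
    if phase is Phase.SHADOW:
        return {"old": table, "new": shadow}
    if phase is Phase.REVERSE_SHADOW:
        return {"old": shadow, "new": table}
    return {"new": table}
```

In this model, rolling back from reverse shadow is just a return to `Phase.SHADOW`: both jobs are still running, so only the targets change.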

Meta also built custom data-quality tooling around this lifecycle. For each landed shadow-table partition, the tool read the corresponding production partition and compared row count and checksum. Mismatches were logged to Scuba, Meta’s real-time analysis system. An hourly process then queried those logs, found example rows behind the mismatch, and wrote debugging details back to Scuba. That turns “row count differs” from a vague alert into a tractable investigation artifact. The same tooling remains in use as part of release validation, which is a good sign that the migration machinery became permanent reliability infrastructure rather than temporary project scaffolding.
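The "example rows behind the mismatch" step could be approximated with a multiset difference between the two partitions. This sketch assumes rows are hashable tuples small enough to hold in memory, which would not be true at Meta's scale; the function name and output keys are hypothetical.

```python
from collections import Counter


def sample_mismatched_rows(production_rows, shadow_rows, limit=5):
    """Return up to `limit` example rows found on only one side.

    Counter subtraction keeps duplicates, so a row that appears twice in one
    partition and once in the other is reported once.
    """
    prod = Counter(production_rows)
    shadow = Counter(shadow_rows)
    return {
        "only_in_production": list((prod - shadow).elements())[:limit],
        "only_in_shadow": list((shadow - prod).elements())[:limit],
    }
```

A handful of concrete rows is usually enough to tell a dropped update from a duplicated one, which is what makes the mismatch alert actionable.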

The rollback problem is harder because both systems use change data capture. Each ingestion job has a full-dump table, a delta table, and the target table consumed by data customers, with job metadata managed centrally. In a CDC pipeline, bad landed data can feed later outputs, so rollback has to stop propagation, not merely switch a pointer. Meta used the reverse-shadow phase to generate early signals before consumers were affected, including triggering backfills on both production and shadow jobs. If the backfill outputs diverged, the job could roll back immediately.
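To see why bad data propagates, it helps to sketch how a target partition is typically derived in a CDC pipeline: a prior snapshot merged with a stream of deltas. The merge semantics below are a generic illustration and an assumption, not a description of Meta's system.

```python
def apply_deltas(snapshot, deltas):
    """Merge CDC deltas into a snapshot and return the new target state.

    snapshot: dict mapping primary key -> row
    deltas:   iterable of (op, key, row) tuples in commit order, where op is
              "insert", "update", or "delete" (row is ignored for deletes)
    """
    merged = dict(snapshot)  # copy so the base snapshot stays untouched
    for op, key, row in deltas:
        if op == "delete":
            merged.pop(key, None)
        elif op in ("insert", "update"):
            merged[key] = row
        else:
            raise ValueError(f"unknown op: {op}")
    return merged
```

Because each day's target becomes the next day's base, one corrupted delta or target stays in every later output until the chain is rebuilt from a known-good base.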

To stop bad data from spreading, Meta marked partitions with data-quality issues in metadata. If a bad partition was a delta partition, new data stopped landing and an engineer was alerted. If it was a target partition, the system selected an older partition and merged it with newer deltas. That is the implementation detail that makes the post more than a checklist: the migration control plane knew enough about CDC lineage to quarantine suspect partitions and repair them with backfill.
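The quarantine logic can be sketched as a planner that consults the bad-partition metadata before each landing. Everything here is an assumption for illustration: partitions are identified by ISO date strings, a target partition for a date already includes that date's delta, and the action names are hypothetical.

```python
def plan_landing(targets, deltas, bad_targets, bad_deltas):
    """Decide how to build the next target partition.

    targets, deltas:         iterables of partition dates, e.g. "2026-05-03"
    bad_targets, bad_deltas: sets of dates flagged in job metadata
    """
    good_targets = sorted(t for t in targets if t not in bad_targets)
    if not good_targets:
        # Assumption: with no trustworthy base, a human has to step in.
        return {"action": "halt_and_alert", "reason": "no_good_target"}
    base = good_targets[-1]  # newest target partition not marked bad
    needed = sorted(d for d in deltas if d > base)
    blocked = [d for d in needed if d in bad_deltas]
    if blocked:
        # A bad delta stops new data from landing until an engineer responds.
        return {"action": "halt_and_alert", "reason": "bad_delta",
                "partitions": blocked}
    # Otherwise rebuild from the older good target plus the newer deltas.
    return {"action": "merge", "base": base, "deltas": needed}
```

ISO date strings sort lexicographically in date order, which is why plain string comparison is sufficient in this sketch.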

The final layer was automation for the full fleet migration. Meta had tens of thousands of ingestion jobs to move, and it could not run every shadow job at once. It emitted lifecycle and promotion signals for each job to Scuba, then built external migration tools that automatically promoted or demoted jobs between lifecycle stages based on those signals. System-level and job-level dashboards let engineers track both overall migration progress and individual failures.
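Signal-driven promotion and demotion amounts to a small state machine. The sketch below assumes three stages and a required streak of consecutive healthy runs before promotion; neither the stage names nor the streak threshold comes from the post.

```python
STAGES = ("shadow", "reverse_shadow", "complete")


def next_stage(stage, latest_run_healthy, healthy_streak, required_streak=3):
    """Return the stage a job should move to after its latest run.

    latest_run_healthy: True if the latest run passed all promotion criteria
    healthy_streak:     number of consecutive healthy runs, latest included
    required_streak:    hypothetical threshold for promotion
    """
    if stage == "complete":
        return stage  # the old job is gone, so there is nothing to fall back to
    index = STAGES.index(stage)
    if not latest_run_healthy:
        return STAGES[max(index - 1, 0)]  # demote one stage, or stay in shadow
    if healthy_streak >= required_streak:
        return STAGES[index + 1]  # promote one stage
    return stage  # healthy, but not enough evidence yet
```

Requiring a streak rather than a single healthy run is one way to keep a lucky run from promoting a flaky job.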

Capacity planning shaped the rollout. Jobs were batched by throughput, priority, and special cases. Known-bad jobs were excluded until fixes landed, which reduced duplicate failure noise and avoided expensive repeat full dumps. Because a new CDC job’s first snapshot is slow and costly, creating shadow jobs while known bugs remained would have multiplied full-dump work. Meta therefore delayed those jobs and reused snapshot partitions produced by the old system where possible to reduce load.
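Capacity-aware batching could be sketched as a greedy packer over the eligible jobs. The `Job` fields, the "lower priority number migrates first" convention, and the single shadow-capacity figure are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    priority: int      # assumption: lower number migrates earlier
    throughput: float  # illustrative units, e.g. TB/day


def plan_batches(jobs, shadow_capacity, known_bad=frozenset()):
    """Group eligible jobs into ordered batches that fit the shadow capacity.

    Known-bad jobs are left out entirely so they do not trigger costly full
    dumps before their fixes land. A job larger than the capacity still gets
    a batch of its own.
    """
    eligible = sorted(
        (j for j in jobs if j.name not in known_bad),
        key=lambda j: (j.priority, -j.throughput, j.name),
    )
    batches, current, used = [], [], 0.0
    for job in eligible:
        if current and used + job.throughput > shadow_capacity:
            batches.append(current)
            current, used = [], 0.0
        current.append(job.name)
        used += job.throughput
    if current:
        batches.append(current)
    return batches
```

For example, with jobs `a` (priority 1, throughput 6), `b` (1, 5), `c` (2, 3), a known-bad job `d`, and a capacity of 10, the plan is `[["a"], ["b", "c"]]`.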

Why it matters

The strongest engineering idea in the post is that a large migration needs a control system, not just a deployment plan. Meta did not rely on confidence that the new architecture was better. It made every job prove correctness, latency, and resource behavior before promotion, then kept the old path alive long enough to compare outputs after the new path began serving production tables.

That pattern matters for any data platform where correctness failures are persistent. In stateless services, a rollback can often stop new bad responses quickly. In ingestion systems, especially CDC systems, the outputs become future inputs. A bad partition can contaminate later landed data, training datasets, dashboards, billing logic, or product decisions. Meta’s partition-level bad-quality metadata is therefore a key reliability mechanism: it gives the system a way to localize damage, pause propagation, and repair from a known-good base instead of treating rollback as a coarse job-level action.

The reverse-shadow phase is also broadly useful. Many migrations stop comparing old and new behavior at the exact moment when the new system starts handling real production output. Meta kept comparison alive after rollout by making the old job the shadow job. That preserves observability through the riskiest interval: the period when the new path is no longer merely rehearsing, but consumers still need a safety net.

The post also shows why migration tooling should be designed as operational tooling. The row-count and checksum comparator, Scuba logs, hourly mismatch analysis, dashboards, lifecycle signals, and automated promotion rules were created for migration, but the data-quality analyzer remained useful for release validation. That is a useful test of whether migration work is building enduring capability. If the tooling only moves a project through a schedule, it is project management automation. If it keeps catching regressions after the migration, it has become part of the platform’s reliability model.

There is a scale lesson in the batching strategy. Limited shadow capacity forced Meta to think about ordering, exclusion, and blast radius. It did not simply migrate the easiest jobs first or start every possible comparison at once. It grouped jobs by throughput, priority, and special cases, notified dependent teams, excluded known-bad jobs, and avoided full-dump work that would be invalidated by known bugs. That is mundane but important: at tens of thousands of jobs, migration efficiency depends not only on how fast each job moves but also on how much wasted verification, duplicate investigation, and unnecessary backfill the system avoids.

The broader architectural move is from customer-owned pipelines to a self-managed warehouse ingestion service. The post does not dwell on the internal details of the new service, but the migration story makes the motivation clear. At hyperscale, a platform cannot depend on every consumer owning bespoke ingestion correctness, rollout logic, and recovery procedures. Centralizing the service gives the platform team one place to enforce lifecycle criteria, data-quality checks, rollback semantics, dashboards, and capacity policy.

Takeaway

Meta Engineering’s post is a concrete production lesson in how to migrate a data platform whose outputs are both high-volume and high-consequence. What succeeded was not a big-bang cutover but a staged control loop: shadow execution, objective promotion criteria, reverse shadowing, partition-aware rollback, automated lifecycle movement, and capacity-aware batching.

For teams running large data migrations, the broader takeaway is to make comparison and rollback first-class design objects. Run the new path against real inputs before it serves consumers. Keep the old path alive as a comparison system after rollout. Compare outputs at the granularity where defects can be isolated. Track bad data in metadata so downstream propagation can stop quickly. Automate promotion, but only from signals that reflect correctness, latency, and cost.

Large migrations become safer when they are treated as continuously verified production workflows. Meta’s data-ingestion migration shows how that mindset turns a risky platform replacement into a sequence of measurable, reversible steps.