How Shopify Uses Change Data Capture to Serve Millions of Merchants
From batch queries to streaming 100k records per second during peak load
Fellow Data Tinkerers!
Today we will look at how Shopify built a real-time data pipeline at 400TB scale.
But before that, I wanted to share with you what you could unlock if you share Data Tinkerer with just 1 more person.
There are 100+ resources to learn all things data (science, engineering, analysis). The list includes videos, courses and projects, and can be filtered by tech stack (Python, SQL, Spark, etc.), skill level (Beginner, Intermediate and so on), provider name or free/paid. So if you know other people who like staying up to date on all things data, please share Data Tinkerer with them!
Now, with that out of the way, let’s get to change data capture at Shopify!
TL;DR
Situation
Shopify’s batch Change Data Capture (CDC) powered analytics but suffered from latency, blind spots and inconsistent results across shards.
Task
The team needed a unified, near real-time pipeline that scaled across 100+ MySQL shards without overloading sources.
Action
They built a log-based CDC system using Debezium and Kafka, adding schema registry, snapshot fixes and GCS offloading for oversized records.
Result
The platform now handles 400TB+ of CDC data and 100k events/sec at peak, and has cut merchant data freshness from 24 hours to ~1 hour.
Use Cases
Product reporting, marketing engagement tracking, stream processing
Tech Stack/Framework
Apache Spark, Presto, dbt, MySQL, Debezium, Apache Kafka, Apache Avro, Hive, Parquet, BigQuery
Explained further
Context
Shopify’s data platform has been through several life stages since 2004. What started as a warehouse for internal analytics grew into a data lake with multiple storage systems, processing engines and hungry consumers. Along the way the team moved to the cloud and adopted open-source tooling like Apache Spark, Presto and dbt.
The mission also expanded. The platform isn’t just for internal dashboards anymore. It now powers analytics needs for more than a million merchants who want fast, consistent, trustworthy views of their business.
That growth exposed cracks. OLAP queries were slow for some use cases. Serving structured datasets from the lake was clunky. Batch ingest meant delays. The business needed faster ingest and lower-latency queries on semi-structured data, so the team split the world into two platforms:
1- Internal data warehouse
Extract. Batch jobs on a schedule. Query-based extraction against the sharded core. Change events per shard get written to cloud storage and denormalized into large Hive or Parquet tables.
Transform. Spark and dbt handle the heavy lifting.
Load. Downstream sinks include BigQuery, Redshift and custom services that serve structured data inside Shopify.
2- Merchant analytics platform
Extract. An always-on batch-query application. It is still query-based CDC, but change events go straight to Kafka.
Transform. Lightweight streaming transforms write outputs back to Kafka.
Load. Kafka streams into Google Bigtable, which powers queries in the merchant admin.
Running two platforms unlocked merchant analytics without blocking the data science team, but it was not all smooth sailing.
When a fix creates its own problems
While splitting into two platforms was beneficial for merchant analytics, it created real costs:
Different tools and standards. Shipping a merchant dashboard means writing Golang code, a Kafka streaming app and Bigtable tables, which calls for very different skills than writing a batch Spark job.
Two extraction tools, two answers. Timing differences mean two jobs can disagree on the same table.
Duplicate work. Teams build similar analytics twice.
As Shopify’s product stretched beyond storefronts to things like email, capital, balance and marketing, analytics needs ballooned. The split amplified the pain. The obvious fix was a single, unified extraction pipeline that meets three non-negotiables:
Access to data in the main sharded monolith.
Fresher data for general use as quickly as possible.
High quality that reflects every upstream change.
Let’s have a look at how Shopify achieved these requirements.
Longboat: Batch CDC at Shopify
Longboat is Shopify’s internal query-based CDC service. It runs batch extraction jobs per table, per shard. Every run pulls rows updated since the last run then writes results to cloud storage.
Design choices matter here:
Per-table, no joins. Keep impact on the source low.
Read-only replicas. Long-running queries do not hammer primaries.
Hourly at best. Due to source load, a given table’s query runs at most once per hour. Add processing and copying and you are always at least an hour old.
The outputs land in storage, then get joined into a snapshot for analytics, modeling and reports. Solid for batch. Not great as a general-purpose, fresh data feed.
The limits of batch query capture
The naive plan is SELECT * FROM table. That dies on billion-row tables. Longboat instead increments on updated_at. It runs sequential jobs that fetch only rows where updated_at is newer than the prior job’s high-water mark.
This is efficient and minimises load but it introduces blind spots.
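To make the mechanism concrete, here is a minimal sketch of what such a query-based incremental pull might look like. It is an illustration rather than Longboat’s actual code; the table, column names, driver and connection details are assumptions:

```python
# Illustrative sketch of a query-based incremental pull, not Longboat's
# actual code. Table, column and connection details are assumptions.
from datetime import datetime, timezone

import mysql.connector  # MySQL Connector/Python driver


def pull_changes(conn, table, high_water_mark):
    """Fetch rows touched since the previous run's high-water mark."""
    new_mark = datetime.now(timezone.utc)
    cursor = conn.cursor(dictionary=True)
    cursor.execute(
        f"SELECT * FROM {table} WHERE updated_at > %s AND updated_at <= %s",
        (high_water_mark, new_mark),
    )
    rows = cursor.fetchall()
    cursor.close()
    # Hard-deleted rows never show up here, and any intermediate states a
    # row went through between two runs are already gone.
    return rows, new_mark


conn = mysql.connector.connect(
    host="products-read-replica.internal",  # read replica, never the primary
    user="longboat_reader",
    password="<redacted>",
    database="shop",
)
rows, mark = pull_changes(conn, "products", datetime(2021, 1, 1, tzinfo=timezone.utc))
```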
Longboat issues
1- Unable to capture hard delete events
Soft deletes flip a flag and bump updated_at, so query-based capture sees them. Hard deletes erase the row entirely, leaving no updated_at to detect, so query-based capture never sees them. Many Shopify tables use soft delete partly to work around this. That keeps data around for analytics but it bloats tables and complicates true data removal.
2- Reliance on updated_at
Relying on updated_at for analytics downstream assumes every update touches the column. That is usually true but not always. At times, Shopify teams deliberately avoided bumping the column during migrations to keep Longboat from being flooded, and sometimes bugs in application code skipped the update. Either way, those changes went missing downstream. For example, if a customer’s name was updated from Peyton to Eli but updated_at was never touched, that change would never be captured.
3- Misses intermediate states of a row between jobs
If a row changes multiple times between hourly runs, only the latest state is captured. Any intermediate values are lost. That breaks models that care about the precise sequence of events.
4- Consistency
Join two snapshots with different high-water marks, say sales at 2:15 and products at 2:30. What freshness do you claim for the result? This is awkward to communicate to merchants. It chips away at trust.
5- Records that never get caught
There is also a rare case where Longboat never “catches” a row at all: something keeps updating it, so it stays perpetually out of reach of the extraction query. There isn’t a solution other than to stop making the updates.
You might think: “but couldn’t this be addressed by making Longboat faster?”
Short answer: no. Faster queries pound the sources. Even read replicas have limits. Past a point it becomes technically infeasible and operationally risky.
And even if the queries sprint, the rest of the pipeline is still batch. Each step waits for the previous step. Any hiccup adds delay. You cannot batch your way into real time.
Outgrowing batch: Enter log-based CDC
Longboat did its job for years. New requirements outgrew it. The team needed comprehensive updates, near real-time with minimal load on the source. Missing intermediate events and invisible hard deletes are deal-breakers for several products.
So they pivoted to log-based change data capture. Capture once. Stream everywhere. Event streaming is a natural next step. Shopify already runs Kafka. There is mature tooling to turn database changes into events: Debezium and Maxwell’s Daemon are common choices. There are homegrown options too, like Netflix’s DBLog pattern.
Shopify’s core runs MySQL with a primary and multiple read replicas. MySQL replication relies on the binary log that records table and row changes. That binlog is the foundation for log-based CDC. So the team set out to replace Longboat with a platform built on proven streaming tech.
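Log-based capture hinges on the readers exposing row-level binlogs. As a hedged illustration (hostname and credentials invented), a quick check of the MySQL settings that tools like Debezium document as prerequisites might look like this:

```python
# Check that a MySQL reader exposes row-based binlogs, which log-based CDC
# tools such as Debezium rely on. Hostname and credentials are illustrative.
import mysql.connector

conn = mysql.connector.connect(
    host="mysql-shard-1-reader.internal",
    user="cdc_check",
    password="<redacted>",
)
cursor = conn.cursor()
cursor.execute(
    "SHOW VARIABLES WHERE Variable_name IN "
    "('log_bin', 'binlog_format', 'binlog_row_image')"
)
for name, value in cursor.fetchall():
    # Expect log_bin=ON, binlog_format=ROW, binlog_row_image=FULL
    print(f"{name} = {value}")
```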
Designing a unified CDC pipeline
In mid-2019, a cross-functional group at Shopify started testing whether a log-based CDC system could reliably handle the 100+ MySQL shards powering the company’s core monolith. The evaluation included Debezium, Maxwell’s Daemon, SpinalTap and a homegrown option modeled after Netflix’s DBLog. Among the open-source contenders, Debezium quickly stood out. It was the most active project, backed by responsive maintainers and broad database coverage. That flexibility mattered, since not all of Shopify’s future sources would be MySQL.
Longboat, the existing tool, pulled data from MySQL and dropped it into cloud storage. Debezium shifted the paradigm: it streamed change events directly into Kafka. Because Kafka was already a foundational part of Shopify’s stack, this integration aligned with infrastructure already in place. It also exposed a standard API, enabling teams to reuse Shopify’s existing stream processing tools without reinventing the wheel.
Debezium’s defaults, however, didn’t fit neatly. Out of the box, it produced one Kafka topic per table per database shard. With more than 100 shards, that meant a flood of topics like shopify_shard_1.products, shopify_shard_1.users and so on. Consumers would be forced to subscribe to each shard individually. To simplify, Shopify merged events from all shards on a per-table basis, creating a clean 1:1 mapping between a logical table and a Kafka topic.
Here’s how it works in practice:
– Each database shard gets its own Debezium connector, scoped to a subset of tables.
– A RegexRouter transform rewrites the destination so every change event flows into a shard-level topic (e.g. shopify_shard_1_events).
– A custom Kafka Streams job fans these out again, reshuffling events into per-table topics like products or users.
Partitioning is done on the source table’s primary key, guaranteeing that all events for the same row stay in the same partition. To consumers, the shard layer effectively disappears. They just subscribe to a single table topic.
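Below is a hedged sketch of what registering one per-shard connector might look like through the Kafka Connect REST API. The hostnames, credentials and table list are invented and exact property names vary by Debezium version; only the RegexRouter rerouting mirrors the design described above:

```python
# Illustrative per-shard connector registration, not Shopify's actual
# configuration. Hostnames, credentials and table names are assumptions.
import requests

shard = 1
connector = {
    "name": f"shopify-shard-{shard}",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": f"mysql-shard-{shard}-reader.internal",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "<redacted>",
        "database.server.name": f"shopify_shard_{shard}",
        "table.include.list": "shop.products,shop.users",
        # Reroute every table's change events into one shard-level topic;
        # a Kafka Streams job then fans them out into per-table topics.
        "transforms": "reroute",
        "transforms.reroute.type": "org.apache.kafka.connect.transforms.RegexRouter",
        "transforms.reroute.regex": ".*",
        "transforms.reroute.replacement": f"shopify_shard_{shard}_events",
    },
}

resp = requests.post("http://kafka-connect.internal:8083/connectors", json=connector)
resp.raise_for_status()
```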
Another crucial design choice is compaction. Since each event is keyed by primary key, compacted Kafka topics preserve only the latest value for each row. This makes them act like a materialized snapshot of the upstream table. Consumers can bootstrap a table simply by replaying its topic. Storage use scales with dataset size, not the number of events generated over time.
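For consumers, bootstrapping from a compacted topic can be as simple as replaying it from the beginning and keeping only the latest value per key. A minimal sketch, assuming the confluent-kafka Python client, JSON-encoded values (rather than Shopify’s Avro setup) and an invented partition count:

```python
# Minimal sketch: rebuild a table's current state from its compacted topic.
# Assumes the confluent-kafka client, JSON values and 16 partitions; the
# real pipeline uses Avro with a schema registry.
import json

from confluent_kafka import OFFSET_BEGINNING, Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "kafka.internal:9092",
    "group.id": "products-bootstrap",
    "enable.auto.commit": False,
})
consumer.assign([TopicPartition("products", p, OFFSET_BEGINNING) for p in range(16)])

table = {}  # primary key -> latest known row state
while True:
    msg = consumer.poll(timeout=5.0)
    if msg is None:              # no new messages: we have caught up
        break
    if msg.error():
        continue
    key = msg.key().decode()
    if msg.value() is None:      # tombstone: the row was deleted upstream
        table.pop(key, None)
    else:
        table[key] = json.loads(msg.value())

consumer.close()
```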
Performance targets were also met. From the moment a row is written in MySQL to when its event appears in Kafka, the p99 latency is under 10 seconds. Median latency is typically far lower, with the main variable being replication lag on the MySQL readers feeding Debezium.
The end result is a standardized, real-time ingestion path for the company. Developers no longer need to write brittle, one-off batch jobs. Instead, they just provide connector configuration and the pipeline takes care of the rest.
At the record level, the output is straightforward. A single event like an insert into the addresses table shows up as a structured Debezium record, ready for any downstream consumer to process.
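As a rough illustration of that shape (field values invented; the real payload is Avro-encoded and carries more source metadata), the envelope looks something like this:

```python
# Rough shape of a Debezium change event for an insert into `addresses`.
# Values are invented; real records are Avro-encoded and richer.
change_event = {
    "before": None,                 # no prior state for an insert
    "after": {
        "id": 12345,
        "address1": "123 Example Street",
        "city": "Ottawa",
        "updated_at": 1609459200000,
    },
    "source": {
        "connector": "mysql",
        "name": "shopify_shard_1",  # logical server / shard name
        "table": "addresses",
        "file": "mysql-bin.000123", # binlog file and position of the change
        "pos": 4567,
    },
    "op": "c",                      # c = create, u = update, d = delete
    "ts_ms": 1609459200123,
}
```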
Building a unified CDC platform was a big win, but also a bumpy ride. The next section will go through the challenges the Shopify team faced.
Learning CDC at Shopify, one bug at a time
Debezium has been moving fast. Its 1.0 release only landed in early 2020, right as Shopify began its adoption. At the same time, Shopify had no experience with Kafka Connect before this project, so the team had to learn while building. That combination of a maturing project and a team still building expertise meant running into fresh bugs and creating reliability practices along the way.
Schemas and change management
Out of the box, Debezium streams events with Apache Avro schemas. Avro captures table metadata like field names, types and defaults, and guarantees a faithful representation of the source. Each table gets its own schema, mapped directly from the upstream definition.
That direct mapping is powerful but not without its flaws. Every schema change in MySQL flows straight into the CDC pipeline. Evolution rules help (consumers can usually ignore new optional fields or dropped columns) but not every change is compatible. A type change from Integer to String forces every consumer to update or risk broken processing. These breaking changes are painful and avoiding them has become a company-wide priority. When they are unavoidable, the challenge shifts to early detection and proactive communication with every downstream team.
This is one of the hard truths of CDC: consumers end up tightly coupled to the producer’s schema. Normal application evolution, such as migrations, refactors and field retypes, ripples out beyond a single codebase into dozens of consuming systems.
Schema registry as a backbone
Confluent’s Kafka Schema Registry became the natural fit for managing this complexity. It integrates cleanly with Debezium and Kafka consumers, while also acting as a searchable store of schema definitions. In practice, it doubles as the foundation for data discovery in Shopify’s CDC platform.
Every Kafka topic corresponds to a single domain, with schemas that evolve compatibly over time. Registry metadata lets engineers trace which consumers are reading which topics, generate dependency graphs and track lineage of sensitive fields. Combined with access controls and client identities, it gives a map of how data moves across the streaming ecosystem and a way to notify the right owners before a breaking change goes live.
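One way this can be put into practice is to test a proposed schema against the registry before a migration ships. Below is a hedged sketch using the Confluent Schema Registry REST API; the registry URL, subject name and fields are invented:

```python
# Hedged sketch: check a proposed Avro schema against the latest registered
# version before a migration ships. Subject name and fields are invented.
import json

import requests

REGISTRY = "http://schema-registry.internal:8081"
subject = "products-value"

proposed_schema = {
    "type": "record",
    "name": "products",
    "fields": [
        {"name": "id", "type": "long"},
        # Retyping a field (e.g. int -> string) here would come back as
        # incompatible and need coordination with every consumer.
        {"name": "title", "type": "string"},
        {"name": "updated_at", "type": ["null", "long"], "default": None},
    ],
}

resp = requests.post(
    f"{REGISTRY}/compatibility/subjects/{subject}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": json.dumps(proposed_schema)},
)
resp.raise_for_status()
print("compatible:", resp.json()["is_compatible"])
```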
First snapshots
A fresh Debezium connector always begins with a snapshot: a SELECT * FROM table dump that publishes the current table contents to Kafka. Once that completes, CDC switches to log-based capture.
Shopify faced three major problems with Debezium’s table snapshotting:
1- The table allow-list controls which tables get snapshotted, but tables added to it later only get their changes going forward, with no initial state. Consumers expecting full history will miss data.
2- Debezium can’t snapshot a table without pausing binlog consumption. For massive tables that take hours, this pause is unacceptable for latency-sensitive use cases.
3- Some core Shopify tables are simply too large to snapshot at all within a practical window.
These remain open issues for the platform today. To move forward, the team is exploring an incremental snapshot tool that would sidestep the blocking issues and scale to Shopify’s largest datasets.
The pain of oversized records
Some Shopify workflows write massive blobs of text into the primary MySQL database. In extreme cases, these records can reach tens of megabytes, well past Kafka’s 1 MB default record limit. While it’s possible to raise the broker setting, doing so hurts performance and still leaves the pipeline vulnerable to the next record that’s just slightly bigger.
The team narrowed the problem down to two approaches:
Split the record into multiple Kafka events and stitch them back together on the consumer side.
Store the record in external storage and let Kafka carry only a pointer.
Option 1: Split records
This approach slices a large record into many smaller ones, each small enough for Kafka. Consumers then recombine the fragments into a single logical record.
Pros:
– Everything stays inside Kafka, including access controls.
Cons:
– Hard to manage records crossing consumer batch boundaries.
– Vanilla consumers won’t understand multi-event records.
At first, this seemed promising. But it quickly became clear that every language client would need custom consumers to reassemble records before business logic could use them. That meant reworking large parts of the consumer pipeline which was doable but complex and error-prone.
Option 2: External storage

Here, a single Kafka record points to the payload stored in Google Cloud Storage (GCS). Kafka carries only the pointer; the data itself lives elsewhere.
Pros:
– Straightforward production and consumption.
– Works with unmodified Kafka consumers.
Cons:
– Dependence on external storage, with risks around access control, connectivity and human error.
– Requires careful cleanup of GCS records once they’re no longer accessible through Kafka.
Shopify implemented this path using a custom Kafka Serializer/Deserializer (Serde) layered on Confluent’s Avro Converter. Every record is size-checked and compressed. If compression isn’t enough, the record goes to GCS and a placeholder with the GCS path is written to Kafka. Consumers use the same Serde to fetch, decompress and deserialize.
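Below is a simplified sketch of that offloading idea. It is not Shopify’s actual Serde (which wraps Confluent’s Avro converter); the bucket name, size threshold and placeholder format are assumptions for illustration:

```python
# Simplified sketch of large-record offloading: compress first, and if the
# payload is still too big, park it in GCS and send only a pointer through
# Kafka. Not Shopify's actual Serde; names and thresholds are assumptions.
import gzip
import json
import uuid

from google.cloud import storage

MAX_KAFKA_BYTES = 900_000           # margin under the broker's ~1 MB default
BUCKET = "cdc-oversized-records"    # illustrative bucket name
gcs = storage.Client()


def serialize(value: bytes) -> bytes:
    compressed = gzip.compress(value)
    if len(compressed) <= MAX_KAFKA_BYTES:
        header = json.dumps({"inline": True}).encode()
        return header + b"\n" + compressed
    # Still too big: offload to GCS and emit only a placeholder record.
    path = f"offloaded/{uuid.uuid4()}.gz"
    gcs.bucket(BUCKET).blob(path).upload_from_string(compressed)
    return json.dumps({"inline": False, "gcs_path": path}).encode()


def deserialize(payload: bytes) -> bytes:
    header, _, body = payload.partition(b"\n")
    meta = json.loads(header)
    if meta["inline"]:
        return gzip.decompress(body)
    blob = gcs.bucket(BUCKET).blob(meta["gcs_path"])
    return gzip.decompress(blob.download_as_bytes())
```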
Impacts on consumers
The presence of oversized records is mostly a legacy of using MySQL as a blob store. Since the CDC pipeline must support access to that data, they built a mechanism for it. But long term, Shopify wants to reduce Kafka’s exposure to giant records. Events should be clean, domain-specific and rarely require megabytes of blob data.
For cases where blobs remain, security is critical. GCS data must be encrypted, locked down with Kafka access control lists and deleted as soon as the corresponding Kafka record is gone. These steps add overhead but they’re necessary to keep large-record handling both reliable and compliant.
Adoption in numbers
So far the numbers are substantial.
Storage. 400 TB+ of CDC data lives in the Kafka cluster.
Connect. About 150 Debezium connectors run across 12 Kubernetes pods.
Throughput. On Black Friday Cyber Monday 2020 the system averaged ~65,000 records per second with spikes up to 100,000 records per second. That was on a subset of tables. Coverage will grow as consumers mature.
Adoption has followed quickly. One of the fastest-growing consumers is the revamped Marketing Engagements API which now runs on CDC streams instead of daily warehouse batches. The shift cut data freshness from a full day down to about an hour, giving merchants near real-time visibility into how buyers respond to campaigns.
Other teams are following the same pattern, moving their batch workflows off Longboat tables and onto streaming sources. Some are experimenting with frameworks for streaming joins and long-lived materialized state, exploring what becomes possible once data arrives continuously instead of once a day.
The full scoop
To learn more, check out Shopify’s Engineering Blog post on this topic.
If you are already subscribed and enjoyed the article, please give it a like and/or share it with others. Really appreciate it 🙏
Keep learning
How Grab Shrunk Real-Time Queries from 5 Minutes to 1 with FlinkSQL and Kafka
Cold starts, version drift and clunky notebooks: Grab hit all the classic headaches of streaming at scale.
Here’s how they fixed it with FlinkSQL + Kafka.
How Expedia Monitors 1000+ A/B Tests in Real Time with Flink and Kafka
Expedia had a problem: bad A/B tests were slipping through the cracks in the first 24 hours and hurting revenue. So they built a real-time circuit breaker using Flink and Kafka to monitor 1,000+ experiments.
If you want to learn how it works and what makes it tricky, check out this article!