<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Data Tinkerer: Data Engineering]]></title><description><![CDATA[Dive into the latest trends and updates in data engineering!]]></description><link>https://www.datatinkerer.io/s/data-engineering</link><image><url>https://substackcdn.com/image/fetch/$s_!JEdj!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bea5ccd-f356-4154-a1db-16268300510e_500x500.png</url><title>Data Tinkerer: Data Engineering</title><link>https://www.datatinkerer.io/s/data-engineering</link></image><generator>Substack</generator><lastBuildDate>Wed, 08 Apr 2026 12:33:23 GMT</lastBuildDate><atom:link href="https://www.datatinkerer.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Data Tinkerer]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[datatinkerer@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[datatinkerer@substack.com]]></itunes:email><itunes:name><![CDATA[Data Tinkerer]]></itunes:name></itunes:owner><itunes:author><![CDATA[Data Tinkerer]]></itunes:author><googleplay:owner><![CDATA[datatinkerer@substack.com]]></googleplay:owner><googleplay:email><![CDATA[datatinkerer@substack.com]]></googleplay:email><googleplay:author><![CDATA[Data Tinkerer]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[How Notion Scaled AI Q&A to Millions of Workspaces]]></title><description><![CDATA[Kafka, Spark and Ray powering low-latency, high-throughput search pipelines]]></description><link>https://www.datatinkerer.io/p/how-notion-scaled-ai-q-and-a-to-millions-of-workspaces</link><guid 
isPermaLink="false">https://www.datatinkerer.io/p/how-notion-scaled-ai-q-and-a-to-millions-of-workspaces</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Thu, 26 Mar 2026 04:00:33 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1683114010575-3ead4403180e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxub3Rpb258ZW58MHx8fHwxNzc0MjU5NjgzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fellow Data Tinkerers</p><p>Today we will look at how Notion scaled its AI Q&amp;A to millions of users.</p><p>But before that, I wanted to share with you what you could unlock if you share Data Tinkerer with just <strong>1 more person</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HsBV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F084aaae7-58c7-464e-b0f8-7c31523985ef_800x402.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HsBV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F084aaae7-58c7-464e-b0f8-7c31523985ef_800x402.gif 424w, https://substackcdn.com/image/fetch/$s_!HsBV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F084aaae7-58c7-464e-b0f8-7c31523985ef_800x402.gif 848w, https://substackcdn.com/image/fetch/$s_!HsBV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F084aaae7-58c7-464e-b0f8-7c31523985ef_800x402.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!HsBV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F084aaae7-58c7-464e-b0f8-7c31523985ef_800x402.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HsBV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F084aaae7-58c7-464e-b0f8-7c31523985ef_800x402.gif" width="800" height="402" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/084aaae7-58c7-464e-b0f8-7c31523985ef_800x402.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:402,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3369150,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/191742179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F084aaae7-58c7-464e-b0f8-7c31523985ef_800x402.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HsBV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F084aaae7-58c7-464e-b0f8-7c31523985ef_800x402.gif 424w, https://substackcdn.com/image/fetch/$s_!HsBV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F084aaae7-58c7-464e-b0f8-7c31523985ef_800x402.gif 848w, https://substackcdn.com/image/fetch/$s_!HsBV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F084aaae7-58c7-464e-b0f8-7c31523985ef_800x402.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!HsBV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F084aaae7-58c7-464e-b0f8-7c31523985ef_800x402.gif 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>There are 100+ resources to learn all things data (science, engineering, analysis). It includes videos, courses and projects and can be filtered by tech stack (Python, SQL, Spark, etc.), skill level (Beginner, Intermediate and so on), provider name or free/paid. 
So if you know other people who like staying up to date on all things data, please share Data Tinkerer with them!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/leaderboard?&amp;utm_source=post&quot;,&quot;text&quot;:&quot;Refer a friend&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.datatinkerer.io/leaderboard?&amp;utm_source=post"><span>Refer a friend</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1683114010575-3ead4403180e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxub3Rpb258ZW58MHx8fHwxNzc0MjU5NjgzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1683114010575-3ead4403180e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxub3Rpb258ZW58MHx8fHwxNzc0MjU5NjgzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1683114010575-3ead4403180e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxub3Rpb258ZW58MHx8fHwxNzc0MjU5NjgzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1683114010575-3ead4403180e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxub3Rpb258ZW58MHx8fHwxNzc0MjU5NjgzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1683114010575-3ead4403180e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxub3Rpb258ZW58MHx8fHwxNzc0MjU5NjgzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img 
src="https://images.unsplash.com/photo-1683114010575-3ead4403180e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxub3Rpb258ZW58MHx8fHwxNzc0MjU5NjgzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" width="3840" height="2160" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1683114010575-3ead4403180e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxub3Rpb258ZW58MHx8fHwxNzc0MjU5NjgzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2160,&quot;width&quot;:3840,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;a black and white block with the letter n on it&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="a black and white block with the letter n on it" title="a black and white block with the letter n on it" srcset="https://images.unsplash.com/photo-1683114010575-3ead4403180e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxub3Rpb258ZW58MHx8fHwxNzc0MjU5NjgzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1683114010575-3ead4403180e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxub3Rpb258ZW58MHx8fHwxNzc0MjU5NjgzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1683114010575-3ead4403180e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxub3Rpb258ZW58MHx8fHwxNzc0MjU5NjgzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1272w, 
https://images.unsplash.com/photo-1683114010575-3ead4403180e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxub3Rpb258ZW58MHx8fHwxNzc0MjU5NjgzfDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@maria_shalabaieva">Mariia Shalabaieva</a> on <a href="https://unsplash.com">Unsplash</a></figcaption></figure></div><p>Now, with that out of the way, let&#8217;s get to Notion&#8217;s AI Q&amp;A level up!</p><h3>TL;DR</h3><div><hr></div><h4><strong>Situation</strong></h4><p>Notion launched AI Q&amp;A on top of 
vector search and quickly faced massive demand across millions of workspaces. The initial system hit limits in capacity, onboarding speed and cost.</p><h4><strong>Task</strong></h4><p>Scale onboarding, keep indexes fresh and reduce rising infrastructure costs. At the same time, simplify an increasingly complex architecture without hurting latency.</p><h4><strong>Action</strong></h4><p>They introduced dual ingestion paths and generation-based indexing, moved to a serverless architecture and migrated to turbopuffer. They then reduced recomputation with page state tracking and moved embeddings to Ray for unified compute.</p><h4><strong>Result</strong></h4><p>600x growth in daily onboarding capacity, 15x growth in active workspaces and major cost reductions across layers. Latency improved and the system became simpler and more efficient.</p><h4><strong>Use Cases</strong></h4><p>Real-time search indexing, semantic search, document retrieval</p><h4><strong>Tech Stack/Framework</strong></h4><p>Apache Spark, AWS EMR, Apache Airflow, Apache Kafka, AWS S3, DynamoDB, Ray, turbopuffer</p><div><hr></div><h3>Explained further</h3><div><hr></div><h4>Context</h4><p>When <a href="https://www.notion.com/blog/introducing-q-and-a">Notion launched AI Q&amp;A</a> in November 2023, the core idea sounded simple enough: let people ask natural-language questions and retrieve relevant knowledge from across their workspace and connected tools. In practice, that meant building a vector search system that could ingest huge amounts of content, stay fresh as pages changed and do all of it at a cost that made sense at Notion scale.</p><p>That is the real story here. Not just &#8220;vector search powers AI&#8221; but what happens after launch, when adoption jumps faster than expected and the infrastructure underneath has to keep up. 
Over two years, the Notion team pushed that system through several big transitions: scaling onboarding, dealing with storage pressure, changing database architecture, reworking indexing logic and moving embeddings workloads onto Ray. The headline numbers are hard to ignore: 10x scale and roughly one-tenth the cost.</p><p>This is a good example of how modern AI infrastructure usually evolves. The first version gets the product live. The next few versions are about survival, then simplification, then cost, then latency, then getting rid of all the awkward bits that built up during the rush.</p><div><hr></div><h4>Vector search, explained through Notion&#8217;s lens</h4><p>Traditional keyword search is literal. It works when users type the exact words that exist in the content. It starts falling apart when the wording changes but the meaning stays the same. Someone searching for &#8220;team meeting notes&#8221; may still want a page called &#8220;group standup summary,&#8221; but keyword search does not naturally understand that those are closely related.</p><p>Vector search solves that by representing text as embeddings. Instead of storing only words, it maps text into a high-dimensional space where semantically similar ideas sit closer together. That means retrieval is based on meaning, not exact phrasing.</p><p>For Notion AI, this matters a lot. The system needs to answer questions in natural language by finding useful content across a workspace and even across connected sources like Slack and Google Drive. That is exactly the sort of setup where semantic retrieval becomes more useful than plain lexical matching. A user is not thinking about the title of the page or the exact phrasing inside a paragraph. 
They are asking a question in their own words and expecting the system to bridge the gap.</p><p>That expectation becomes expensive very quickly.</p><div><hr></div><h4>Part 1: Scaling beyond what the original system expected</h4><p>At launch, Notion&#8217;s ingestion and indexing pipeline had two paths.</p><p>The first was an offline path. Batch jobs running on Apache Spark would chunk existing documents, generate embeddings through an API and bulk-load those vectors into the vector database. This handled the heavy lifting for backfilling large amounts of existing content.</p><p>The second was an online path. Kafka consumers processed page edits in near real time so live workspaces stayed up to date with sub-minute latency.</p><p>It is a practical split. The offline side handles the backlog and large initial loads. The online side keeps things fresh once a workspace is active. Together, the two-path setup gave Notion a way to onboard workspaces at scale without sacrificing freshness for day-to-day edits.</p><p>The vector database itself ran on dedicated &#8216;pod&#8217; clusters, where storage and compute were coupled. 
The Notion team designed sharding in a way that echoed their Postgres setup: workspace ID was the partitioning key, routing used range-based partitioning and a single config referenced all shards.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zNlu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff165c24e-314a-4bfb-ae67-84e69f2b5c63_2172x1428.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zNlu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff165c24e-314a-4bfb-ae67-84e69f2b5c63_2172x1428.avif 424w, https://substackcdn.com/image/fetch/$s_!zNlu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff165c24e-314a-4bfb-ae67-84e69f2b5c63_2172x1428.avif 848w, https://substackcdn.com/image/fetch/$s_!zNlu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff165c24e-314a-4bfb-ae67-84e69f2b5c63_2172x1428.avif 1272w, https://substackcdn.com/image/fetch/$s_!zNlu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff165c24e-314a-4bfb-ae67-84e69f2b5c63_2172x1428.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zNlu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff165c24e-314a-4bfb-ae67-84e69f2b5c63_2172x1428.avif" width="1456" height="957" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f165c24e-314a-4bfb-ae67-84e69f2b5c63_2172x1428.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:957,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:12663,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/191742179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff165c24e-314a-4bfb-ae67-84e69f2b5c63_2172x1428.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zNlu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff165c24e-314a-4bfb-ae67-84e69f2b5c63_2172x1428.avif 424w, https://substackcdn.com/image/fetch/$s_!zNlu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff165c24e-314a-4bfb-ae67-84e69f2b5c63_2172x1428.avif 848w, https://substackcdn.com/image/fetch/$s_!zNlu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff165c24e-314a-4bfb-ae67-84e69f2b5c63_2172x1428.avif 1272w, https://substackcdn.com/image/fetch/$s_!zNlu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff165c24e-314a-4bfb-ae67-84e69f2b5c63_2172x1428.avif 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">pipelines writing into sharded vector database pods (Source: Notion)</figcaption></figure></div><p>That all made sense on paper. Then the product launched and demand was overwhelming.</p><p>Notion quickly built up a waitlist of millions of workspaces that wanted access to Q&amp;A. The problem was no longer whether the system worked. It was how fast it could onboard people without cracking under the pressure.</p><p><strong>When the indexes started to fill up</strong></p><p>Only a month after launch, the original indexes were already nearing capacity.</p><p>That is the kind of problem that sounds good in product meetings and bad in infrastructure meetings. If the indexes filled up, Notion would have to pause onboarding. 
That would slow down rollout and delay access for everyone waiting.</p><p>The team had two obvious options.</p><p>One was to re-shard incrementally. Clone data into another index, delete half, repeat and keep doing that every couple of weeks as new customers came in.</p><p>The other was to re-shard for the final expected volume. But their vector database provider charged for uptime, so over-provisioning would have been painfully expensive.</p><p>Instead, the Notion team went with a third approach. When a set of indexes got close to full, they provisioned a new set and directed all newly onboarded workspaces there. Each set was assigned a generation ID, which determined where reads and writes should go.</p><p>It is not the prettiest long-term design, but it was a smart short-term move. It avoided repeated re-shard operations and kept onboarding moving. Sometimes the right scaling decision is not the most elegant one. It is the one that buys breathing room without stopping the business.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a8zu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4100f229-aff6-4fd9-8fa3-f35e4878a0aa_1990x1218.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a8zu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4100f229-aff6-4fd9-8fa3-f35e4878a0aa_1990x1218.avif 424w, https://substackcdn.com/image/fetch/$s_!a8zu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4100f229-aff6-4fd9-8fa3-f35e4878a0aa_1990x1218.avif 848w, 
https://substackcdn.com/image/fetch/$s_!a8zu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4100f229-aff6-4fd9-8fa3-f35e4878a0aa_1990x1218.avif 1272w, https://substackcdn.com/image/fetch/$s_!a8zu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4100f229-aff6-4fd9-8fa3-f35e4878a0aa_1990x1218.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a8zu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4100f229-aff6-4fd9-8fa3-f35e4878a0aa_1990x1218.avif" width="1456" height="891" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4100f229-aff6-4fd9-8fa3-f35e4878a0aa_1990x1218.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:891,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:16313,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/191742179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4100f229-aff6-4fd9-8fa3-f35e4878a0aa_1990x1218.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!a8zu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4100f229-aff6-4fd9-8fa3-f35e4878a0aa_1990x1218.avif 424w, 
https://substackcdn.com/image/fetch/$s_!a8zu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4100f229-aff6-4fd9-8fa3-f35e4878a0aa_1990x1218.avif 848w, https://substackcdn.com/image/fetch/$s_!a8zu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4100f229-aff6-4fd9-8fa3-f35e4878a0aa_1990x1218.avif 1272w, https://substackcdn.com/image/fetch/$s_!a8zu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4100f229-aff6-4fd9-8fa3-f35e4878a0aa_1990x1218.avif 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">New index &#8216;generations&#8217; added as capacity fills, routing new workspaces without re-sharding. (Source: Notion)</figcaption></figure></div><p><strong>Turning onboarding into a throughput problem</strong></p><p>Even with the architecture in place, the initial onboarding rate was nowhere near enough. At launch, Notion could onboard only a few hundred workspaces per day. At that pace, clearing a multi-million waitlist would have taken decades, which is obviously not a real option.</p><p>So the team pushed hard on throughput. Using Airflow scheduling, pipelining and Spark job tuning, they dramatically increased capacity.</p><p>The results were big:</p><ul><li><p>Daily onboarding capacity increased by <strong>600x</strong></p></li><li><p>Active workspaces grew <strong>15x</strong></p></li><li><p>Vector database capacity expanded <strong>8x</strong></p></li></ul><p>By April 2024, the Q&amp;A waitlist was cleared.</p><p>That is the kind of milestone that looks clean in hindsight, but it came with a cost. Managing multiple generations of databases helped during the hypergrowth phase, but it also added operational complexity and financial overhead. The team had solved the immediate scaling problem, but the architecture was starting to feel heavy.</p><p>That set up the next phase of the story.</p><div><hr></div><h4>Part 2: Cost becomes the next constraint</h4><p>In May 2024, Notion migrated its embeddings workload from the original dedicated &#8216;pod&#8217; architecture to a serverless setup that decoupled storage from compute and charged based on usage instead of uptime.</p><p>The effect was immediate. Costs dropped by 50 percent from peak usage, translating into several million dollars in annual savings.</p><p>That alone would have made the migration worthwhile, but the serverless design also fixed two practical problems. 
First, it removed the storage capacity constraints that had become a serious scaling bottleneck. Second, it simplified operations because the team no longer had to provision capacity ahead of demand.</p><p>Even after cutting costs in half, the annual run rate for vector database spend was still in the millions. From an engineering point of view, this is where things get interesting. The easy win had already happened. Now the team had to go after deeper structural gains.</p><p><strong>A new search foundation (turbopuffer)</strong></p><p>While working on the first round of savings, Notion also evaluated alternative search engines. <a href="https://turbopuffer.com/">turbopuffer</a> stood out because it offered significantly lower projected costs.</p><p>At the time, turbopuffer was a newer player in search. Its architecture was built on object storage with a focus on cost-efficiency and performance. It also supported both managed and bring-your-own-cloud deployment models, and it made bulk modification of stored vector objects easier.</p><p>That combination lined up well with what Notion needed.</p><p>After a successful evaluation, the team decided to migrate its entire multi-billion-object workload to turbopuffer in late 2024. Since they were already making a provider switch, they used the migration as a chance to clean up the broader architecture too.</p><p>Several changes happened together.</p><p>First, they fully re-indexed the corpus, increasing write throughput in the offline indexing pipeline to rebuild everything in turbopuffer.</p><p>Second, they upgraded to a more performant embeddings model during the migration.</p><p>Third, they simplified the architecture. 
turbopuffer treats each namespace as an independent index, which removed the need to think about sharding and generation-based routing the way the team had before.</p><p>Finally, they handled the cutover gradually, migrating one generation at a time and validating correctness before moving on.</p><p>This is a strong pattern: if a migration is painful anyway, use it to pay off other infrastructure debt at the same time.</p><p>The outcome was solid on several fronts:</p><ul><li><p><strong>60 percent cost reduction</strong> on search engine spend</p></li><li><p><strong>35 percent reduction</strong> in AWS EMR compute costs</p></li><li><p>p50 production query latency <strong>improved from 70&#8211;100 ms to 50&#8211;70 ms</strong></p></li></ul><p>That is a meaningful improvement in both cost and performance, a combination that is not always easy to pull off.</p><p><strong>Avoiding full reprocessing with page state tracking</strong></p><p>The next optimization went after a very expensive inefficiency in the indexing pipeline.</p><p>Notion pages can be long, so the team chunks each page into spans, embeds each span and stores those vectors with metadata such as authors and permissions. 
In the original implementation, any edit to a page or its properties triggered a full re-chunk, full re-embed and full re-upload of all spans on that page.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ytMS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ca76a6-5c49-4f2b-ba2d-f612690b6b83_1000x189.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ytMS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ca76a6-5c49-4f2b-ba2d-f612690b6b83_1000x189.gif 424w, https://substackcdn.com/image/fetch/$s_!ytMS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ca76a6-5c49-4f2b-ba2d-f612690b6b83_1000x189.gif 848w, https://substackcdn.com/image/fetch/$s_!ytMS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ca76a6-5c49-4f2b-ba2d-f612690b6b83_1000x189.gif 1272w, https://substackcdn.com/image/fetch/$s_!ytMS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ca76a6-5c49-4f2b-ba2d-f612690b6b83_1000x189.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ytMS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ca76a6-5c49-4f2b-ba2d-f612690b6b83_1000x189.gif" width="1000" height="189" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0ca76a6-5c49-4f2b-ba2d-f612690b6b83_1000x189.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:189,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2615711,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/191742179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ca76a6-5c49-4f2b-ba2d-f612690b6b83_1000x189.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ytMS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ca76a6-5c49-4f2b-ba2d-f612690b6b83_1000x189.gif 424w, https://substackcdn.com/image/fetch/$s_!ytMS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ca76a6-5c49-4f2b-ba2d-f612690b6b83_1000x189.gif 848w, https://substackcdn.com/image/fetch/$s_!ytMS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ca76a6-5c49-4f2b-ba2d-f612690b6b83_1000x189.gif 1272w, https://substackcdn.com/image/fetch/$s_!ytMS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ca76a6-5c49-4f2b-ba2d-f612690b6b83_1000x189.gif 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Page &#8594; chunking &#8594; embedding &#8594; vector DB with full reprocessing on every edit. 
(Source: Notion)</figcaption></figure></div><p>That meant even a tiny change could trigger a lot of unnecessary work.</p><p>The team narrowed the problem down to two things that actually mattered:</p><ol><li><p>The page text changes, which means the embeddings need updating</p></li><li><p>The metadata changes, which means only the metadata needs updating</p></li></ol><p>To detect those cases, they tracked two hashes per span: one hash for the span text and another for the metadata fields. They chose 64-bit xxHash because it offered a good balance of speed, simplicity, low collision risk and storage footprint.</p><p>For caching, they used DynamoDB. Each page had one record containing the state of all spans on that page, including text and metadata hashes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Mj4k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8731bd9-7cf6-489b-bd3b-66077e710a40_1396x858.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Mj4k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8731bd9-7cf6-489b-bd3b-66077e710a40_1396x858.avif 424w, https://substackcdn.com/image/fetch/$s_!Mj4k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8731bd9-7cf6-489b-bd3b-66077e710a40_1396x858.avif 848w, https://substackcdn.com/image/fetch/$s_!Mj4k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8731bd9-7cf6-489b-bd3b-66077e710a40_1396x858.avif 1272w, 
https://substackcdn.com/image/fetch/$s_!Mj4k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8731bd9-7cf6-489b-bd3b-66077e710a40_1396x858.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Mj4k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8731bd9-7cf6-489b-bd3b-66077e710a40_1396x858.avif" width="1396" height="858" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8731bd9-7cf6-489b-bd3b-66077e710a40_1396x858.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:858,&quot;width&quot;:1396,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23088,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/191742179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8731bd9-7cf6-489b-bd3b-66077e710a40_1396x858.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Mj4k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8731bd9-7cf6-489b-bd3b-66077e710a40_1396x858.avif 424w, https://substackcdn.com/image/fetch/$s_!Mj4k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8731bd9-7cf6-489b-bd3b-66077e710a40_1396x858.avif 848w, 
https://substackcdn.com/image/fetch/$s_!Mj4k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8731bd9-7cf6-489b-bd3b-66077e710a40_1396x858.avif 1272w, https://substackcdn.com/image/fetch/$s_!Mj4k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8731bd9-7cf6-489b-bd3b-66077e710a40_1396x858.avif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Span-level hashing (text + metadata) with DynamoDB state to detect and update only changed spans. 
(Source: Notion)</figcaption></figure></div><p>The win came from using that state to avoid unnecessary work.</p><p><strong>Case 1: The page text changes</strong></p><p>Imagine Herman Melville editing <em>Moby Dick</em> halfway through a page. Before this improvement, the whole page would have been re-embedded and reloaded. After the change, the system chunks the page, fetches the previous state from DynamoDB and compares text hashes span by span. It can then detect which spans actually changed and only re-embed and reload those.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xTeN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fd33692-a8bf-4d28-bcd7-96a5201504f5_1000x151.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xTeN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fd33692-a8bf-4d28-bcd7-96a5201504f5_1000x151.gif 424w, https://substackcdn.com/image/fetch/$s_!xTeN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fd33692-a8bf-4d28-bcd7-96a5201504f5_1000x151.gif 848w, https://substackcdn.com/image/fetch/$s_!xTeN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fd33692-a8bf-4d28-bcd7-96a5201504f5_1000x151.gif 1272w, https://substackcdn.com/image/fetch/$s_!xTeN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fd33692-a8bf-4d28-bcd7-96a5201504f5_1000x151.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!xTeN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fd33692-a8bf-4d28-bcd7-96a5201504f5_1000x151.gif" width="1000" height="151" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8fd33692-a8bf-4d28-bcd7-96a5201504f5_1000x151.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:151,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1891331,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/191742179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fd33692-a8bf-4d28-bcd7-96a5201504f5_1000x151.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xTeN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fd33692-a8bf-4d28-bcd7-96a5201504f5_1000x151.gif 424w, https://substackcdn.com/image/fetch/$s_!xTeN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fd33692-a8bf-4d28-bcd7-96a5201504f5_1000x151.gif 848w, https://substackcdn.com/image/fetch/$s_!xTeN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fd33692-a8bf-4d28-bcd7-96a5201504f5_1000x151.gif 1272w, https://substackcdn.com/image/fetch/$s_!xTeN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fd33692-a8bf-4d28-bcd7-96a5201504f5_1000x151.gif 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Only changed spans are re-embedded and updated using page state + text hash comparison. (Source: Notion)</figcaption></figure></div><p>That is the kind of fix where getting the balance right matters. Miss a changed span and search quality suffers. Reprocess too much and cost stays high.</p><p><strong>Case 2: The metadata changes</strong></p><p>Now imagine Melville updates permissions so the page becomes visible to everyone. The permissions metadata changes but the text does not.</p><p>Previously, that still meant re-embedding and reloading the entire page. With the new approach, Notion compares both text and metadata hashes. If the text hashes are unchanged but metadata hashes differ, the system skips embedding entirely and issues a PATCH command to the vector database to update only the metadata. That is much cheaper than recomputing embeddings.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6qtN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44aa866c-9f82-4a8f-b065-1e5e780a2175_1000x197.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6qtN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44aa866c-9f82-4a8f-b065-1e5e780a2175_1000x197.gif 424w, https://substackcdn.com/image/fetch/$s_!6qtN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44aa866c-9f82-4a8f-b065-1e5e780a2175_1000x197.gif 848w, 
https://substackcdn.com/image/fetch/$s_!6qtN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44aa866c-9f82-4a8f-b065-1e5e780a2175_1000x197.gif 1272w, https://substackcdn.com/image/fetch/$s_!6qtN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44aa866c-9f82-4a8f-b065-1e5e780a2175_1000x197.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6qtN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44aa866c-9f82-4a8f-b065-1e5e780a2175_1000x197.gif" width="1000" height="197" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44aa866c-9f82-4a8f-b065-1e5e780a2175_1000x197.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:197,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2162583,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/191742179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44aa866c-9f82-4a8f-b065-1e5e780a2175_1000x197.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6qtN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44aa866c-9f82-4a8f-b065-1e5e780a2175_1000x197.gif 424w, 
https://substackcdn.com/image/fetch/$s_!6qtN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44aa866c-9f82-4a8f-b065-1e5e780a2175_1000x197.gif 848w, https://substackcdn.com/image/fetch/$s_!6qtN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44aa866c-9f82-4a8f-b065-1e5e780a2175_1000x197.gif 1272w, https://substackcdn.com/image/fetch/$s_!6qtN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44aa866c-9f82-4a8f-b065-1e5e780a2175_1000x197.gif 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Metadata-only changes skip embeddings and update spans via PATCH in the vector DB. (Source: Notion)</figcaption></figure></div><p>Across these changes, the Page State Project reduced data volume by 70 percent. That saved money on both embeddings API costs and vector database write costs.</p><p><strong>Moving embeddings to Ray (indexing)</strong></p><p>In July 2025, Notion started migrating its near real-time embeddings pipeline to <a href="https://www.ray.io/">Ray</a> on <a href="https://www.anyscale.com/">Anyscale</a>.</p><p>The motivation came from several pain points in the earlier setup.</p><p>One was the <strong>&#8216;double compute&#8217; problem</strong>. Spark on EMR handled preprocessing like chunking, transformations and API orchestration, but embeddings themselves were still generated through an external provider that charged per token. So the team was paying for both preprocessing infrastructure and embedding API usage.</p><p>Another issue was <strong>endpoint reliability</strong>. Fresh search indexes depended on the stability of an external embeddings API.</p><p>The third problem was <strong>clunky pipelining</strong>. 
To smooth traffic and avoid API rate limits, the team had built a multi-step handoff process where Spark jobs passed batches through S3. It worked but it was clunky.</p><p>Ray and Anyscale gave Notion a cleaner path.</p><p>Ray let the team run open-source embedding models directly, which meant more model flexibility and less dependence on external providers. By consolidating preprocessing and inference onto a single compute layer, they could cut out the double-compute setup. Ray also supports pipelining CPU-bound work such as chunking and page-state detection with GPU-bound embedding generation on the same nodes, which helps keep utilization high.</p><p>There was also a developer productivity angle. Anyscale workspaces let engineers write and test pipelines from their preferred tools without having to provision infrastructure manually.</p><p>And on the product side, self-hosting embeddings removed a third-party API hop from the user-facing path, which helped reduce end-to-end latency.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UN1z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a163616-e23f-465c-a9d4-e974253208d3_1000x362.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UN1z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a163616-e23f-465c-a9d4-e974253208d3_1000x362.gif 424w, https://substackcdn.com/image/fetch/$s_!UN1z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a163616-e23f-465c-a9d4-e974253208d3_1000x362.gif 848w, 
https://substackcdn.com/image/fetch/$s_!UN1z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a163616-e23f-465c-a9d4-e974253208d3_1000x362.gif 1272w, https://substackcdn.com/image/fetch/$s_!UN1z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a163616-e23f-465c-a9d4-e974253208d3_1000x362.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UN1z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a163616-e23f-465c-a9d4-e974253208d3_1000x362.gif" width="1000" height="362" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a163616-e23f-465c-a9d4-e974253208d3_1000x362.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:362,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1537621,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/191742179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a163616-e23f-465c-a9d4-e974253208d3_1000x362.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UN1z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a163616-e23f-465c-a9d4-e974253208d3_1000x362.gif 424w, 
https://substackcdn.com/image/fetch/$s_!UN1z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a163616-e23f-465c-a9d4-e974253208d3_1000x362.gif 848w, https://substackcdn.com/image/fetch/$s_!UN1z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a163616-e23f-465c-a9d4-e974253208d3_1000x362.gif 1272w, https://substackcdn.com/image/fetch/$s_!UN1z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a163616-e23f-465c-a9d4-e974253208d3_1000x362.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Ray natively supports pipelining CPU bound tasks (chunking, detecting page state) with GPU bound embeddings generation within the same node. (Source: Notion)</figcaption></figure></div><p>The rollout is still ongoing, but early results suggest a 90+ percent reduction in embeddings infrastructure costs. That is a major shift in how the economics of the system work.</p><p><strong>Real-time query embeddings on Ray (serving)</strong></p><p>Indexing is only half the picture. When users or agents search in Notion, queries must also be embedded on the fly before the vector database can be searched.</p><p>That makes serving latency-sensitive. The embedding has to happen fast enough that the search still feels responsive.</p><p>Hosting large embedding models is not trivial. GPU allocation, ingress routing, replication and autoscaling all matter, especially when traffic is uneven and expectations for responsiveness are high.</p><p><a href="https://docs.ray.io/en/latest/serve/index.html">Ray Serve</a> helped Notion here by handling much of that operational layer out of the box. The team could wrap open-source embedding models in persistent deployments that stay loaded on GPU, configure request batching and replication and manage the serving setup with normal Python code plus YAML-based infrastructure configuration.</p><p>That is a pretty practical endpoint for the broader journey.</p><p>What started as a vector search stack built quickly enough to launch AI Q&amp;A turned into a much more refined system: simpler in some places, more selective in others, cheaper across multiple layers and faster where users feel it. The interesting part is not any single tool choice. 
It is how the Notion team kept removing bottlenecks one by one: storage limits, awkward shard routing, redundant recomputation, external API dependence and fragmented compute layers.</p><p>That is usually what mature AI infrastructure looks like in the real world. Not one giant redesign. A sequence of sharp decisions, each fixing the thing that has become too expensive, too slow or too annoying to keep around.</p><div><hr></div><h3>The full scoop</h3><p>To learn more about this, check <a href="https://www.notion.com/blog/two-years-of-vector-search-at-notion">Notion's Engineering Blog</a> post on this topic</p>]]></content:encoded></item><item><title><![CDATA[How LinkedIn Built a Pipeline That Scales to 230M Records/sec Without Breaking SLAs]]></title><description><![CDATA[From partition strategy to adaptive throttling, the playbook behind Venice&#8217;s ingestion evolution.]]></description><link>https://www.datatinkerer.io/p/how-linkedin-built-a-pipeline-that-scales-to-230-million-records</link><guid isPermaLink="false">https://www.datatinkerer.io/p/how-linkedin-built-a-pipeline-that-scales-to-230-million-records</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Thu, 19 Feb 2026 04:00:52 GMT</pubDate><enclosure 
url="https://images.unsplash.com/photo-1712217559097-cc2aaf698767?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxM3x8bGlua2VkaW58ZW58MHx8fHwxNzcxMTMwNjQ1fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fellow Data Tinkerers</p><p>Today we will look at how LinkedIn ingests data at scale.</p><p>But before that, I wanted to share with you what you could unlock if you share Data Tinkerer with just <strong>1 more person</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y5YD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42e71ce5-0a12-4514-b7a0-ddfd0460289a_800x402.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y5YD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42e71ce5-0a12-4514-b7a0-ddfd0460289a_800x402.gif 424w, https://substackcdn.com/image/fetch/$s_!y5YD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42e71ce5-0a12-4514-b7a0-ddfd0460289a_800x402.gif 848w, https://substackcdn.com/image/fetch/$s_!y5YD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42e71ce5-0a12-4514-b7a0-ddfd0460289a_800x402.gif 1272w, https://substackcdn.com/image/fetch/$s_!y5YD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42e71ce5-0a12-4514-b7a0-ddfd0460289a_800x402.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!y5YD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42e71ce5-0a12-4514-b7a0-ddfd0460289a_800x402.gif" width="800" height="402" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42e71ce5-0a12-4514-b7a0-ddfd0460289a_800x402.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:402,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3369150,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/187999868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42e71ce5-0a12-4514-b7a0-ddfd0460289a_800x402.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!y5YD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42e71ce5-0a12-4514-b7a0-ddfd0460289a_800x402.gif 424w, https://substackcdn.com/image/fetch/$s_!y5YD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42e71ce5-0a12-4514-b7a0-ddfd0460289a_800x402.gif 848w, https://substackcdn.com/image/fetch/$s_!y5YD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42e71ce5-0a12-4514-b7a0-ddfd0460289a_800x402.gif 1272w, https://substackcdn.com/image/fetch/$s_!y5YD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42e71ce5-0a12-4514-b7a0-ddfd0460289a_800x402.gif 1456w" sizes="100vw" 
fetchpriority="high"></picture></div></a></figure></div><p>There are 100+ resources for learning all things data (science, engineering, analysis). They include videos, courses and projects and can be filtered by tech stack (Python, SQL, Spark, etc.), skill level (Beginner, Intermediate and so on), provider name or free/paid. 
So if you know other people who like staying up to date on all things data, please share Data Tinkerer with them!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/leaderboard?&amp;utm_source=post&quot;,&quot;text&quot;:&quot;Refer a friend&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.datatinkerer.io/leaderboard?&amp;utm_source=post"><span>Refer a friend</span></a></p><p>Now, with that out of the way, let&#8217;s get to Venice: LinkedIn&#8217;s data storage platform</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1712217559097-cc2aaf698767?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxM3x8bGlua2VkaW58ZW58MHx8fHwxNzcxMTMwNjQ1fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1712217559097-cc2aaf698767?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxM3x8bGlua2VkaW58ZW58MHx8fHwxNzcxMTMwNjQ1fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1712217559097-cc2aaf698767?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxM3x8bGlua2VkaW58ZW58MHx8fHwxNzcxMTMwNjQ1fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1712217559097-cc2aaf698767?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxM3x8bGlua2VkaW58ZW58MHx8fHwxNzcxMTMwNjQ1fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1272w, 
https://images.unsplash.com/photo-1712217559097-cc2aaf698767?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxM3x8bGlua2VkaW58ZW58MHx8fHwxNzcxMTMwNjQ1fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1712217559097-cc2aaf698767?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxM3x8bGlua2VkaW58ZW58MHx8fHwxNzcxMTMwNjQ1fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" width="5184" height="3456" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1712217559097-cc2aaf698767?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxM3x8bGlua2VkaW58ZW58MHx8fHwxNzcxMTMwNjQ1fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3456,&quot;width&quot;:5184,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;a computer screen with a facebook page on it&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="a computer screen with a facebook page on it" title="a computer screen with a facebook page on it" srcset="https://images.unsplash.com/photo-1712217559097-cc2aaf698767?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxM3x8bGlua2VkaW58ZW58MHx8fHwxNzcxMTMwNjQ1fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1712217559097-cc2aaf698767?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxM3x8bGlua2VkaW58ZW58MHx8fHwxNzcxMTMwNjQ1fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 848w, 
https://images.unsplash.com/photo-1712217559097-cc2aaf698767?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxM3x8bGlua2VkaW58ZW58MHx8fHwxNzcxMTMwNjQ1fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1712217559097-cc2aaf698767?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxM3x8bGlua2VkaW58ZW58MHx8fHwxNzcxMTMwNjQ1fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@getswello">Swello</a> on <a 
href="https://unsplash.com">Unsplash</a></figcaption></figure></div><h3>TL;DR</h3><div><hr></div><h4><strong>Situation</strong></h4><p>Venice powers LinkedIn&#8217;s AI-driven products and has scaled to 2,600+ stores with workloads spanning bulk loads, streaming updates and active/active replication. The ingestion pipeline had to handle throughput-heavy, CPU-heavy and latency-sensitive traffic under eventual consistency.</p><h4><strong>Task</strong></h4><p>Redesign ingestion to scale to 230M writes/sec while preserving ordering and protecting read and write SLAs. Support hybrid stores, partial updates and multi&#8211;data center replication without destabilizing clusters.</p><h4><strong>Action</strong></h4><p>Scaled bulk ingestion with partition tuning, shared consumer/writer pools and direct SST writes; tuned RocksDB via compaction triggers and BlobDB to manage amplification. Optimized CPU-heavy paths using Fast-Avro and parallel processing, then enforced priority pools and adaptive throttling to protect current-version latency.</p><h4><strong>Result</strong></h4><p>Venice now handles 175M+ key lookups/sec and 230M+ writes/sec in production. It maintains a write latency SLA under 10 minutes while safeguarding read latency as the top priority.</p><h4><strong>Use Cases</strong></h4><p>Large-scale feature stores, real-time recommendation systems, hybrid data serving, low-latency notification</p><h4><strong>Tech Stack/Framework</strong></h4><p>Apache Spark, Apache Samza, Apache Kafka, RocksDB, Fast-Avro, Adaptive Throttling</p><div><hr></div><h3>Explained further</h3><div><hr></div><h4>Background</h4><p><a href="https://github.com/linkedin/venice">Venice</a> is an open-source derived data storage platform and LinkedIn&#8217;s default storage layer for online AI use cases. 
It sits behind products like People You May Know, feed, videos, ads, notifications, the A/B testing platform, LinkedIn Learning and more.</p><p>Since Venice launched internally in 2016 it has scaled from a handful of stores to over 2,600 production stores. The workloads also evolved a lot. It started with &#8220;just bulk load a dataset&#8221; and grew into a mix of:</p><ul><li><p>Bulk loading huge offline datasets</p></li><li><p>Nearline streaming updates</p></li><li><p>Active/active replication across data centers</p></li><li><p>Partial updates that merge fields and collections</p></li><li><p>Deterministic write latency expectations under eventual consistency</p></li></ul><p>This post walks through how the ingestion pipeline was revamped to hit <strong>230 million records per second in production</strong>, what changed across the architecture, which optimizations moved the needle and how different workload types get tuned. A lot of these ideas are portable if you run any distributed ingestion system where ordering, throughput and predictable latency all matter at once.</p><div><hr></div><h4>Venice overall ingestion pipeline</h4><p>At a high level, store owners write to Venice through three paths:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BTop!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1d3a94-d249-4fcc-82c0-50950cf13705_600x297.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BTop!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1d3a94-d249-4fcc-82c0-50950cf13705_600x297.png 424w, 
https://substackcdn.com/image/fetch/$s_!BTop!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1d3a94-d249-4fcc-82c0-50950cf13705_600x297.png 848w, https://substackcdn.com/image/fetch/$s_!BTop!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1d3a94-d249-4fcc-82c0-50950cf13705_600x297.png 1272w, https://substackcdn.com/image/fetch/$s_!BTop!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1d3a94-d249-4fcc-82c0-50950cf13705_600x297.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BTop!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1d3a94-d249-4fcc-82c0-50950cf13705_600x297.png" width="600" height="297" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da1d3a94-d249-4fcc-82c0-50950cf13705_600x297.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:297,&quot;width&quot;:600,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:27952,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/187999868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1d3a94-d249-4fcc-82c0-50950cf13705_600x297.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BTop!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1d3a94-d249-4fcc-82c0-50950cf13705_600x297.png 424w, 
https://substackcdn.com/image/fetch/$s_!BTop!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1d3a94-d249-4fcc-82c0-50950cf13705_600x297.png 848w, https://substackcdn.com/image/fetch/$s_!BTop!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1d3a94-d249-4fcc-82c0-50950cf13705_600x297.png 1272w, https://substackcdn.com/image/fetch/$s_!BTop!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1d3a94-d249-4fcc-82c0-50950cf13705_600x297.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Venice overall ingestion pipeline (Source: LinkedIn)</figcaption></figure></div><ol><li><p><strong>Bulk loads</strong> from an offline processing platform (for example, Spark)</p></li><li><p><strong>Nearline writes</strong> from a stream processing platform (for example, Samza)</p></li><li><p><strong>Direct writes</strong> from online applications</p></li></ol><p>No matter which path you take, all writes pass through an intermediate PubSub broker layer. From there, the Venice Storage Node (VSN) consumes messages and persists data locally using RocksDB (an embedded key-value store).</p><p>The pipeline sounds straightforward until you operate it at scale. The same ingestion path has to support very different workloads. Some are throughput-driven (bootstrapping a massive store). Some are latency-driven (current-version updates). Some are CPU-heavy (partial updates and conflict resolution). Some are I/O-heavy (compaction, SST churn).</p><p>The following sections look at the challenges and how the LinkedIn team resolved them.</p><div><hr></div><h4>Use case 1: bootstrapping from an offline dataset</h4><p>Venice users can run bulk load jobs using offline processing platforms such as Spark to push new data versions to Venice stores. The hard part is performance for very large stores. 
To find the bottlenecks, you need to understand the ingestion path end to end.</p><p><strong>What happens during a bulk load</strong></p><ul><li><p>A Venice Push Job (VPJ) creates a new version topic for the new store version, split into multiple partitions</p></li><li><p>The Spark job uses a map-reduce framework to produce messages to that version topic</p></li><li><p>It keeps one reducer per topic partition so message ordering is preserved</p></li><li><p>On the other side, the VSN spins up consumers, reads messages and persists them into RocksDB</p></li><li><p>There is one RocksDB instance per topic partition</p></li></ul><p>So you can hit bottlenecks in three obvious places:</p><ol><li><p>producing</p></li><li><p>consuming</p></li><li><p>persisting</p></li></ol><p>Production experience says you will hit all three, just not on the same day.</p><p><strong>Improving producing and consuming throughput</strong></p><p>The usual first lever is increasing the number of partitions for large stores so you can use more of the PubSub cluster capacity. More partitions tend to mean more parallelism and more throughput.</p><p>But it comes with trade-offs:</p><ul><li><p>more partitions mean more management overhead across Venice and PubSub</p></li><li><p>there is a throughput ceiling per PubSub broker</p></li></ul><p>So partition count is not a free lunch. It&#8217;s a knob that buys you throughput and charges you complexity.</p><p><strong>Enhancing consumption scalability</strong></p><p>To keep up with the producers, VSN uses shared consumer pools across all hosted stores.</p><p>Instead of &#8220;one store version, one set of consumers,&#8221; each store version can use multiple consumers by distributing hosted partitions among them. 
The point is to keep multiple connections per PubSub broker to speed up consumption (similar to how a <a href="https://en.wikipedia.org/wiki/Download_manager">download manager</a> opens several connections at once).</p><p>The pool approach also does something boring but important: it sets an upper limit on total consumers, which puts a ceiling on cost.</p><p><strong>Optimizing I/O performance</strong></p><p>VSN uses a shared writer pool to persist changes concurrently across multiple RocksDB instances and use local SSD capacity effectively.</p><p>Ordering is critical in Venice, so for any given RocksDB instance there is only one writer actively writing to it. You still get concurrency across instances, just not inside one instance, which is the compromise that keeps ordering intact.</p><p><strong>Minimizing memory overhead</strong></p><p>Because messages for a partition are strictly ordered (thanks to the map-reduce framework), Venice uses <a href="https://github.com/facebook/rocksdb/wiki/creating-and-ingesting-sst-files">RocksDB&#8217;s SstFileWriter</a> to generate SST files directly rather than buffering writes in memtables first. 
That significantly reduces memory overhead during ingestion.</p><p><strong>Ingestion workflow in Venice Server</strong></p><p>Put together, the optimized workflow is basically: use the PubSub layer for distribution, use consumer pools for scalable reads, use writer pools for SSD throughput, preserve ordering by design and avoid memory blowups by writing SST files directly.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pbHX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8e50b-527c-4d5b-81fe-4a43bb0f3bc1_1200x944.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pbHX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8e50b-527c-4d5b-81fe-4a43bb0f3bc1_1200x944.png 424w, https://substackcdn.com/image/fetch/$s_!pbHX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8e50b-527c-4d5b-81fe-4a43bb0f3bc1_1200x944.png 848w, https://substackcdn.com/image/fetch/$s_!pbHX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8e50b-527c-4d5b-81fe-4a43bb0f3bc1_1200x944.png 1272w, https://substackcdn.com/image/fetch/$s_!pbHX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8e50b-527c-4d5b-81fe-4a43bb0f3bc1_1200x944.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pbHX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8e50b-527c-4d5b-81fe-4a43bb0f3bc1_1200x944.png" width="1200" height="944" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7bb8e50b-527c-4d5b-81fe-4a43bb0f3bc1_1200x944.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:944,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:191738,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/187999868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8e50b-527c-4d5b-81fe-4a43bb0f3bc1_1200x944.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pbHX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8e50b-527c-4d5b-81fe-4a43bb0f3bc1_1200x944.png 424w, https://substackcdn.com/image/fetch/$s_!pbHX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8e50b-527c-4d5b-81fe-4a43bb0f3bc1_1200x944.png 848w, https://substackcdn.com/image/fetch/$s_!pbHX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8e50b-527c-4d5b-81fe-4a43bb0f3bc1_1200x944.png 1272w, https://substackcdn.com/image/fetch/$s_!pbHX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8e50b-527c-4d5b-81fe-4a43bb0f3bc1_1200x944.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Optimized Venice pipeline (Source: LinkedIn)</figcaption></figure></div><div><hr></div><h4>Use case 2: hybrid store</h4><p>Venice supports Lambda-architecture-style use cases by merging updates from both <strong>bulk loads</strong> and <strong>nearline writes</strong>. 
Users query a single store and get a unified view.</p><p><strong>Venice hybrid store workflow</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BaZo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bfa8e8-d929-494f-ad6b-b512ede167aa_1024x375.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BaZo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bfa8e8-d929-494f-ad6b-b512ede167aa_1024x375.png 424w, https://substackcdn.com/image/fetch/$s_!BaZo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bfa8e8-d929-494f-ad6b-b512ede167aa_1024x375.png 848w, https://substackcdn.com/image/fetch/$s_!BaZo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bfa8e8-d929-494f-ad6b-b512ede167aa_1024x375.png 1272w, https://substackcdn.com/image/fetch/$s_!BaZo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bfa8e8-d929-494f-ad6b-b512ede167aa_1024x375.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BaZo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bfa8e8-d929-494f-ad6b-b512ede167aa_1024x375.png" width="1024" height="375" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94bfa8e8-d929-494f-ad6b-b512ede167aa_1024x375.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:375,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64710,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/187999868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bfa8e8-d929-494f-ad6b-b512ede167aa_1024x375.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BaZo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bfa8e8-d929-494f-ad6b-b512ede167aa_1024x375.png 424w, https://substackcdn.com/image/fetch/$s_!BaZo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bfa8e8-d929-494f-ad6b-b512ede167aa_1024x375.png 848w, https://substackcdn.com/image/fetch/$s_!BaZo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bfa8e8-d929-494f-ad6b-b512ede167aa_1024x375.png 1272w, https://substackcdn.com/image/fetch/$s_!BaZo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94bfa8e8-d929-494f-ad6b-b512ede167aa_1024x375.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Hybrid store workflow (Source: LinkedIn)</figcaption></figure></div><p>How it works:</p><ul><li><p>each bulk load creates a new store version</p></li><li><p>that version has a new Kafka topic and a new database instance</p></li><li><p>real-time updates produced by a Samza job via a real-time topic are appended to both version topics to keep them current</p></li><li><p>once the new version catches up fully, it is swapped in as the active version to serve reads</p></li></ul><p>The hybrid store is important because it gives you a clean &#8220;new version build&#8221; story without losing real-time freshness. 
But it creates a new challenge: the database transitions from <strong>read-only</strong> to <strong>read-write</strong>.</p><p>That&#8217;s where <a href="https://github.com/facebook/rocksdb/wiki">RocksDB</a> tuning matters, because duplicate entries for the same key start showing up more often: keys get updated or deleted after they were inserted. RocksDB uses <a href="https://github.com/facebook/rocksdb/wiki/Compaction">compaction</a> to remove stale entries, but compaction has overhead: it scans, merges and rewrites SST files, consuming CPU, I/O and disk space.</p><p>So the core problem becomes: tune RocksDB to balance <a href="https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#amplification-factors">three competing types of pain</a>:</p><ul><li><p><strong>Write amplification</strong>: bytes written to storage vs bytes written to the DB</p></li><li><p><strong>Read amplification</strong>: number of disk reads per query</p></li><li><p><strong>Space amplification</strong>: size of DB files on disk vs the actual data size</p></li></ul><p>Venice uses <a href="https://github.com/facebook/rocksdb/wiki/Leveled-Compaction">leveled compaction</a> by default and relies primarily on two methods to balance those trade-offs.</p><p><strong>1. Tuning the compaction trigger</strong></p><p>The key setting here is:</p><ul><li><p><strong>level0_file_num_compaction_trigger</strong></p></li></ul><p>This sets how many SST files can accumulate in Level-0 before compaction is triggered. 
Once Level-0 reaches that count, compaction kicks in to push SST files from Level-0 to Level-1 and onward as each level fills.</p><p>Why it matters:</p><ul><li><p>higher threshold &#8594; fewer compactions &#8594; lower write amplification</p></li><li><p>but also more Level-0 files &#8594; higher read amplification, since Level-0 files can overlap in key range and a read may have to check several of them</p></li><li><p>plus higher space amplification because stale duplicates hang around longer</p></li></ul><p>Venice tunes this per cluster because clusters have different bottlenecks:</p><ul><li><p><strong>memory-serving clusters</strong> want data in RAM to speed up lookups. Memory is the limiting resource, so they set a <strong>lower threshold</strong> to reduce space amplification</p></li><li><p><strong>disk-serving clusters</strong> are often limited by disk I/O, so they set a <strong>higher threshold</strong> to reduce compaction frequency and lower the disk write rate</p></li></ul><p>This is a practical tuning philosophy: tune to your real bottleneck, not a generic best practice.</p><p><strong>2. 
RocksDB BlobDB integration</strong></p><p><a href="https://github.com/facebook/rocksdb/wiki/BlobDB">BlobDB</a> is aimed at large-value workloads through key-value separation:</p><ul><li><p>Large values go into blob files</p></li><li><p>LSM tree stores small pointers</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DT0h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d11f76-4d71-47dd-a41a-37f0bc48666a_1200x447.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DT0h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d11f76-4d71-47dd-a41a-37f0bc48666a_1200x447.png 424w, https://substackcdn.com/image/fetch/$s_!DT0h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d11f76-4d71-47dd-a41a-37f0bc48666a_1200x447.png 848w, https://substackcdn.com/image/fetch/$s_!DT0h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d11f76-4d71-47dd-a41a-37f0bc48666a_1200x447.png 1272w, https://substackcdn.com/image/fetch/$s_!DT0h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d11f76-4d71-47dd-a41a-37f0bc48666a_1200x447.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DT0h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d11f76-4d71-47dd-a41a-37f0bc48666a_1200x447.png" width="1200" height="447" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74d11f76-4d71-47dd-a41a-37f0bc48666a_1200x447.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:447,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:156086,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/187999868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d11f76-4d71-47dd-a41a-37f0bc48666a_1200x447.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DT0h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d11f76-4d71-47dd-a41a-37f0bc48666a_1200x447.png 424w, https://substackcdn.com/image/fetch/$s_!DT0h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d11f76-4d71-47dd-a41a-37f0bc48666a_1200x447.png 848w, https://substackcdn.com/image/fetch/$s_!DT0h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d11f76-4d71-47dd-a41a-37f0bc48666a_1200x447.png 1272w, https://substackcdn.com/image/fetch/$s_!DT0h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d11f76-4d71-47dd-a41a-37f0bc48666a_1200x447.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">RocksDB BlobDB structure</figcaption></figure></div><p>This avoids copying large values repeatedly during compaction, reducing write amplification. The cost is additional space amplification because blobs can become unreferenced and require garbage collection.</p><p>For Venice, BlobDB integration reduced write amplification significantly in multi-tenant clusters, especially for large-value use cases. The reported impact here is big: <strong>more than a 50% reduction of disk write throughput</strong>. 
That matters because it avoided scaling out clusters when CPU and storage space were still available.</p><p>The win here is: you stop paying the compaction tax over and over on the same large payloads.</p><div><hr></div><h4>Use case 3: Active/active replication with partial update</h4><p>Venice guarantees eventual consistency, not strong consistency. That matters because it means you cannot just do read-modify-write operations directly due to write delays.</p><p>To handle this, Venice introduces <strong>partial update</strong>, a specialized operation that supports field-level updates and collection merges.</p><p><strong>Venice partial update workflow</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ay5v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64784492-8dfb-479c-8521-7b2ce0f62c5d_840x1320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ay5v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64784492-8dfb-479c-8521-7b2ce0f62c5d_840x1320.png 424w, https://substackcdn.com/image/fetch/$s_!ay5v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64784492-8dfb-479c-8521-7b2ce0f62c5d_840x1320.png 848w, https://substackcdn.com/image/fetch/$s_!ay5v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64784492-8dfb-479c-8521-7b2ce0f62c5d_840x1320.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ay5v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64784492-8dfb-479c-8521-7b2ce0f62c5d_840x1320.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ay5v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64784492-8dfb-479c-8521-7b2ce0f62c5d_840x1320.png" width="840" height="1320" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64784492-8dfb-479c-8521-7b2ce0f62c5d_840x1320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1320,&quot;width&quot;:840,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:279575,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/187999868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64784492-8dfb-479c-8521-7b2ce0f62c5d_840x1320.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ay5v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64784492-8dfb-479c-8521-7b2ce0f62c5d_840x1320.png 424w, https://substackcdn.com/image/fetch/$s_!ay5v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64784492-8dfb-479c-8521-7b2ce0f62c5d_840x1320.png 848w, 
https://substackcdn.com/image/fetch/$s_!ay5v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64784492-8dfb-479c-8521-7b2ce0f62c5d_840x1320.png 1272w, https://substackcdn.com/image/fetch/$s_!ay5v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64784492-8dfb-479c-8521-7b2ce0f62c5d_840x1320.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Venice partial update (Source: LinkedIn)</figcaption></figure></div><p>Inside the Venice server, the leader 
replica:</p><ul><li><p>decodes the incoming payload</p></li><li><p>applies the update</p></li><li><p>re-encodes the result</p></li><li><p>writes to the local database</p></li><li><p>writes to the Version Topic</p></li><li><p>follower replicas consume the merged results</p></li></ul><p>Most of that is CPU-heavy.</p><p>Then the platform evolved further with active/active replication across multiple data centers. The key mechanism is deterministic conflict resolution (DCR), similar to CRDTs. Venice tracks update timestamps at row and field levels, compares incoming timestamps with existing ones and decides to apply or skip.</p><p><strong>Venice Active/Active workflow</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!36Hk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F285f3b43-13b3-4437-b6fd-72984728e60b_1024x1516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!36Hk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F285f3b43-13b3-4437-b6fd-72984728e60b_1024x1516.png 424w, https://substackcdn.com/image/fetch/$s_!36Hk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F285f3b43-13b3-4437-b6fd-72984728e60b_1024x1516.png 848w, https://substackcdn.com/image/fetch/$s_!36Hk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F285f3b43-13b3-4437-b6fd-72984728e60b_1024x1516.png 1272w, 
https://substackcdn.com/image/fetch/$s_!36Hk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F285f3b43-13b3-4437-b6fd-72984728e60b_1024x1516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!36Hk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F285f3b43-13b3-4437-b6fd-72984728e60b_1024x1516.png" width="1024" height="1516" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/285f3b43-13b3-4437-b6fd-72984728e60b_1024x1516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1516,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:510735,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/187999868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F285f3b43-13b3-4437-b6fd-72984728e60b_1024x1516.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!36Hk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F285f3b43-13b3-4437-b6fd-72984728e60b_1024x1516.png 424w, https://substackcdn.com/image/fetch/$s_!36Hk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F285f3b43-13b3-4437-b6fd-72984728e60b_1024x1516.png 848w, 
https://substackcdn.com/image/fetch/$s_!36Hk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F285f3b43-13b3-4437-b6fd-72984728e60b_1024x1516.png 1272w, https://substackcdn.com/image/fetch/$s_!36Hk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F285f3b43-13b3-4437-b6fd-72984728e60b_1024x1516.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Venice Active/Active workflow (Source: LinkedIn)</figcaption></figure></div><p>Now the leader replica has even more 
to do for DCR:</p><ul><li><p>timestamp metadata lookup</p></li><li><p>decoding</p></li><li><p>encoding</p></li></ul><p>Again: CPU-heavy. So the optimizations below focus on CPU efficiency.</p><p><strong>1. Fast-Avro adoption</strong></p><p><a href="https://github.com/linkedin/avro-util">Fast-Avro</a> was originally developed by RTBHouse, but LinkedIn took over maintenance under the LinkedIn namespace and introduced many optimizations.</p><p>The key idea: Fast-Avro is an alternative implementation of Apache Avro serialization and deserialization that uses runtime code generation and performs significantly better than the native implementation. It supports multiple Avro versions at runtime and is widely adopted inside LinkedIn.</p><p>Venice fully integrated Fast-Avro and saw, in one major use case, up to a <strong>90% improvement in deserialization latency at p99</strong> on the application side.</p><p><strong>2. Parallel processing</strong></p><p>In the traditional pipeline, DCR and partial update operations were executed sequentially, record by record within the same partition. That leads to CPU underutilization.</p><p>Venice introduced parallel processing so multiple records can be handled concurrently within the same partition <em>before</em> they are produced to the version topic, while still preserving strict ordering in the final step.</p><p>Result: significantly improved write throughput for these complex record types.</p><div><hr></div><h4>Use case 4: Active/active replication with deterministic write latency</h4><p>Eventually consistent systems still get judged by human expectations. People want their writes to show up and they want it to happen predictably.</p><p>Venice is versioned and can ingest backup, current and future versions concurrently in a single server instance. 
In practice, though, only the current version serves reads, so deterministic write latency guarantees focus mostly there.</p><p>To improve determinism, Venice introduced a pooling strategy in ingestion with <strong>different priorities</strong> for different workload types. The Venice consumer phase is the first phase in the server ingestion pipeline, and controlling the polling rate via pools is how prioritization happens.</p><p>Broad priority tiers:</p><ul><li><p>top priority: active/active and partial update workloads for the <strong>current version on the leader replica</strong> (CPU-intensive and latency-sensitive)</p></li><li><p>next: other workload types targeting the current version</p></li><li><p>then: active/active or partial update workloads for backup or future versions on the leader replica</p></li><li><p>finally: everything else in a lower-priority bucket</p></li></ul><p>This design is trying to do a few practical things:</p><ul><li><p>isolate CPU-heavy workloads so they don&#8217;t slow down lighter ones</p></li><li><p>prioritize the current version so the most up-to-date data flows smoothly</p></li><li><p>keep the number of pools limited to avoid resource management turning into a second job</p></li></ul><p>The catch is tuning. Clusters see different workloads, store behavior varies widely even within one cluster, throughput swings over time and read traffic changes throughout the day. Static configs force you to tune for the worst case, which wastes resources most of the time.</p><p>So Venice introduced adaptive throttling: ingestion is adjusted dynamically based on recent performance.</p><ul><li><p>if the system is within agreed SLAs, ingestion rates are adjusted according to priorities</p></li><li><p>if an SLA is violated, ingestion is throttled back immediately</p></li></ul><p>Defining the SLAs matters. Venice focuses on two key criteria:</p><ol><li><p><strong>Read latency SLA</strong>: highest priority. 
Never violate read latency SLAs, even if it costs ingestion throughput</p></li><li><p><strong>Write latency SLA for the current version</strong>: while read latency SLAs are met, write latency for the current version becomes the top priority, and pools are tuned proportionally to maximize utilization and throughput</p></li></ol><div><hr></div><h4><strong>Wrapping up</strong></h4><p>With these optimizations, Venice at LinkedIn handles:</p><ul><li><p>Over <strong>175 million key lookups per second</strong></p></li><li><p>Over <strong>230 million writes per second</strong></p></li><li><p>While maintaining a <strong>write latency SLA under 10 minutes</strong></p></li></ul><div><hr></div><h3>The full scoop</h3><p>To learn more, check out <a href="https://www.linkedin.com/blog/engineering/infrastructure/evolution-of-the-venice-ingestion-pipeline">LinkedIn's Engineering Blog</a> post on this topic.</p><div><hr></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;7dd74b6f-84de-4b87-a0cf-3e440ec7dc65&quot;,&quot;caption&quot;:&quot;Grab needed to detect schema and value issues in Kafka streams while data was still in motion.<br /><br />This piece breaks down how they introduced real-time checks and fast alerts to catch poison events before they spread.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How Grab Detects Data Issues across 100+ Kafka Topics Before They Spread&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:291590442,&quot;name&quot;:&quot;Data Tinkerer&quot;,&quot;bio&quot;:&quot;Head of analytics sharing deep dives and learnings about AI and all things data (science, engineering, 
analysis)&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83d3bbe0-5fb8-4f8d-9b74-036abbd6fec9_500x500.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-01-15T04:15:57.055Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1624957083543-9a67140fabfd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw0fHxncmFifGVufDB8fHx8MTc2NzkzMTIwOXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.datatinkerer.io/p/how-grab-detects-data-issues-across-100-kafka-topics&quot;,&quot;section_name&quot;:&quot;Data Engineering&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:183755897,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:15,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3422740,&quot;publication_name&quot;:&quot;Data Tinkerer&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!JEdj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bea5ccd-f356-4154-a1db-16268300510e_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3>Keep learning</h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;2b5e61e3-2de5-4088-981d-80de61411bd4&quot;,&quot;caption&quot;:&quot;Uber rebuilt its data lake ingestion to move freshness from hours to minutes.<br /><br />This piece breaks down how they replaced batch Spark jobs with Flink streaming, cut compute by 25% and dealt with the very real problems that show up at petabyte scale.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How Uber Cut Data Lake Freshness From Hours to 
Minutes With Flink&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:291590442,&quot;name&quot;:&quot;Data Tinkerer&quot;,&quot;bio&quot;:&quot;Ex-head of analytics sharing deep dives and learnings about AI and all things data (science, engineering, analysis)&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83d3bbe0-5fb8-4f8d-9b74-036abbd6fec9_500x500.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-01-02T04:30:31.300Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1615929362369-4af8423ebd68?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNHx8dWJlcnxlbnwwfHx8fDE3NjcwNzc2NDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.datatinkerer.io/p/how-uber-cut-data-lake-freshness-from-hours-to-minutes-with-flink&quot;,&quot;section_name&quot;:&quot;Data Engineering&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:182833470,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:17,&quot;comment_count&quot;:1,&quot;publication_id&quot;:3422740,&quot;publication_name&quot;:&quot;Data Tinkerer&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!JEdj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bea5ccd-f356-4154-a1db-16268300510e_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div>]]></content:encoded></item><item><title><![CDATA[How Grab Detects Data Issues across 100+ Kafka Topics Before They Spread]]></title><description><![CDATA[Real-time stream validation surfaces poison records early and notifies owners with context]]></description><link>https://www.datatinkerer.io/p/how-grab-detects-data-issues-across-100-kafka-topics</link><guid 
isPermaLink="false">https://www.datatinkerer.io/p/how-grab-detects-data-issues-across-100-kafka-topics</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Thu, 15 Jan 2026 04:15:57 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1624957083543-9a67140fabfd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw0fHxncmFifGVufDB8fHx8MTc2NzkzMTIwOXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fellow Data Tinkerers</p><p>Today we will look at how Grab detects data issues in real time.</p><p>But before that, I wanted to share with you what you could unlock if you share Data Tinkerer with just <strong>1 more person</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Doc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f76efe7-9d16-4ec3-8e25-0e3d2282705b_800x402.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Doc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f76efe7-9d16-4ec3-8e25-0e3d2282705b_800x402.gif 424w, https://substackcdn.com/image/fetch/$s_!6Doc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f76efe7-9d16-4ec3-8e25-0e3d2282705b_800x402.gif 848w, https://substackcdn.com/image/fetch/$s_!6Doc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f76efe7-9d16-4ec3-8e25-0e3d2282705b_800x402.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!6Doc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f76efe7-9d16-4ec3-8e25-0e3d2282705b_800x402.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Doc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f76efe7-9d16-4ec3-8e25-0e3d2282705b_800x402.gif" width="800" height="402" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f76efe7-9d16-4ec3-8e25-0e3d2282705b_800x402.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:402,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3369150,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/183755897?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f76efe7-9d16-4ec3-8e25-0e3d2282705b_800x402.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Doc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f76efe7-9d16-4ec3-8e25-0e3d2282705b_800x402.gif 424w, https://substackcdn.com/image/fetch/$s_!6Doc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f76efe7-9d16-4ec3-8e25-0e3d2282705b_800x402.gif 848w, https://substackcdn.com/image/fetch/$s_!6Doc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f76efe7-9d16-4ec3-8e25-0e3d2282705b_800x402.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!6Doc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f76efe7-9d16-4ec3-8e25-0e3d2282705b_800x402.gif 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>There are 100+ resources to learn all things data (science, engineering, analysis). They include videos, courses and projects, and can be filtered by tech stack (Python, SQL, Spark, etc.), skill level (Beginner, Intermediate and so on), provider name or free/paid. 
So if you know other people who like staying up to date on all things data, please share Data Tinkerer with them!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/leaderboard?&amp;utm_source=post&quot;,&quot;text&quot;:&quot;Refer a friend&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.datatinkerer.io/leaderboard?&amp;utm_source=post"><span>Refer a friend</span></a></p><p>Now, with that out of the way, let&#8217;s get to Grab&#8217;s real-time work!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1558899367-3cd83fb31ed8?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxncmFifGVufDB8fHx8MTc2NzkzMTIwOXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1558899367-3cd83fb31ed8?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxncmFifGVufDB8fHx8MTc2NzkzMTIwOXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1558899367-3cd83fb31ed8?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxncmFifGVufDB8fHx8MTc2NzkzMTIwOXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1558899367-3cd83fb31ed8?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxncmFifGVufDB8fHx8MTc2NzkzMTIwOXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1558899367-3cd83fb31ed8?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxncmFifGVufDB8fHx8MTc2NzkzMTIwOXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img 
src="https://images.unsplash.com/photo-1558899367-3cd83fb31ed8?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxncmFifGVufDB8fHx8MTc2NzkzMTIwOXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" width="6000" height="4000" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1558899367-3cd83fb31ed8?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxncmFifGVufDB8fHx8MTc2NzkzMTIwOXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:4000,&quot;width&quot;:6000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;man riding bicycle&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="man riding bicycle" title="man riding bicycle" srcset="https://images.unsplash.com/photo-1558899367-3cd83fb31ed8?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxncmFifGVufDB8fHx8MTc2NzkzMTIwOXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1558899367-3cd83fb31ed8?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxncmFifGVufDB8fHx8MTc2NzkzMTIwOXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1558899367-3cd83fb31ed8?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxncmFifGVufDB8fHx8MTc2NzkzMTIwOXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1272w, 
https://images.unsplash.com/photo-1558899367-3cd83fb31ed8?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxncmFifGVufDB8fHx8MTc2NzkzMTIwOXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@javaistan">Afif Ramdhasuma</a> on <a href="https://unsplash.com">Unsplash</a></figcaption></figure></div><h3>TL;DR</h3><div><hr></div><h4><strong>Situation</strong></h4><p>Grab runs critical systems on Kafka streams, where bad data can spread and break downstream consumers. 
Existing checks were slow and mostly limited to schemas, making issues hard to catch and debug.</p><h4><strong>Task</strong></h4><p>Detect bad streaming data early, cover both schema and value-level issues and give stream owners fast, actionable visibility without centralising ownership.</p><h4><strong>Action</strong></h4><p>Grab built contract-driven stream checks on Coban, turning schemas, field rules and ownership into real-time FlinkSQL tests with Slack alerts and UI-based inspection of bad records.</p><h4><strong>Result</strong></h4><p>The system now monitors 100+ Kafka topics in real time, surfaces poison data quickly and helps teams stop issues before they cascade downstream.</p><h4><strong>Use Cases</strong></h4><p>Root cause analysis, real-time monitoring, real-time alerting</p><h4><strong>Tech Stack/Framework</strong></h4><p>Apache Kafka, Apache Flink, Amazon S3, Slack, LLM</p><div><hr></div><h3>Explained further</h3><div><hr></div><h4><strong>About Grab</strong></h4><p><a href="https://www.grab.com/">Grab</a> is often called the Uber of Southeast Asia but that might be selling it short. What started as a ride-hailing app now powers food delivery, groceries, payments and even insurance, all bundled into one super app. It operates in over 800 cities across 8 Southeast Asian countries. Behind the rides, meals and payments lies an enormous stream of events flowing through Grab&#8217;s systems.</p><div><hr></div><h4>Background</h4><p>Grab runs much of its business on streaming data. Kafka topics feed online systems, offline analytics and machine learning pipelines. When those streams are clean, life is good: teams move faster, models behave and dashboards run smoothly. But when they&#8217;re not clean, it&#8217;s a major headache.</p><p>The tricky part is that &#8216;bad data&#8217; in Kafka isn&#8217;t always obvious. 
Sometimes it&#8217;s quiet: the stream still parses but key fields are wrong, missing or shaped differently from what downstream teams assume.</p><p>That&#8217;s why Grab decided to introduce a platform-level solution: Kafka stream contracts that let stream stakeholders define what &#8216;good&#8217; looks like, then automatically test streams in real time, catch issues as they happen and alert the owners quickly.</p><p>The core idea is simple:</p><ul><li><p>Let users define a data contract for a Kafka topic</p></li><li><p>Convert that contract into executable tests</p></li><li><p>Run those tests continuously</p></li><li><p>Capture the poison data plus context</p></li><li><p>Notify the right people with enough detail to act</p></li></ul><p>This supports a more decentralized, data-mesh-style world where teams own their data products while still keeping the overall system reliable for everyone else.</p><div><hr></div><h4>What wasn&#8217;t working before</h4><p>Historically, Kafka stream monitoring lacked a strong, end-to-end solution for data quality validation. That created three big issues: detecting bad data, speed of detection and lack of visibility.</p><p><strong>1- Detecting bad data</strong></p><p>This breaks down into two categories:</p><p><strong>1.1 Schema issues</strong></p><p>These are schema mismatches between producers and consumers that can trigger deserialization errors. Even if schema backward compatibility is validated during schema evolution, the data inside the Kafka topic can still drift from the defined schema.</p><p>One concrete example: a rogue producer writes to a topic without using the expected schema. Now you&#8217;ve got a topic that &#8216;has a schema&#8217; but real events don&#8217;t match it. 
The painful bit is not just knowing something broke; it&#8217;s identifying which fields are causing the mismatch.</p><p><strong>1.2 Rule and value issues</strong></p><p>These are disagreements about what a field <em>means</em> or what shape it should take. Kafka stream schemas define structure but they don&#8217;t enforce rules like:</p><ul><li><p>expected length for an identifier</p></li><li><p>expected string pattern</p></li><li><p>valid numeric ranges</p></li><li><p>constant values that should never change</p></li></ul><p>No framework existed where stakeholders could define and enforce field-level semantic rules for streams.</p><p><strong>2- Speed of detection</strong></p><p>There was no real-time mechanism to automatically validate data against predefined rules, identify issues quickly and alert stakeholders promptly.</p><p>Without real-time validation, issues could stick around for a while, quietly impacting multiple online and offline downstream systems before being discovered.</p><p><strong>3- Lack of visibility</strong></p><p>Even when teams did detect a problem, it was hard to pinpoint the exact &#8216;poison data&#8217; and understand what violated the schema or the semantic expectations.</p><p>Root cause analysis becomes painful when you cannot easily answer:</p><ul><li><p>Which records were bad?</p></li><li><p>Which fields failed?</p></li><li><p>What did the bad values look like?</p></li><li><p>When did it start and how frequent is it?</p></li></ul><div><hr></div><h4>The fix</h4><p>Grab&#8217;s Coban platform provides standardized data quality testing and observability for Kafka streams. 
It&#8217;s built around four core ideas:</p><ol><li><p><strong>Data Contract Definition: </strong>Stream stakeholders define a contract that includes schema agreements, semantic rules the topic data must follow, and ownership metadata for alerts and notifications.</p></li><li><p><strong>Automated Test Execution: </strong>A long-running test runner automatically executes real-time tests based on that contract.</p></li><li><p><strong>Real-time Data Quality Issue Identification: </strong>The system detects data issues in real time at both schema and rules/values levels.</p></li><li><p><strong>Alerts and Result Observability: </strong>It alerts the right people and makes it easier to observe issues through the platform UI and downstream tooling.</p></li></ol><p>Put simply: define the rules once, then let the platform watch the stream continuously.</p><p>The architecture has three main components:</p><ol><li><p><strong>Data contract definition</strong></p></li><li><p><strong>Test execution and data quality issue identification</strong></p></li><li><p><strong>Result observability</strong></p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!opMG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f5a043c-954d-4cfc-b21e-a3dfd346fe47_1991x743.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!opMG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f5a043c-954d-4cfc-b21e-a3dfd346fe47_1991x743.jpeg 424w, https://substackcdn.com/image/fetch/$s_!opMG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f5a043c-954d-4cfc-b21e-a3dfd346fe47_1991x743.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!opMG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f5a043c-954d-4cfc-b21e-a3dfd346fe47_1991x743.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!opMG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f5a043c-954d-4cfc-b21e-a3dfd346fe47_1991x743.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!opMG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f5a043c-954d-4cfc-b21e-a3dfd346fe47_1991x743.jpeg" width="1456" height="543" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f5a043c-954d-4cfc-b21e-a3dfd346fe47_1991x743.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:543,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:224037,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/183755897?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f5a043c-954d-4cfc-b21e-a3dfd346fe47_1991x743.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!opMG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f5a043c-954d-4cfc-b21e-a3dfd346fe47_1991x743.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!opMG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f5a043c-954d-4cfc-b21e-a3dfd346fe47_1991x743.jpeg 848w, https://substackcdn.com/image/fetch/$s_!opMG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f5a043c-954d-4cfc-b21e-a3dfd346fe47_1991x743.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!opMG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f5a043c-954d-4cfc-b21e-a3dfd346fe47_1991x743.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Real-time Kafka Stream Data Quality Monitoring Architecture (Source: Grab)</figcaption></figure></div><p>All Flow references below point to the numbered steps in the diagram above.</p><div><hr></div><h4><strong>Data contract definition</strong></h4><p>Coban&#8217;s contract acts as a formal agreement among Kafka stream stakeholders. It includes a few building blocks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KSXy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73bb2962-4b96-49b4-ad1d-7709b39753d7_836x758.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KSXy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73bb2962-4b96-49b4-ad1d-7709b39753d7_836x758.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KSXy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73bb2962-4b96-49b4-ad1d-7709b39753d7_836x758.jpeg 848w, https://substackcdn.com/image/fetch/$s_!KSXy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73bb2962-4b96-49b4-ad1d-7709b39753d7_836x758.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KSXy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73bb2962-4b96-49b4-ad1d-7709b39753d7_836x758.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!KSXy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73bb2962-4b96-49b4-ad1d-7709b39753d7_836x758.jpeg" width="836" height="758" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73bb2962-4b96-49b4-ad1d-7709b39753d7_836x758.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:758,&quot;width&quot;:836,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120852,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/183755897?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73bb2962-4b96-49b4-ad1d-7709b39753d7_836x758.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KSXy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73bb2962-4b96-49b4-ad1d-7709b39753d7_836x758.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KSXy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73bb2962-4b96-49b4-ad1d-7709b39753d7_836x758.jpeg 848w, https://substackcdn.com/image/fetch/$s_!KSXy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73bb2962-4b96-49b4-ad1d-7709b39753d7_836x758.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KSXy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73bb2962-4b96-49b4-ad1d-7709b39753d7_836x758.jpeg 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: Grab</figcaption></figure></div><p><strong>Kafka Stream Schema (Flow 1.1)</strong></p><p>The contract includes the schema used by the Kafka topic under test. This helps the Test Runner validate schema compatibility across data streams.</p><p>Importantly, this is not only about &#8220;did the schema change.&#8221; It&#8217;s also about &#8220;does the data actually match what everyone believes the schema is.&#8221;</p><p><strong>Kafka Stream Configuration (Flow 1.2)</strong></p><p>This includes essential config like endpoint and topic name. 
Coban automatically populates this so users don&#8217;t have to wire everything manually.</p><p><strong>Observability Metadata (Flow 1.3)</strong></p><p>This is where ownership becomes real. The contract includes contact details for stream stakeholders and alert configurations so the right people get notified when issues show up.</p><p><strong>Kafka Stream Semantic Test Rules (Flow 1.5)</strong></p><p>This is the heart of the semantic side. Users can define intuitive field-level rules such as:</p><ul><li><p>string pattern checks</p></li><li><p>number range checks</p></li><li><p>constant value checks</p></li></ul><p>The point is to make the &#8220;meaning&#8221; of fields enforceable, not just their data types.</p><p><strong>LLM-Based Semantic Test Rules Recommendation (Flow 1.4)</strong></p><p>Defining dozens or hundreds of field rules can overwhelm people. To reduce that setup burden, Coban offers an LLM-based feature that recommends semantic test rules based on:</p><ul><li><p>the provided Kafka stream schema</p></li><li><p>anonymized sample data</p></li></ul><p>This feature helps users set up semantic rules efficiently, as the sample UI below shows.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pu8X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96422b2-5c1c-4f0a-8fb8-af92a409d344_1999x717.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pu8X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96422b2-5c1c-4f0a-8fb8-af92a409d344_1999x717.png 424w, 
https://substackcdn.com/image/fetch/$s_!pu8X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96422b2-5c1c-4f0a-8fb8-af92a409d344_1999x717.png 848w, https://substackcdn.com/image/fetch/$s_!pu8X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96422b2-5c1c-4f0a-8fb8-af92a409d344_1999x717.png 1272w, https://substackcdn.com/image/fetch/$s_!pu8X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96422b2-5c1c-4f0a-8fb8-af92a409d344_1999x717.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pu8X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96422b2-5c1c-4f0a-8fb8-af92a409d344_1999x717.png" width="1456" height="522" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f96422b2-5c1c-4f0a-8fb8-af92a409d344_1999x717.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:522,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:241835,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/183755897?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96422b2-5c1c-4f0a-8fb8-af92a409d344_1999x717.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!pu8X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96422b2-5c1c-4f0a-8fb8-af92a409d344_1999x717.png 424w, https://substackcdn.com/image/fetch/$s_!pu8X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96422b2-5c1c-4f0a-8fb8-af92a409d344_1999x717.png 848w, https://substackcdn.com/image/fetch/$s_!pu8X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96422b2-5c1c-4f0a-8fb8-af92a409d344_1999x717.png 1272w, https://substackcdn.com/image/fetch/$s_!pu8X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96422b2-5c1c-4f0a-8fb8-af92a409d344_1999x717.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Sample UI showcasing LLM-based Kafka stream schema field-level semantic test rules (Source: Grab)</figcaption></figure></div><p>The practical benefit: users get a starting point quickly, instead of staring at a schema and trying to invent rules from scratch.</p><div><hr></div><h4><strong>Data contract transformation</strong></h4><p>Once a contract is defined, Coban&#8217;s transformation engine converts it into configurations the Test Runner can interpret (Flow 2.1).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nvEa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26d3201-18ad-4d6f-af2f-727f179fc238_1122x660.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nvEa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26d3201-18ad-4d6f-af2f-727f179fc238_1122x660.jpeg 424w, https://substackcdn.com/image/fetch/$s_!nvEa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26d3201-18ad-4d6f-af2f-727f179fc238_1122x660.jpeg 848w, https://substackcdn.com/image/fetch/$s_!nvEa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26d3201-18ad-4d6f-af2f-727f179fc238_1122x660.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!nvEa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26d3201-18ad-4d6f-af2f-727f179fc238_1122x660.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nvEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26d3201-18ad-4d6f-af2f-727f179fc238_1122x660.jpeg" width="1122" height="660" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a26d3201-18ad-4d6f-af2f-727f179fc238_1122x660.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:660,&quot;width&quot;:1122,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:135025,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/183755897?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4065d7-b4c5-4f78-8761-0addce18f606_1122x660.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nvEa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26d3201-18ad-4d6f-af2f-727f179fc238_1122x660.jpeg 424w, https://substackcdn.com/image/fetch/$s_!nvEa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26d3201-18ad-4d6f-af2f-727f179fc238_1122x660.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!nvEa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26d3201-18ad-4d6f-af2f-727f179fc238_1122x660.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!nvEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26d3201-18ad-4d6f-af2f-727f179fc238_1122x660.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: Grab</figcaption></figure></div><p>This transformation covers four things:</p><p><strong>Kafka Stream 
Schema: </strong>The contract schema is translated into a schema reference format the Test Runner can parse.</p><p><strong>Kafka Stream Configuration: </strong>The Kafka stream is set up as a source for the Test Runner.</p><p><strong>Observability metadata: </strong>Contact information is turned into runtime configs for alerting and routing.</p><p><strong>Kafka Stream Semantic Test Rules: </strong>Human-readable semantic rules are transformed into an <strong>inverse SQL query</strong> that captures data violating the rules.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SeoF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00cfa6bd-52be-4143-a79d-caeb71276749_1972x1104.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SeoF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00cfa6bd-52be-4143-a79d-caeb71276749_1972x1104.jpeg 424w, https://substackcdn.com/image/fetch/$s_!SeoF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00cfa6bd-52be-4143-a79d-caeb71276749_1972x1104.jpeg 848w, https://substackcdn.com/image/fetch/$s_!SeoF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00cfa6bd-52be-4143-a79d-caeb71276749_1972x1104.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!SeoF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00cfa6bd-52be-4143-a79d-caeb71276749_1972x1104.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SeoF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00cfa6bd-52be-4143-a79d-caeb71276749_1972x1104.jpeg" width="1456" height="815" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00cfa6bd-52be-4143-a79d-caeb71276749_1972x1104.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:213548,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/183755897?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00cfa6bd-52be-4143-a79d-caeb71276749_1972x1104.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SeoF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00cfa6bd-52be-4143-a79d-caeb71276749_1972x1104.jpeg 424w, https://substackcdn.com/image/fetch/$s_!SeoF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00cfa6bd-52be-4143-a79d-caeb71276749_1972x1104.jpeg 848w, https://substackcdn.com/image/fetch/$s_!SeoF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00cfa6bd-52be-4143-a79d-caeb71276749_1972x1104.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!SeoF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00cfa6bd-52be-4143-a79d-caeb71276749_1972x1104.jpeg 
1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Illustration of semantic test rules being converted from human-readable formats into inverse SQL queries (Source: Grab)</figcaption></figure></div><p>&#8216;Inverse SQL&#8217; here means the query is designed to return the <em>bad rows</em>, not the good ones. 
That&#8217;s a smart design choice because it keeps the output focused on what needs investigation.</p><div><hr></div><h4>Test execution &amp; data quality issue identification</h4><p>Once the transformation engine generates the configuration, the platform automatically deploys the Test Runner.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y-bs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd8212e-0700-4411-be3d-384ee376c7ec_1010x734.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y-bs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd8212e-0700-4411-be3d-384ee376c7ec_1010x734.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Y-bs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd8212e-0700-4411-be3d-384ee376c7ec_1010x734.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Y-bs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd8212e-0700-4411-be3d-384ee376c7ec_1010x734.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Y-bs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd8212e-0700-4411-be3d-384ee376c7ec_1010x734.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y-bs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd8212e-0700-4411-be3d-384ee376c7ec_1010x734.jpeg" width="1010" height="734" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbd8212e-0700-4411-be3d-384ee376c7ec_1010x734.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:734,&quot;width&quot;:1010,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:96110,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/183755897?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf8dc273-8996-4ec1-a825-41a85d232746_1010x734.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y-bs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd8212e-0700-4411-be3d-384ee376c7ec_1010x734.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Y-bs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd8212e-0700-4411-be3d-384ee376c7ec_1010x734.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Y-bs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd8212e-0700-4411-be3d-384ee376c7ec_1010x734.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Y-bs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbd8212e-0700-4411-be3d-384ee376c7ec_1010x734.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: Grab</figcaption></figure></div><p><strong>Test runner</strong></p><p>The Test Runner uses FlinkSQL as its compute engine. FlinkSQL was chosen because rules can be expressed as plain SQL statements, which also makes it easier for the platform to convert contracts into enforceable checks.</p><p><strong>Test execution workflow and problematic data identification</strong></p><p>The Test Runner executes the test and identifies problematic data in four steps:</p><ol><li><p><strong>Consume Kafka data (Flow 2.2)</strong><br>FlinkSQL consumes data from the Kafka topic under test using its own consumer group. 
This matters because a separate consumer group tracks its own offsets, so the test does not interfere with other consumers.</p></li><li><p><strong>Run inverse SQL (Flow 2.3)</strong><br>The Test Runner runs the inverse SQL query to identify:</p><ul><li><p>data that violates semantic rules</p></li><li><p>data that is syntactically incorrect in the first place</p></li></ul></li><li><p><strong>Publish data quality issue events (Flow 3.2)</strong><br>When bad data is found, the Test Runner packages it into a data quality issue event enriched with:</p><ul><li><p>a test summary</p></li><li><p>total count of bad records</p></li><li><p>sample bad data</p></li></ul><p>Then it publishes the event to a dedicated Kafka topic.</p></li><li><p><strong>Sink events to S3 (Flow 3.1)</strong><br>The platform also sinks all data quality events to an AWS S3 bucket for deeper observability and analysis.</p></li></ol><p>This combo (Kafka for real-time events, S3 for deeper inspection) gives both fast alerting and a more durable store for later analysis.</p><div><hr></div><h4>Result observability</h4><p>Grab&#8217;s in-house data quality observability platform, Genchi, consumes the problematic data captured by the Test Runner (Flow 3.3).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n2A8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b376a29-23b4-420d-b72c-def7101963c5_838x618.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n2A8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b376a29-23b4-420d-b72c-def7101963c5_838x618.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!n2A8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b376a29-23b4-420d-b72c-def7101963c5_838x618.jpeg 848w, https://substackcdn.com/image/fetch/$s_!n2A8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b376a29-23b4-420d-b72c-def7101963c5_838x618.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!n2A8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b376a29-23b4-420d-b72c-def7101963c5_838x618.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n2A8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b376a29-23b4-420d-b72c-def7101963c5_838x618.jpeg" width="838" height="618" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b376a29-23b4-420d-b72c-def7101963c5_838x618.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:618,&quot;width&quot;:838,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64056,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/183755897?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b376a29-23b4-420d-b72c-def7101963c5_838x618.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n2A8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b376a29-23b4-420d-b72c-def7101963c5_838x618.jpeg 
424w, https://substackcdn.com/image/fetch/$s_!n2A8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b376a29-23b4-420d-b72c-def7101963c5_838x618.jpeg 848w, https://substackcdn.com/image/fetch/$s_!n2A8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b376a29-23b4-420d-b72c-def7101963c5_838x618.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!n2A8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b376a29-23b4-420d-b72c-def7101963c5_838x618.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: Grab</figcaption></figure></div><p><strong>Alerting</strong></p><p>Genchi sends Slack notifications to stream owners listed in the contract&#8217;s observability metadata (Flow 3.5).</p><p>Those notifications include useful debugging context such as:</p><ul><li><p>links to sample data in the Coban UI</p></li><li><p>observed time windows</p></li><li><p>counts of bad records</p></li><li><p>other relevant details</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!avzo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff156f000-5325-4c18-b58c-1987c5cac707_1314x478.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!avzo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff156f000-5325-4c18-b58c-1987c5cac707_1314x478.png 424w, https://substackcdn.com/image/fetch/$s_!avzo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff156f000-5325-4c18-b58c-1987c5cac707_1314x478.png 848w, https://substackcdn.com/image/fetch/$s_!avzo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff156f000-5325-4c18-b58c-1987c5cac707_1314x478.png 1272w, https://substackcdn.com/image/fetch/$s_!avzo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff156f000-5325-4c18-b58c-1987c5cac707_1314x478.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!avzo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff156f000-5325-4c18-b58c-1987c5cac707_1314x478.png" width="1314" height="478" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f156f000-5325-4c18-b58c-1987c5cac707_1314x478.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:478,&quot;width&quot;:1314,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:104689,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/183755897?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff156f000-5325-4c18-b58c-1987c5cac707_1314x478.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!avzo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff156f000-5325-4c18-b58c-1987c5cac707_1314x478.png 424w, https://substackcdn.com/image/fetch/$s_!avzo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff156f000-5325-4c18-b58c-1987c5cac707_1314x478.png 848w, https://substackcdn.com/image/fetch/$s_!avzo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff156f000-5325-4c18-b58c-1987c5cac707_1314x478.png 1272w, https://substackcdn.com/image/fetch/$s_!avzo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff156f000-5325-4c18-b58c-1987c5cac707_1314x478.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Sample Slack notifications (Source: Grab)</figcaption></figure></div><p>The key point is that alerts do not just say &#8216;something broke&#8217;; they include the information you need to start investigating.</p><p><strong>Observability</strong></p><p>Users can access the Coban UI (Flow 3.4) to see:</p><ul><li><p>Kafka stream test rules</p></li><li><p>sample bad records</p></li><li><p>highlighted fields and values that violate rules</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!iqrn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9e3d9b-5030-4a02-b677-8bd277354196_1551x486.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iqrn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9e3d9b-5030-4a02-b677-8bd277354196_1551x486.jpeg 424w, https://substackcdn.com/image/fetch/$s_!iqrn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9e3d9b-5030-4a02-b677-8bd277354196_1551x486.jpeg 848w, https://substackcdn.com/image/fetch/$s_!iqrn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9e3d9b-5030-4a02-b677-8bd277354196_1551x486.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!iqrn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9e3d9b-5030-4a02-b677-8bd277354196_1551x486.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iqrn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9e3d9b-5030-4a02-b677-8bd277354196_1551x486.jpeg" width="1456" height="456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d9e3d9b-5030-4a02-b677-8bd277354196_1551x486.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:108090,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/183755897?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9e3d9b-5030-4a02-b677-8bd277354196_1551x486.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iqrn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9e3d9b-5030-4a02-b677-8bd277354196_1551x486.jpeg 424w, https://substackcdn.com/image/fetch/$s_!iqrn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9e3d9b-5030-4a02-b677-8bd277354196_1551x486.jpeg 848w, https://substackcdn.com/image/fetch/$s_!iqrn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9e3d9b-5030-4a02-b677-8bd277354196_1551x486.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!iqrn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9e3d9b-5030-4a02-b677-8bd277354196_1551x486.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The highlighted fields indicate violations of the semantic test rules (Source: Grab)</figcaption></figure></div><p>That UI piece matters because it shortens the path from &#8216;alert received&#8217; to &#8216;I know which field is failing and what the bad values look like.&#8217;</p><div><hr></div><h4>Results so far</h4><p>Since its deployment earlier in the year, the solution has enabled Kafka stream users to:</p><ul><li><p>define contracts with both schema and semantic rules</p></li><li><p>automate real-time test execution</p></li><li><p>alert stakeholders when problematic data is detected so they can act quickly</p></li></ul><p>It has been actively monitoring data quality across <strong>100+ critical Kafka topics</strong>.</p><p>The solution also offers the capability to immediately identify and halt the propagation of 
invalid data across multiple streams.</p><div><hr></div><h4>Wrapping up</h4><p>Grab implemented and rolled out a real-time data quality monitoring solution for Kafka streams through the Coban platform.</p><p>The key outcomes include:</p><ul><li><p>engineers can define syntactic and semantic tests through a data contract</p></li><li><p>tests run automatically in real time via a long-running Test Runner based on FlinkSQL</p></li><li><p>issues trigger fast Slack alerts through Genchi using ownership metadata in the contract</p></li><li><p>teams get better visibility into exactly which data fields violate rules via the Coban UI</p></li></ul><p>In short: Coban turned data quality from a vague hope into something stream owners can specify, enforce and observe in real time.</p><div><hr></div><h3>The full scoop</h3><p>To learn more about this, check out <a href="https://engineering.grab.com/real-time-data-quality-monitoring">Grab's Engineering Blog</a> post on this topic.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">If you liked this post and don&#8217;t want to miss the next one, subscribe to Data Tinkerer!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>If you are already subscribed and enjoyed the article, please give it a like and/or share it with others. I really appreciate it &#128591;</p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/p/how-grab-detects-data-issues-across-100-kafka-topics?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.datatinkerer.io/p/how-grab-detects-data-issues-across-100-kafka-topics?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h3>Keep learning</h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;2b5e61e3-2de5-4088-981d-80de61411bd4&quot;,&quot;caption&quot;:&quot;Uber rebuilt its data lake ingestion to move freshness from hours to minutes.<br /><br />This piece breaks down how they replaced batch Spark jobs with Flink streaming, cut compute by 25% and dealt with the very real problems that show up at petabyte scale.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How Uber Cut Data Lake Freshness From Hours to Minutes With Flink&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:291590442,&quot;name&quot;:&quot;Data Tinkerer&quot;,&quot;bio&quot;:&quot;Ex-head of analytics sharing deep dives and learnings about AI and all things data (science, engineering, 
analysis)&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83d3bbe0-5fb8-4f8d-9b74-036abbd6fec9_500x500.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-01-02T04:30:31.300Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1615929362369-4af8423ebd68?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNHx8dWJlcnxlbnwwfHx8fDE3NjcwNzc2NDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.datatinkerer.io/p/how-uber-cut-data-lake-freshness-from-hours-to-minutes-with-flink&quot;,&quot;section_name&quot;:&quot;Data Engineering&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:182833470,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:17,&quot;comment_count&quot;:1,&quot;publication_id&quot;:3422740,&quot;publication_name&quot;:&quot;Data Tinkerer&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!JEdj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bea5ccd-f356-4154-a1db-16268300510e_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;1904c23f-5462-4150-9c60-b6ad712234b6&quot;,&quot;caption&quot;:&quot;How do you keep ML teams fast when every experiment blasts your Spark cluster with spiky workloads, huge datasets and five different file formats?<br /><br />Snap&#8217;s answer: Prism, a unified Spark platform that hides infra pain, standardises pipelines and supports everything from ad-hoc exploration to 10k+ daily jobs in production.<br /><br />This post breaks down why raw Spark wasn&#8217;t enough, what Prism fixes and how Snap rebuilt their ML data layer without 
ditching Spark.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How Snap Rebuilt Its ML Platform to Handle 10,000+ Daily Spark Jobs&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:291590442,&quot;name&quot;:&quot;Data Tinkerer&quot;,&quot;bio&quot;:&quot;Ex-head of analytics sharing deep dives and learnings about AI and all things data (science, engineering, analysis)&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83d3bbe0-5fb8-4f8d-9b74-036abbd6fec9_500x500.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-11-20T04:59:47.340Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1620396749162-d61bdb715793?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzbmFwY2hhdHxlbnwwfHx8fDE3NjM1Mjk0MDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.datatinkerer.io/p/how-snap-rebuilt-its-ml-platform-to-handle-10000-daily-spark-jobs&quot;,&quot;section_name&quot;:&quot;Data Engineering&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:179211962,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:9,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3422740,&quot;publication_name&quot;:&quot;Data Tinkerer&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!JEdj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bea5ccd-f356-4154-a1db-16268300510e_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div>]]></content:encoded></item><item><title><![CDATA[How Uber Cut Data Lake Freshness From Hours to Minutes With 
Flink]]></title><description><![CDATA[Why Uber moved ingestion from Spark batch to Flink streaming and what it took to run thousands of jobs reliably at petabyte scale.]]></description><link>https://www.datatinkerer.io/p/how-uber-cut-data-lake-freshness-from-hours-to-minutes-with-flink</link><guid isPermaLink="false">https://www.datatinkerer.io/p/how-uber-cut-data-lake-freshness-from-hours-to-minutes-with-flink</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Fri, 02 Jan 2026 04:30:31 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1615929362369-4af8423ebd68?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNHx8dWJlcnxlbnwwfHx8fDE3NjcwNzc2NDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fellow Data Tinkerers!</p><p>Today we will look at how Uber moved from batch to streaming in their data lake.</p><p>But before that, I wanted to share with you what you could unlock if you share Data Tinkerer with just <strong>1 more person</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!05-P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a2affdf-e44f-452e-b02a-93809ad29f60_800x402.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!05-P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a2affdf-e44f-452e-b02a-93809ad29f60_800x402.gif 424w, https://substackcdn.com/image/fetch/$s_!05-P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a2affdf-e44f-452e-b02a-93809ad29f60_800x402.gif 848w, 
https://substackcdn.com/image/fetch/$s_!05-P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a2affdf-e44f-452e-b02a-93809ad29f60_800x402.gif 1272w, https://substackcdn.com/image/fetch/$s_!05-P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a2affdf-e44f-452e-b02a-93809ad29f60_800x402.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!05-P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a2affdf-e44f-452e-b02a-93809ad29f60_800x402.gif" width="800" height="402" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a2affdf-e44f-452e-b02a-93809ad29f60_800x402.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:402,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3369150,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/182833470?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a2affdf-e44f-452e-b02a-93809ad29f60_800x402.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!05-P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a2affdf-e44f-452e-b02a-93809ad29f60_800x402.gif 424w, https://substackcdn.com/image/fetch/$s_!05-P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a2affdf-e44f-452e-b02a-93809ad29f60_800x402.gif 848w, 
https://substackcdn.com/image/fetch/$s_!05-P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a2affdf-e44f-452e-b02a-93809ad29f60_800x402.gif 1272w, https://substackcdn.com/image/fetch/$s_!05-P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a2affdf-e44f-452e-b02a-93809ad29f60_800x402.gif 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There are 100+ resources to learn all things data (science, engineering, analysis). 
It includes videos, courses and projects, and can be filtered by tech stack (Python, SQL, Spark, etc.), skill level (Beginner, Intermediate and so on), provider name or free/paid. So if you know other people who like staying up to date on all things data, please share Data Tinkerer with them!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/leaderboard?&amp;utm_source=post&quot;,&quot;text&quot;:&quot;Refer a friend&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.datatinkerer.io/leaderboard?&amp;utm_source=post"><span>Refer a friend</span></a></p><p>Now, with that out of the way, let&#8217;s get to Uber&#8217;s streaming solution.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1615929362369-4af8423ebd68?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNHx8dWJlcnxlbnwwfHx8fDE3NjcwNzc2NDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1615929362369-4af8423ebd68?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNHx8dWJlcnxlbnwwfHx8fDE3NjcwNzc2NDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1615929362369-4af8423ebd68?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNHx8dWJlcnxlbnwwfHx8fDE3NjcwNzc2NDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1615929362369-4af8423ebd68?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNHx8dWJlcnxlbnwwfHx8fDE3NjcwNzc2NDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1272w, 
https://images.unsplash.com/photo-1615929362369-4af8423ebd68?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNHx8dWJlcnxlbnwwfHx8fDE3NjcwNzc2NDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1615929362369-4af8423ebd68?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNHx8dWJlcnxlbnwwfHx8fDE3NjcwNzc2NDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" width="4000" height="6000" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1615929362369-4af8423ebd68?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNHx8dWJlcnxlbnwwfHx8fDE3NjcwNzc2NDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:6000,&quot;width&quot;:4000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;person holding black iphone 5&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="person holding black iphone 5" title="person holding black iphone 5" srcset="https://images.unsplash.com/photo-1615929362369-4af8423ebd68?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNHx8dWJlcnxlbnwwfHx8fDE3NjcwNzc2NDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1615929362369-4af8423ebd68?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNHx8dWJlcnxlbnwwfHx8fDE3NjcwNzc2NDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 848w, 
https://images.unsplash.com/photo-1615929362369-4af8423ebd68?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNHx8dWJlcnxlbnwwfHx8fDE3NjcwNzc2NDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1615929362369-4af8423ebd68?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxNHx8dWJlcnxlbnwwfHx8fDE3NjcwNzc2NDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@tingeyinjurylawfirm">Tingey Injury Law Firm</a> on <a 
href="https://unsplash.com">Unsplash</a></figcaption></figure></div><h3>TL;DR</h3><div><hr></div><h4><strong>Situation</strong></h4><p>Batch-based ingestion meant data freshness lagged by hours to days, slowing experimentation, analytics and ML across Uber&#8217;s core business domains.</p><h4><strong>Task</strong></h4><p>Move ingestion to minutes-level freshness at petabyte scale while lowering compute cost and keeping operations reliable across thousands of datasets.</p><h4><strong>Action</strong></h4><p>Built IngestionNext using Flink streaming from Kafka to Hudi, plus a control plane for operating ingestion at scale. Solved streaming bottlenecks (small files, partition skew, checkpoint vs commit alignment) to keep performance and correctness intact.</p><h4><strong>Result</strong></h4><ul><li><p>Freshness improved from hours to <strong>minutes-level</strong>.</p></li><li><p>Compute usage dropped by <strong>~25%</strong> vs batch ingestion.</p></li><li><p>Compaction performance improved by <strong>~10x</strong> with row-group merging.</p></li></ul><h4><strong>Use Cases</strong></h4><p>Near-real-time analytics, personalisation, operational analytics</p><h4><strong>Tech Stack/Framework</strong></h4><p>Apache Spark, Apache Kafka, Apache Flink, Apache Hudi, Apache Parquet</p><div><hr></div><h3>Explained further</h3><div><hr></div><h4>Why data freshness became a platform priority at Uber</h4><p>Uber&#8217;s data lake sits underneath a lot of the company&#8217;s analytics and machine learning. If a team wants to measure an experiment, monitor performance, train a model or sanity-check a business change, it usually starts with one question: is the data in the lake yet?</p><p>Historically, ingestion into the lake was batch-based. Freshness was measured in hours, sometimes days. That was fine when decisions moved at daily-report speed. 
It starts to hurt when the business wants near-real-time loops: faster experiments, faster model iteration, faster detection of issues.</p><p>Over the past year, the team built and validated <strong>IngestionNext</strong>, a new ingestion system that switches the default mindset from batch to streaming. It&#8217;s centered on Apache Flink, reads events from Kafka, writes to the data lake in Apache Hudi format and operates at petabyte scale. Along the way, they had to solve the stuff that makes streaming annoying in practice: small files, partition skew, checkpoint vs commit alignment and the operational problem of running thousands of jobs reliably.</p><div><hr></div><h4><strong>Why batch ingestion became a bottleneck</strong></h4><p>Two main reasons: <strong>freshness</strong> and <strong>efficiency</strong>.</p><p><strong>Freshness</strong></p><p>As the business sped up, teams across Delivery, Rider, Mobility, Finance and Marketing Analytics kept asking the same thing: &#8220;Can we get the data sooner?&#8221;</p><p>Batch ingestion creates delays measured in hours and sometimes days. That lag slows down iteration and decision-making. In a world of continuous experimentation and fast model cycles, hours of latency is basically a tax on everything.</p><p>By moving ingestion to Flink-based streaming, the team cut data latency from hours to minutes. That directly supports faster model launches, quicker experiments and more accurate analytics because the lake stays closer to what&#8217;s happening now.</p><p><strong>Efficiency</strong></p><p>Batch ingestion with Apache Spark is heavy by nature. Jobs run on a schedule, kick off distributed work at fixed intervals and keep doing that even when the workload is uneven. At Uber&#8217;s scale, with thousands of datasets and hundreds of petabytes, that adds up to hundreds of thousands of CPU cores running daily.</p><p>Streaming smooths this out. 
Instead of repeatedly spinning up large batch work, resources can scale with traffic in a more continuous way. Less overhead from scheduling, less big bang compute and more efficient usage overall.</p><div><hr></div><h4><strong>IngestionNext: A streaming ingestion platform built for scale</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AYPR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f26c0c-b910-4450-916a-c2e372d9c9fa_768x349.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AYPR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f26c0c-b910-4450-916a-c2e372d9c9fa_768x349.avif 424w, https://substackcdn.com/image/fetch/$s_!AYPR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f26c0c-b910-4450-916a-c2e372d9c9fa_768x349.avif 848w, https://substackcdn.com/image/fetch/$s_!AYPR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f26c0c-b910-4450-916a-c2e372d9c9fa_768x349.avif 1272w, https://substackcdn.com/image/fetch/$s_!AYPR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f26c0c-b910-4450-916a-c2e372d9c9fa_768x349.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AYPR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f26c0c-b910-4450-916a-c2e372d9c9fa_768x349.avif" width="768" height="349" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7f26c0c-b910-4450-916a-c2e372d9c9fa_768x349.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:349,&quot;width&quot;:768,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:15051,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/182833470?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f26c0c-b910-4450-916a-c2e372d9c9fa_768x349.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AYPR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f26c0c-b910-4450-916a-c2e372d9c9fa_768x349.avif 424w, https://substackcdn.com/image/fetch/$s_!AYPR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f26c0c-b910-4450-916a-c2e372d9c9fa_768x349.avif 848w, https://substackcdn.com/image/fetch/$s_!AYPR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f26c0c-b910-4450-916a-c2e372d9c9fa_768x349.avif 1272w, https://substackcdn.com/image/fetch/$s_!AYPR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f26c0c-b910-4450-916a-c2e372d9c9fa_768x349.avif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">IngestionNext architecture (Source: Uber)</figcaption></figure></div><p>In the data plane, events land in Apache Kafka. Flink jobs consume those events and write them into the data lake using Apache Hudi. Hudi provides transactional behavior like commits, rollbacks and time travel. Freshness and completeness are measured end-to-end from source to sink, not just &#8220;did the job run.&#8221;</p><p>Operating ingestion at this scale is not a set-it-and-forget-it situation. So the team built a control plane focused on automation and safety. It manages the ingestion job lifecycle (create, deploy, restart, stop, delete), handles config changes and runs health verification. 
The goal is simple: run thousands of ingestion jobs consistently without turning the platform into a giant manual babysitting exercise.</p><p>The system also supports regional failover and fallback strategies. If there&#8217;s an outage, ingestion can shift across regions. If needed, jobs can temporarily fall back to batch mode so ingestion stays available and data is not lost.</p><div><hr></div><h4><strong>Solving the hard parts of streaming ingestion</strong></h4><p>Streaming buys freshness but it also introduces new failure modes. The team highlighted three major ones: <strong>small files</strong>, <strong>partition skew</strong> and <strong>checkpoint/commit synchronization</strong>.</p><p><strong>Small files</strong></p><p>Streaming writes data continuously. That tends to create lots of small Parquet files. Small files are a classic way to make query performance worse while also increasing metadata and storage overhead. You get fresher data, then you pay for it every time someone queries.</p><p>The common compaction approach merges Parquet files record by record. That means each file gets decompressed, decoded from columnar format into rows, merged, then encoded and compressed again. 
It works but it&#8217;s expensive and slow because you keep doing encode/decode work over and over.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!25HV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8bb140-efab-484a-be52-7a93cf9214ba_768x527.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!25HV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8bb140-efab-484a-be52-7a93cf9214ba_768x527.avif 424w, https://substackcdn.com/image/fetch/$s_!25HV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8bb140-efab-484a-be52-7a93cf9214ba_768x527.avif 848w, https://substackcdn.com/image/fetch/$s_!25HV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8bb140-efab-484a-be52-7a93cf9214ba_768x527.avif 1272w, https://substackcdn.com/image/fetch/$s_!25HV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8bb140-efab-484a-be52-7a93cf9214ba_768x527.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!25HV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8bb140-efab-484a-be52-7a93cf9214ba_768x527.avif" width="768" height="527" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c8bb140-efab-484a-be52-7a93cf9214ba_768x527.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:527,&quot;width&quot;:768,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31057,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/182833470?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8bb140-efab-484a-be52-7a93cf9214ba_768x527.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!25HV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8bb140-efab-484a-be52-7a93cf9214ba_768x527.avif 424w, https://substackcdn.com/image/fetch/$s_!25HV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8bb140-efab-484a-be52-7a93cf9214ba_768x527.avif 848w, https://substackcdn.com/image/fetch/$s_!25HV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8bb140-efab-484a-be52-7a93cf9214ba_768x527.avif 1272w, https://substackcdn.com/image/fetch/$s_!25HV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8bb140-efab-484a-be52-7a93cf9214ba_768x527.avif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Parquet file merging row by row (Source: Uber)</figcaption></figure></div><p>To fix this, the team introduced row-group-level merging. Instead of dropping down into row format, the merge operates directly on Parquet&#8217;s native columnar structure. 
That avoids the expensive recompression path and improves compaction performance by more than an order of magnitude, around 10x.</p><p>There are open-source efforts exploring schema-evolution-aware merging using padding and masking to align schemas but that comes with added implementation complexity and maintenance risk.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eGhg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2dfa45-17b0-4679-9e59-439fe164f2d7_768x347.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eGhg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2dfa45-17b0-4679-9e59-439fe164f2d7_768x347.avif 424w, https://substackcdn.com/image/fetch/$s_!eGhg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2dfa45-17b0-4679-9e59-439fe164f2d7_768x347.avif 848w, https://substackcdn.com/image/fetch/$s_!eGhg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2dfa45-17b0-4679-9e59-439fe164f2d7_768x347.avif 1272w, https://substackcdn.com/image/fetch/$s_!eGhg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2dfa45-17b0-4679-9e59-439fe164f2d7_768x347.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eGhg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2dfa45-17b0-4679-9e59-439fe164f2d7_768x347.avif" width="768" height="347" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe2dfa45-17b0-4679-9e59-439fe164f2d7_768x347.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:347,&quot;width&quot;:768,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5298,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/182833470?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2dfa45-17b0-4679-9e59-439fe164f2d7_768x347.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eGhg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2dfa45-17b0-4679-9e59-439fe164f2d7_768x347.avif 424w, https://substackcdn.com/image/fetch/$s_!eGhg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2dfa45-17b0-4679-9e59-439fe164f2d7_768x347.avif 848w, https://substackcdn.com/image/fetch/$s_!eGhg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2dfa45-17b0-4679-9e59-439fe164f2d7_768x347.avif 1272w, https://substackcdn.com/image/fetch/$s_!eGhg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2dfa45-17b0-4679-9e59-439fe164f2d7_768x347.avif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Row-group merging with data masking (Source: Uber)</figcaption></figure></div><p>So the team took a simpler path: enforce schema consistency during merging. Only files with identical schema are merged together. 
No masking, no low-level code modifications, less engineering overhead and still faster, more efficient and more reliable compaction.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vAV6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3c4bbe0-5d7e-4e7e-ac99-706f52a1f0e9_768x375.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vAV6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3c4bbe0-5d7e-4e7e-ac99-706f52a1f0e9_768x375.avif 424w, https://substackcdn.com/image/fetch/$s_!vAV6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3c4bbe0-5d7e-4e7e-ac99-706f52a1f0e9_768x375.avif 848w, https://substackcdn.com/image/fetch/$s_!vAV6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3c4bbe0-5d7e-4e7e-ac99-706f52a1f0e9_768x375.avif 1272w, https://substackcdn.com/image/fetch/$s_!vAV6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3c4bbe0-5d7e-4e7e-ac99-706f52a1f0e9_768x375.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vAV6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3c4bbe0-5d7e-4e7e-ac99-706f52a1f0e9_768x375.avif" width="768" height="375" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3c4bbe0-5d7e-4e7e-ac99-706f52a1f0e9_768x375.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:375,&quot;width&quot;:768,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6507,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/182833470?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3c4bbe0-5d7e-4e7e-ac99-706f52a1f0e9_768x375.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vAV6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3c4bbe0-5d7e-4e7e-ac99-706f52a1f0e9_768x375.avif 424w, https://substackcdn.com/image/fetch/$s_!vAV6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3c4bbe0-5d7e-4e7e-ac99-706f52a1f0e9_768x375.avif 848w, https://substackcdn.com/image/fetch/$s_!vAV6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3c4bbe0-5d7e-4e7e-ac99-706f52a1f0e9_768x375.avif 1272w, https://substackcdn.com/image/fetch/$s_!vAV6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3c4bbe0-5d7e-4e7e-ac99-706f52a1f0e9_768x375.avif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Simplified row-group merging by grouping schemas (Source: Uber)</figcaption></figure></div><p><strong>Partition skew</strong></p><p>Streaming ingestion depends on steady consumption from Kafka across Flink subtasks. The messy reality is that short-lived downstream slowdowns, like garbage collection pauses, can unbalance consumption. Some partitions get read more than others. You end up with skew.</p><p>Skew doesn&#8217;t just look ugly on a dashboard. 
It can reduce compression efficiency and lead to slower queries downstream.</p><p>The fixes came from three angles:</p><ul><li><p><strong>Operational tuning:</strong> aligning Flink parallelism with Kafka partitions and adjusting fetch parameters.</p></li><li><p><strong>Connector-level fairness:</strong> adding mechanisms like round-robin polling, pause/resume for heavy partitions and per-partition quotas.</p></li><li><p><strong>Observability:</strong> exposing per-partition lag metrics, adding skew-aware autoscaling and setting targeted alerts.</p></li></ul><p>This is a good reminder that streaming issues often show up first as weird lag and then become &#8220;why are queries slower now?&#8221; If you can&#8217;t see skew clearly, you&#8217;ll chase symptoms forever.</p><p><strong>Checkpoint and commit synchronization</strong></p><p>Flink and Hudi each track progress but they track different things.</p><ul><li><p><strong>Flink checkpoints</strong> track consumed offsets.</p></li><li><p><strong>Hudi commits</strong> track writes.</p></li></ul><p>If failures happen and these drift out of sync, the system can skip data or duplicate it. In ingestion, either outcome is a serious problem.</p><p>The team solved this by extending Hudi commit metadata to embed Flink checkpoint IDs. With that linkage, recovery becomes deterministic during rollbacks or failovers. 
The system can reason about which checkpoint corresponds to which commit and recover without guessing.</p><div><hr></div><h4><strong>Production results: faster data with lower cost</strong></h4><p>The team onboarded datasets to the Flink-based ingestion platform and validated performance on some of Uber&#8217;s largest datasets.</p><p>The early results:</p><ul><li><p><strong>Freshness:</strong> improved from hours to <strong>minutes-level freshness</strong>.</p></li><li><p><strong>Efficiency:</strong> <strong>25% reduction in compute usage</strong> compared to batch ingestion.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HbzO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e2433c-5ac8-447c-b7de-9562a68daebd_768x210.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HbzO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e2433c-5ac8-447c-b7de-9562a68daebd_768x210.avif 424w, https://substackcdn.com/image/fetch/$s_!HbzO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e2433c-5ac8-447c-b7de-9562a68daebd_768x210.avif 848w, https://substackcdn.com/image/fetch/$s_!HbzO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e2433c-5ac8-447c-b7de-9562a68daebd_768x210.avif 1272w, https://substackcdn.com/image/fetch/$s_!HbzO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e2433c-5ac8-447c-b7de-9562a68daebd_768x210.avif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!HbzO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e2433c-5ac8-447c-b7de-9562a68daebd_768x210.avif" width="768" height="210" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16e2433c-5ac8-447c-b7de-9562a68daebd_768x210.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:210,&quot;width&quot;:768,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7326,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/182833470?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e2433c-5ac8-447c-b7de-9562a68daebd_768x210.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HbzO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e2433c-5ac8-447c-b7de-9562a68daebd_768x210.avif 424w, https://substackcdn.com/image/fetch/$s_!HbzO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e2433c-5ac8-447c-b7de-9562a68daebd_768x210.avif 848w, https://substackcdn.com/image/fetch/$s_!HbzO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e2433c-5ac8-447c-b7de-9562a68daebd_768x210.avif 1272w, https://substackcdn.com/image/fetch/$s_!HbzO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e2433c-5ac8-447c-b7de-9562a68daebd_768x210.avif 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Before and after streaming ingestion (Source: Uber)</figcaption></figure></div><div><hr></div><h4><strong>Extending real-time beyond ingestion</strong></h4><p>IngestionNext improves ingestion latency from online Kafka into the offline raw data lake. That&#8217;s a big step but it&#8217;s not the full story.</p><p>Freshness still stalls downstream in transformation and analytics layers. If ingestion takes minutes but transformation is still slow, the data is still stale at the point of decision.</p><p>The next frontier for Uber is extending real-time capability end-to-end: <strong>ingestion &#8594; transformation &#8594; real-time insights and analytics</strong>. This matters because Uber&#8217;s lake powers a long list of domains: Delivery, Mobility, Machine Learning, Rider, Marketplace, Maps, Finance and Marketing Analytics. Freshness is a cross-cutting requirement.</p><div><hr></div><h4><strong>Conclusion</strong></h4><p>Uber&#8217;s shift from batch to streaming ingestion is a meaningful platform milestone. By re-architecting ingestion around Apache Flink, IngestionNext delivers fresher data, stronger reliability and scalable efficiency across a petabyte-scale lake.</p><p>The design is not just &#8220;run Flink jobs&#8221;. It includes operational foundations like an automated control plane, resiliency strategies and streaming-specific engineering work: faster compaction via row-group merging, skew controls and deterministic recovery by linking Flink checkpoints to Hudi commits.</p><p>The bigger idea is the mindset shift: treating freshness as a first-class dimension of data quality. 
With IngestionNext proven in production, the next push is clear: bring streaming into downstream transformation and analytics so the company can close the real-time loop, not just ingest data faster.</p><div><hr></div><h3>The full scoop</h3><p>To learn more about this, check out the <a href="https://www.uber.com/en-AU/blog/from-batch-to-streaming-accelerating-data-freshness-in-ubers-data-lake/">Uber's Engineering Blog</a> post on this topic.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">If you liked this post and don&#8217;t want to miss the next one, subscribe to Data Tinkerer!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>If you are already subscribed and enjoyed the article, please give it a like and/or share it with others. Really appreciate it &#128591;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/p/how-uber-cut-data-lake-freshness-from-hours-to-minutes-with-flink?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.datatinkerer.io/p/how-uber-cut-data-lake-freshness-from-hours-to-minutes-with-flink?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h3>Keep learning</h3><div 
class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;1904c23f-5462-4150-9c60-b6ad712234b6&quot;,&quot;caption&quot;:&quot;How do you keep ML teams fast when every experiment blasts your Spark cluster with spiky workloads, huge datasets and five different file formats?<br /><br />Snap&#8217;s answer: Prism, a unified Spark platform that hides infra pain, standardises pipelines and supports everything from ad-hoc exploration to 10k+ daily jobs in production.<br /><br />This post breaks down why raw Spark wasn&#8217;t enough, what Prism fixes and how Snap rebuilt their ML data layer without ditching Spark.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How Snap Rebuilt Its ML Platform to Handle 10,000+ Daily Spark Jobs&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:291590442,&quot;name&quot;:&quot;Data Tinkerer&quot;,&quot;bio&quot;:&quot;Ex-head of analytics sharing deep dives and learnings about AI and all things data (science, engineering, analysis)&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83d3bbe0-5fb8-4f8d-9b74-036abbd6fec9_500x500.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-11-20T04:59:47.340Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1620396749162-d61bdb715793?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzbmFwY2hhdHxlbnwwfHx8fDE3NjM1Mjk0MDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.datatinkerer.io/p/how-snap-rebuilt-its-ml-platform-to-handle-10000-daily-spark-jobs&quot;,&quot;section_name&quot;:&quot;Data 
Engineering&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:179211962,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:9,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3422740,&quot;publication_name&quot;:&quot;Data Tinkerer&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!JEdj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bea5ccd-f356-4154-a1db-16268300510e_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;8b717c15-913f-4e54-91a7-fb3f26e15721&quot;,&quot;caption&quot;:&quot;How do you keep data fresh for millions of merchants when you&#8217;re streaming from 100+ MySQL shards?<br /><br />Shopify&#8217;s answer: a 400TB Change Data Capture platform that pushes up to 100k events a second.<br /><br />This post dives into the trade-offs, the challenges and the lessons learned from building CDC at scale.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How Shopify Uses Change Data Capture to Serve Millions of Merchants&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:291590442,&quot;name&quot;:&quot;Data Tinkerer&quot;,&quot;bio&quot;:&quot;Ex-head of analytics sharing deep dives and learnings about AI and all things data (science, engineering, 
analysis)&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83d3bbe0-5fb8-4f8d-9b74-036abbd6fec9_500x500.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-09-18T07:53:42.206Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1730818874996-dea4bddf5554?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzaG9waWZ5fGVufDB8fHx8MTc1ODE4MDY0NHww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.datatinkerer.io/p/how-shopify-uses-change-data-capture-to-serve-millions-of-merchants&quot;,&quot;section_name&quot;:&quot;Data Engineering&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:173822667,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:10,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3422740,&quot;publication_name&quot;:&quot;Data Tinkerer&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!JEdj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bea5ccd-f356-4154-a1db-16268300510e_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div>]]></content:encoded></item><item><title><![CDATA[How Snap Rebuilt Its ML Platform to Handle 10,000+ Daily Spark Jobs]]></title><description><![CDATA[Inside Prism, the system that turned scattered Spark workflows into a unified, ML-ready platform.]]></description><link>https://www.datatinkerer.io/p/how-snap-rebuilt-its-ml-platform-to-handle-10000-daily-spark-jobs</link><guid isPermaLink="false">https://www.datatinkerer.io/p/how-snap-rebuilt-its-ml-platform-to-handle-10000-daily-spark-jobs</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Thu, 20 Nov 2025 04:59:47 GMT</pubDate><enclosure 
url="https://images.unsplash.com/photo-1620396749162-d61bdb715793?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzbmFwY2hhdHxlbnwwfHx8fDE3NjM1Mjk0MDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fellow Data Tinkerers!</p><p>Today we will look at how Snap unified Spark, ML workflows and 10k+ daily jobs under one platform.</p><p>But before that, I wanted to share with you what you could unlock if you share Data Tinkerer with just <strong>1 more person</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TBRd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F177f23c2-03d2-4ec5-bc67-33e5bb8b23fd_800x402.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TBRd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F177f23c2-03d2-4ec5-bc67-33e5bb8b23fd_800x402.gif 424w, https://substackcdn.com/image/fetch/$s_!TBRd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F177f23c2-03d2-4ec5-bc67-33e5bb8b23fd_800x402.gif 848w, https://substackcdn.com/image/fetch/$s_!TBRd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F177f23c2-03d2-4ec5-bc67-33e5bb8b23fd_800x402.gif 1272w, https://substackcdn.com/image/fetch/$s_!TBRd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F177f23c2-03d2-4ec5-bc67-33e5bb8b23fd_800x402.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!TBRd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F177f23c2-03d2-4ec5-bc67-33e5bb8b23fd_800x402.gif" width="800" height="402" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/177f23c2-03d2-4ec5-bc67-33e5bb8b23fd_800x402.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:402,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3369150,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/179211962?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F177f23c2-03d2-4ec5-bc67-33e5bb8b23fd_800x402.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TBRd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F177f23c2-03d2-4ec5-bc67-33e5bb8b23fd_800x402.gif 424w, https://substackcdn.com/image/fetch/$s_!TBRd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F177f23c2-03d2-4ec5-bc67-33e5bb8b23fd_800x402.gif 848w, https://substackcdn.com/image/fetch/$s_!TBRd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F177f23c2-03d2-4ec5-bc67-33e5bb8b23fd_800x402.gif 1272w, https://substackcdn.com/image/fetch/$s_!TBRd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F177f23c2-03d2-4ec5-bc67-33e5bb8b23fd_800x402.gif 1456w" sizes="100vw" 
fetchpriority="high"></picture></div></a></figure></div><p>There are 100+ resources to learn all things data (science, engineering, analysis). They include videos, courses and projects, and can be filtered by tech stack (Python, SQL, Spark, etc.), skill level (Beginner, Intermediate and so on), provider name or free/paid. 
So if you know other people who like staying up to date on all things data, please share Data Tinkerer with them!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/leaderboard?&amp;utm_source=post&quot;,&quot;text&quot;:&quot;Refer a friend&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.datatinkerer.io/leaderboard?&amp;utm_source=post"><span>Refer a friend</span></a></p><p>Now, with that out of the way, let&#8217;s get to Snap&#8217;s ML platform transformation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1620396749162-d61bdb715793?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzbmFwY2hhdHxlbnwwfHx8fDE3NjM1Mjk0MDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1620396749162-d61bdb715793?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzbmFwY2hhdHxlbnwwfHx8fDE3NjM1Mjk0MDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1620396749162-d61bdb715793?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzbmFwY2hhdHxlbnwwfHx8fDE3NjM1Mjk0MDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1620396749162-d61bdb715793?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzbmFwY2hhdHxlbnwwfHx8fDE3NjM1Mjk0MDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1272w, 
https://images.unsplash.com/photo-1620396749162-d61bdb715793?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzbmFwY2hhdHxlbnwwfHx8fDE3NjM1Mjk0MDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1620396749162-d61bdb715793?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzbmFwY2hhdHxlbnwwfHx8fDE3NjM1Mjk0MDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" width="4000" height="6000" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1620396749162-d61bdb715793?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzbmFwY2hhdHxlbnwwfHx8fDE3NjM1Mjk0MDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:6000,&quot;width&quot;:4000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;text&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="text" title="text" srcset="https://images.unsplash.com/photo-1620396749162-d61bdb715793?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzbmFwY2hhdHxlbnwwfHx8fDE3NjM1Mjk0MDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1620396749162-d61bdb715793?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzbmFwY2hhdHxlbnwwfHx8fDE3NjM1Mjk0MDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 848w, 
https://images.unsplash.com/photo-1620396749162-d61bdb715793?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzbmFwY2hhdHxlbnwwfHx8fDE3NjM1Mjk0MDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1620396749162-d61bdb715793?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzbmFwY2hhdHxlbnwwfHx8fDE3NjM1Mjk0MDZ8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@maygauthier">May Gauthier</a> on <a 
href="https://unsplash.com">Unsplash</a></figcaption></figure></div><h3>TL;DR</h3><div><hr></div><h4><strong>Situation</strong></h4><p>Snap&#8217;s ML teams relied on Apache Spark for analytics but raw Spark was painful for ML workflows: spiky, iterative training data jobs, multiple data formats, scattered tooling and heavy cluster babysitting for every experiment.</p><h4><strong>Task</strong></h4><p>They needed an ML-focused data platform on top of Spark that hid infra, handled diverse formats, supported fast experimentation through to stable production and gave a single, coherent experience for Spark users.</p><h4><strong>Action</strong></h4><p>They built Prism, a unified Spark platform with a UI and SDK, config-driven Prism Templates to define jobs in YAML, a control plane with Temporal workflows for cluster lifecycle, centralised metrics, autotuning and deep integration with Snap&#8217;s billing, orchestration and lakehouse tools.</p><h4><strong>Result</strong></h4><p>Prism grew from a handful of daily jobs to several thousand per day, with peaks over 10k, cut onboarding friction, standardised patterns, improved reliability and let ML engineers focus on experiments and models instead of Spark internals and cluster management.</p><h4><strong>Use Cases</strong></h4><p>Feature engineering, batch model pipelines, lakehouse ingestion, experiment workflows</p><h4><strong>Tech Stack/Framework</strong></h4><p>Apache Spark, Apache Iceberg, Dataproc, Apache Airflow, Kubeflow, Apache Parquet, Trino</p><div><hr></div><h3>Explained further</h3><div><hr></div><h4>Context</h4><p>Apache Spark has been a core part of Snap&#8217;s analytical stack for a long time. It runs the pipelines that feed reports, dashboards and batch data products. For that world, Spark is a good fit.</p><p>Machine learning puts new pressure on that foundation.</p><p><strong>ML development is inherently iterative</strong>. 
An engineer can spend a week trying variations of the same broad idea: a different label definition, a refined feature set, a new way to slice users, a new preprocessing recipe. Each iteration often means regenerating training data from very large raw sources. Doing that repeatedly is not a nice, predictable nightly job. It is a series of intense, sometimes spiky workloads that hit the platform whenever someone has another idea.</p><p><strong>The development lifecycle is also more fluid.</strong> Early in a project, ML engineers want freedom. They want to pull ad hoc samples, tweak schemas on the fly, and see results quickly. Once the same model is ready for production, the expectations flip. Pipelines need to be stable, observable and efficient on real traffic. The platform has to support both modes without asking people to throw away all their early work and start again from scratch.</p><p><strong>Then there is the question of formats.</strong> ML workloads do not live in a single file format. Snap&#8217;s teams use:</p><ul><li><p>TFRecord when they are feeding TensorFlow</p></li><li><p>Protobuf when they are working with gRPC-based serving systems</p></li><li><p>JSON for lightweight exploration and simple tests</p></li><li><p>Parquet and Iceberg for analytical and lakehouse-style storage</p></li></ul><p>Forcing everything through a single &#8220;blessed&#8221; format would only slow teams down. A realistic platform needs to work comfortably across all of these.</p><div><hr></div><h4><strong>Where raw Spark started to hurt ML teams</strong></h4><p>Spark itself is not the weak link. It is powerful, battle-tested and extremely good at scaling SQL and batch workloads. 
The problem is not capability; it is usability for ML engineers.</p><p>Without the right abstractions:</p><ul><li><p>Engineers need to understand Spark internals and distributed systems just to write reasonable jobs.</p></li><li><p>They rebuild the same boilerplate, such as data validation or common preprocessing, in every team.</p></li><li><p>They manage clusters, dependencies, and upgrades themselves.</p></li><li><p>They spend time in Spark UI and logs chasing down failures that add no value to the model itself.</p></li></ul><p>All of this is on top of their actual core stack: TensorFlow or PyTorch, notebooks, experiment tracking tools, workflow systems like Kubeflow and internal ML platforms. Spark is an ingredient they need for scale, not the centre of their role. When the support around it is thin, that ingredient becomes a constant source of overhead.</p><p>Snap&#8217;s ML engineers wanted to spend their time on experiments and models, not on reverse engineering cluster failures. That is the capability gap the team set out to close.</p><div><hr></div><h4><strong>What Snap wanted from an ML data platform</strong></h4><p>The Snap team set a clear goal: build an ML-focused data platform on top of Spark that feels consistent, friendly, and scalable, instead of &#8220;bare-metal Spark with some helpers&#8221;.</p><p>The platform should let ML engineers:</p><ul><li><p>Describe what data they need instead of hand-coding how to compute it</p></li><li><p>Iterate quickly without piles of glue code</p></li><li><p>Reuse proven, secure patterns for data processing</p></li><li><p>Spend their energy on model and product logic instead of infrastructure</p></li></ul><p>That means doing the heavy lifting in one place: infrastructure abstraction, shared patterns, observability and integration with Snap&#8217;s internal ecosystem for metrics, billing, cost tracking and scheduling.</p><p>With that foundation in place, ML development becomes faster and more consistent, and the platform team can invest in 
shared improvements instead of firefighting one-off jobs.</p><div><hr></div><h4><strong>Boiling the problem down</strong></h4><p>So the problem is not &#8220;Spark is bad for ML&#8221;. The problem is that raw Spark is too low level for the way ML teams actually work.</p><p>What Snap&#8217;s team built with Prism is a layer on top of Spark that:</p><ul><li><p>Hides cluster-level pain</p></li><li><p>Standardizes job patterns</p></li><li><p>Bakes in observability and cost awareness</p></li><li><p>Fits naturally into ML workflows rather than generic analytics</p></li></ul><p>Prism is the answer built around those constraints. It keeps Spark, but wraps it in a set of tools that match how ML teams at Snap actually work.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h-7U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca80c17-a2dd-4633-8155-6a21f96599cd_2588x1153.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h-7U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca80c17-a2dd-4633-8155-6a21f96599cd_2588x1153.avif 424w, https://substackcdn.com/image/fetch/$s_!h-7U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca80c17-a2dd-4633-8155-6a21f96599cd_2588x1153.avif 848w, https://substackcdn.com/image/fetch/$s_!h-7U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca80c17-a2dd-4633-8155-6a21f96599cd_2588x1153.avif 1272w, 
https://substackcdn.com/image/fetch/$s_!h-7U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca80c17-a2dd-4633-8155-6a21f96599cd_2588x1153.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h-7U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca80c17-a2dd-4633-8155-6a21f96599cd_2588x1153.avif" width="1456" height="649" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aca80c17-a2dd-4633-8155-6a21f96599cd_2588x1153.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:649,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90288,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/179211962?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca80c17-a2dd-4633-8155-6a21f96599cd_2588x1153.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h-7U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca80c17-a2dd-4633-8155-6a21f96599cd_2588x1153.avif 424w, https://substackcdn.com/image/fetch/$s_!h-7U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca80c17-a2dd-4633-8155-6a21f96599cd_2588x1153.avif 848w, 
https://substackcdn.com/image/fetch/$s_!h-7U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca80c17-a2dd-4633-8155-6a21f96599cd_2588x1153.avif 1272w, https://substackcdn.com/image/fetch/$s_!h-7U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca80c17-a2dd-4633-8155-6a21f96599cd_2588x1153.avif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The new solution requirements (source: Snap)</figcaption></figure></div><div><hr></div><p><strong>Meet Prism: 
Snap&#8217;s ML data platform on Spark</strong></p><p>Prism is Snap&#8217;s unified Spark platform. From the outside, it looks like one coherent system that handles job authoring, productionisation and post-production operations.</p><p>Instead of each team creating its own way of submitting Spark jobs and wiring up clusters, Prism offers a consistent experience. Engineers can define jobs in a configuration-driven way, work in a UI when they want a visual surface, or go through an SDK when they need more control. Underneath, the platform handles cluster lifecycle, resource management, metrics collection and cost accounting.</p><p>Prism also aims for a serverless feel. Users submit work and adjust configurations, while the system decides how to spin up clusters, scale them and shut them down. That does not remove Spark, but it changes how people interact with it.</p><p><strong>From experiment to production: the Prism user journey</strong></p><p>If you follow a typical workflow through Prism, you see three distinct phases.</p><p><strong>In pre-production, engineers are experimenting.</strong> They want to get a job off the ground quickly, test logic and refine their approach. Prism supports this by offering configuration-based templates and a UI where most of the setup is already done. Predefined profiles cover common use cases, which means teams are not spending their first week just tuning cluster settings.</p><p><strong>Once a job is ready to run regularly, it enters productionisation.</strong> At this point, Prism controls cluster setup, scaling and teardown through a unified API. Jobs can be tied into orchestration tools such as Airflow or Kubeflow without every team reinventing the wheel. 
Dashboards, metrics and metadata tracked by Prism give users a cleaner window into how their jobs behave.</p><p><strong>Post-production, attention moves to reliability and efficiency.</strong> Prism takes on this work by centralising monitoring and alerts, storing rich metrics and offering autotuning features that can recommend or apply improvements. Job costs are tracked, and infrastructure upgrades happen at the platform layer instead of per-team.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kdk9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38c2eae-4802-4a68-b3b3-2a77fcb112ec_1057x400.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kdk9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38c2eae-4802-4a68-b3b3-2a77fcb112ec_1057x400.avif 424w, https://substackcdn.com/image/fetch/$s_!Kdk9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38c2eae-4802-4a68-b3b3-2a77fcb112ec_1057x400.avif 848w, https://substackcdn.com/image/fetch/$s_!Kdk9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38c2eae-4802-4a68-b3b3-2a77fcb112ec_1057x400.avif 1272w, https://substackcdn.com/image/fetch/$s_!Kdk9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38c2eae-4802-4a68-b3b3-2a77fcb112ec_1057x400.avif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Kdk9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38c2eae-4802-4a68-b3b3-2a77fcb112ec_1057x400.avif" width="1057" height="400" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f38c2eae-4802-4a68-b3b3-2a77fcb112ec_1057x400.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:400,&quot;width&quot;:1057,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14912,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/179211962?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38c2eae-4802-4a68-b3b3-2a77fcb112ec_1057x400.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kdk9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38c2eae-4802-4a68-b3b3-2a77fcb112ec_1057x400.avif 424w, https://substackcdn.com/image/fetch/$s_!Kdk9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38c2eae-4802-4a68-b3b3-2a77fcb112ec_1057x400.avif 848w, https://substackcdn.com/image/fetch/$s_!Kdk9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38c2eae-4802-4a68-b3b3-2a77fcb112ec_1057x400.avif 1272w, https://substackcdn.com/image/fetch/$s_!Kdk9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38c2eae-4802-4a68-b3b3-2a77fcb112ec_1057x400.avif 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Prism components (Source: Snap)</figcaption></figure></div><div><hr></div><h4><strong>Prism architecture</strong></h4><p>Prism&#8217;s architecture is organised around two main user-facing interfaces and a set of internal systems.</p><p>For users, the first touchpoint is the Prism UI, a console where they can author jobs, inspect runs, debug failures and tune performance. The second is a client SDK that exposes Prism&#8217;s capabilities programmatically. 
Together, these give ML and data engineers both an interactive and an automated way to work with Spark.</p><p>Behind those sits Prism Template, a framework for composing Spark jobs out of structured, reusable blocks. Instead of asking every engineer to shape their own Spark application, Prism Template gives them a vocabulary of modules they can chain together with YAML.</p><p>All external requests hit a central API and then flow into the Prism Control Plane. This control plane is responsible for managing job metadata and configuration. It delegates orchestration work to a workflow system powered by Temporal. Temporal workflows handle cluster provisioning, job submission, retries, cancellation and similar runtime tasks.</p><p>The whole stack ties into Snap&#8217;s internal services for metrics, cost tracking and orchestration. The idea is to have one platform that understands both the Spark world and the rest of Snap&#8217;s infrastructure landscape.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2ACo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ce2eff-d230-4cda-8a54-b771e0470e38_1600x1265.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2ACo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ce2eff-d230-4cda-8a54-b771e0470e38_1600x1265.avif 424w, https://substackcdn.com/image/fetch/$s_!2ACo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ce2eff-d230-4cda-8a54-b771e0470e38_1600x1265.avif 848w, 
https://substackcdn.com/image/fetch/$s_!2ACo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ce2eff-d230-4cda-8a54-b771e0470e38_1600x1265.avif 1272w, https://substackcdn.com/image/fetch/$s_!2ACo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ce2eff-d230-4cda-8a54-b771e0470e38_1600x1265.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2ACo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ce2eff-d230-4cda-8a54-b771e0470e38_1600x1265.avif" width="1456" height="1151" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3ce2eff-d230-4cda-8a54-b771e0470e38_1600x1265.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1151,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33398,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/179211962?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ce2eff-d230-4cda-8a54-b771e0470e38_1600x1265.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2ACo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ce2eff-d230-4cda-8a54-b771e0470e38_1600x1265.avif 424w, 
https://substackcdn.com/image/fetch/$s_!2ACo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ce2eff-d230-4cda-8a54-b771e0470e38_1600x1265.avif 848w, https://substackcdn.com/image/fetch/$s_!2ACo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ce2eff-d230-4cda-8a54-b771e0470e38_1600x1265.avif 1272w, https://substackcdn.com/image/fetch/$s_!2ACo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ce2eff-d230-4cda-8a54-b771e0470e38_1600x1265.avif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Prism architecture (Source: Snap)</figcaption></figure></div><p><strong>A single home for Spark users: Prism UI console</strong></p><p>Before the Prism UI, Spark users at Snap lived inside a messy toolkit. They had an infra-owned library and CLI to standardise job submission, then Spark UI and History Server to debug jobs, HDFS tooling for storage, Dataproc Console and Stackdriver for cloud-level views, and separate internal systems for metrics and cost.</p><p>Each of these tools did something useful, but taken together, they were a scattered experience. New users had to learn not only Spark, but also the map of where to click when something went wrong. Even experienced teams were wasting time stitching context together from five tabs.</p><p>The Prism UI Console was built to compress this spread. It gives Spark users a single place to search jobs, view runs, understand configuration, inspect metrics and author new work. Platform teams now have a clear surface to invest in, and engineers have one source of truth for their Spark workloads.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;1753870b-b071-44e3-9630-3b5947620d00&quot;,&quot;duration&quot;:null}"></div><p>The result is lower friction, faster onboarding and a much clearer path for incremental usability improvements in the future.</p><p><strong>What you can do in the Prism UI</strong></p><p>The Prism UI is the primary surface area for Spark users at Snap. Key capabilities include:</p><ul><li><p><strong>Unified job search: </strong>A central search page lets users filter by job name, namespace, cluster ID, and other attributes. 
When something breaks, they no longer have to bookmark multiple systems.</p></li><li><p><strong>Metadata storage: </strong>Job and cluster metadata, such as configurations, metrics, and lineage, are stored in a scalable backend. This supports analytics, audits, and better platform decisions.</p></li><li><p><strong>Logical job grouping for trend analysis: </strong>Jobs are grouped by orchestration task IDs, for example from Airflow or Kubeflow. This makes it easy to look at long-term trends in runtime, cost, and resource use for a specific workflow.</p></li><li><p><strong>Integrated real-time cost estimation: </strong>Through integration with internal billing systems, users can see cost estimates while jobs run. This is especially helpful during heavy experimentation when budgets matter.</p></li><li><p><strong>One-click utilities and deep links: </strong>Utilities like job cloning, as well as deep links into Spark UI, logs, and output tables, make iteration and debugging faster.</p></li><li><p><strong>Integrated job authoring: </strong>Users can:</p><ul><li><p>Configure sources and sinks with built-in support for Iceberg, TFRecords, Parquet, and BigQuery</p></li><li><p>Use an SQL editor with autocomplete powered by a metastore-aware schema integration</p></li><li><p>Pick preconfigured job profiles created by Spark experts</p></li><li><p>Jump directly to the Lakehouse UI for Iceberg-backed outputs and query them via Trino</p></li><li><p>Create low-code jobs using Prism Templates</p></li></ul></li></ul><p>This is where most ML users feel the impact. Instead of wrestling with scattered tools, they have a single console designed for how they work.</p><div><hr></div><h4><strong>Prism templates: opinionated Spark jobs without the boilerplate</strong></h4><p>Spark&#8217;s flexibility is both a strength and a risk. The same workload can be written in very different ways, which leads to wide differences in structure, resource usage and maintainability. 
In a large organisation, that turns into a support headache.</p><p>Prism Template is Snap&#8217;s way of putting structure on top of that flexibility. Instead of everyone writing full Spark applications, users define jobs in YAML using reusable modules and standardised patterns. The platform takes ownership of job bootstrap, configuration and core logic wiring.</p><p>This approach makes experimentation easier for ML engineers. They can assemble pipelines quickly without having to understand every detail of Spark&#8217;s internals. Later on, as jobs move toward production, teams can adjust centralised configurations and modules to scale more gracefully, without rewriting their application code.</p><p><strong>Why templates matter in practice</strong></p><p>The core benefits of this approach:</p><ul><li><p><strong>Simplified job authoring: </strong>Users describe jobs via a YAML file instead of writing low-level Spark boilerplate. The same definition can move from local development to staging and then production.</p></li><li><p><strong>High-quality, reusable components: </strong>The platform ships with modules that cover tasks like:</p><ul><li><p>Iceberg ingestion</p></li><li><p>Feature extraction</p></li><li><p>Sequence building and manipulation</p></li></ul><p>These modules encode best practices, which reduces user errors and keeps jobs consistent.</p></li><li><p><strong>Integrated observability and tooling: </strong>Because Prism owns the bootstrap and core modules, it can inject metrics, logging, and other operational hooks in a uniform way.</p></li><li><p><strong>Managed versioning: </strong>Job definitions, driver JARs, and plugin JARs are versioned and managed centrally. 
This supports safe upgrades and stable behavior across environments.</p></li><li><p><strong>Customisable templates: </strong>Users can start from pre-built templates, then add their own parameters or chain modules in new ways.</p></li></ul><p>The example below shows a Prism Template YAML snippet that combines two modules in one job: one that builds an ordered sequence column and one that ingests the result into an Iceberg table.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4ziz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea38cec-5aad-4034-9a08-8e37ac4dec38_1293x783.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4ziz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea38cec-5aad-4034-9a08-8e37ac4dec38_1293x783.png 424w, https://substackcdn.com/image/fetch/$s_!4ziz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea38cec-5aad-4034-9a08-8e37ac4dec38_1293x783.png 848w, https://substackcdn.com/image/fetch/$s_!4ziz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea38cec-5aad-4034-9a08-8e37ac4dec38_1293x783.png 1272w, https://substackcdn.com/image/fetch/$s_!4ziz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea38cec-5aad-4034-9a08-8e37ac4dec38_1293x783.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!4ziz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea38cec-5aad-4034-9a08-8e37ac4dec38_1293x783.png" width="1293" height="783" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ea38cec-5aad-4034-9a08-8e37ac4dec38_1293x783.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:783,&quot;width&quot;:1293,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54191,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/179211962?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea38cec-5aad-4034-9a08-8e37ac4dec38_1293x783.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4ziz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea38cec-5aad-4034-9a08-8e37ac4dec38_1293x783.png 424w, https://substackcdn.com/image/fetch/$s_!4ziz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea38cec-5aad-4034-9a08-8e37ac4dec38_1293x783.png 848w, https://substackcdn.com/image/fetch/$s_!4ziz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea38cec-5aad-4034-9a08-8e37ac4dec38_1293x783.png 1272w, https://substackcdn.com/image/fetch/$s_!4ziz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea38cec-5aad-4034-9a08-8e37ac4dec38_1293x783.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The structure is easy to read and the heavy logic lives in reusable modules.</p><div><hr></div><h4><strong>The Prism control plane: making Spark feel serverless</strong></h4><p>The first take on a control plane at Snap was lean. It wrapped <a href="https://docs.cloud.google.com/dataproc/docs/reference/rest">Dataproc APIs</a>, handled some permissions and exposed separate concepts for clusters and jobs. Orchestration tools such as Airflow still had to own cluster lifecycle, including creation, reuse, teardown and failure handling.</p><p>At small scale, that model works. 
As usage grows, human-managed cluster lifecycle turns into a liability. Teams end up carrying subtle differences in how they handle errors and retries. Operational load rises. Reliability drops.</p><p>The team redesigned the control plane around a different principle: one simple job submission interface that hides cluster lifecycle.</p><p>The redesigned system:</p><ul><li><p>Presents a single API endpoint for job submission</p></li><li><p>Internally handles cluster provisioning, monitoring, retries, and shutdown</p></li><li><p>Uses a workflow engine built on Temporal to orchestrate these steps</p></li></ul><p>The division of responsibilities is clear:</p><ul><li><p>The control plane manages metadata and configuration</p></li><li><p>Temporal workflows manage the runtime orchestration</p></li></ul><p>This improves reliability and reduces cognitive load for users. It also creates a base for smarter features such as autotuning and intelligent retry policies, since the platform now owns the whole lifecycle rather than just parts of it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xjuv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0401a0c7-1c45-4787-ad78-5d4b191350e0_1365x1035.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xjuv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0401a0c7-1c45-4787-ad78-5d4b191350e0_1365x1035.avif 424w, https://substackcdn.com/image/fetch/$s_!Xjuv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0401a0c7-1c45-4787-ad78-5d4b191350e0_1365x1035.avif 848w, 
https://substackcdn.com/image/fetch/$s_!Xjuv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0401a0c7-1c45-4787-ad78-5d4b191350e0_1365x1035.avif 1272w, https://substackcdn.com/image/fetch/$s_!Xjuv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0401a0c7-1c45-4787-ad78-5d4b191350e0_1365x1035.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xjuv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0401a0c7-1c45-4787-ad78-5d4b191350e0_1365x1035.avif" width="1365" height="1035" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0401a0c7-1c45-4787-ad78-5d4b191350e0_1365x1035.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1035,&quot;width&quot;:1365,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31947,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/179211962?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0401a0c7-1c45-4787-ad78-5d4b191350e0_1365x1035.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xjuv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0401a0c7-1c45-4787-ad78-5d4b191350e0_1365x1035.avif 424w, 
https://substackcdn.com/image/fetch/$s_!Xjuv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0401a0c7-1c45-4787-ad78-5d4b191350e0_1365x1035.avif 848w, https://substackcdn.com/image/fetch/$s_!Xjuv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0401a0c7-1c45-4787-ad78-5d4b191350e0_1365x1035.avif 1272w, https://substackcdn.com/image/fetch/$s_!Xjuv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0401a0c7-1c45-4787-ad78-5d4b191350e0_1365x1035.avif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Prism control plane (Source: Snap)</figcaption></figure></div><div><hr></div><h4><strong>Metrics and autotuning: turning Spark signals into smarter defaults</strong></h4><p>Spark and Dataproc expose a large number of metrics out of the box. The problem is not availability; it is usability.</p><p>The raw metrics:</p><ul><li><p>Exist at multiple levels: job, stage, task, cluster</p></li><li><p>Are not always structured for time-series analysis</p></li><li><p>Use inconsistent naming and retention policies across sources</p></li><li><p>Are difficult to use as inputs for automation</p></li></ul><p>To fix this, Snap built a dedicated metrics system for Spark workloads in Prism.</p><p>This system:</p><ul><li><p>Ingests selected signals from jobs, clusters, and infrastructure</p></li><li><p>Normalizes them into a coherent schema</p></li><li><p>Stores them in a centralized Spanner database for durability and consistency</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TpXl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6149b1a6-ba47-4779-ba1a-9749453d186f_1999x740.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TpXl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6149b1a6-ba47-4779-ba1a-9749453d186f_1999x740.avif 424w, https://substackcdn.com/image/fetch/$s_!TpXl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6149b1a6-ba47-4779-ba1a-9749453d186f_1999x740.avif 848w, 
https://substackcdn.com/image/fetch/$s_!TpXl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6149b1a6-ba47-4779-ba1a-9749453d186f_1999x740.avif 1272w, https://substackcdn.com/image/fetch/$s_!TpXl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6149b1a6-ba47-4779-ba1a-9749453d186f_1999x740.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TpXl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6149b1a6-ba47-4779-ba1a-9749453d186f_1999x740.avif" width="1456" height="539" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6149b1a6-ba47-4779-ba1a-9749453d186f_1999x740.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:539,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:28905,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/179211962?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6149b1a6-ba47-4779-ba1a-9749453d186f_1999x740.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TpXl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6149b1a6-ba47-4779-ba1a-9749453d186f_1999x740.avif 424w, 
https://substackcdn.com/image/fetch/$s_!TpXl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6149b1a6-ba47-4779-ba1a-9749453d186f_1999x740.avif 848w, https://substackcdn.com/image/fetch/$s_!TpXl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6149b1a6-ba47-4779-ba1a-9749453d186f_1999x740.avif 1272w, https://substackcdn.com/image/fetch/$s_!TpXl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6149b1a6-ba47-4779-ba1a-9749453d186f_1999x740.avif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Prism metric architecture (Source: Snap)</figcaption></figure></div><p>With this foundation, Prism can:</p><ul><li><p>Show actionable metrics in the UI console</p></li><li><p>Back features like autoscaling and intelligent retries</p></li><li><p>Support autotuning features that adjust configuration based on observed behavior</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wh4a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe16b71b3-e0c7-4f1c-bf75-ca2eb85fd0c7_1065x800.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wh4a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe16b71b3-e0c7-4f1c-bf75-ca2eb85fd0c7_1065x800.avif 424w, https://substackcdn.com/image/fetch/$s_!wh4a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe16b71b3-e0c7-4f1c-bf75-ca2eb85fd0c7_1065x800.avif 848w, https://substackcdn.com/image/fetch/$s_!wh4a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe16b71b3-e0c7-4f1c-bf75-ca2eb85fd0c7_1065x800.avif 1272w, https://substackcdn.com/image/fetch/$s_!wh4a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe16b71b3-e0c7-4f1c-bf75-ca2eb85fd0c7_1065x800.avif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!wh4a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe16b71b3-e0c7-4f1c-bf75-ca2eb85fd0c7_1065x800.avif" width="1065" height="800" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e16b71b3-e0c7-4f1c-bf75-ca2eb85fd0c7_1065x800.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:1065,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:24456,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/179211962?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe16b71b3-e0c7-4f1c-bf75-ca2eb85fd0c7_1065x800.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wh4a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe16b71b3-e0c7-4f1c-bf75-ca2eb85fd0c7_1065x800.avif 424w, https://substackcdn.com/image/fetch/$s_!wh4a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe16b71b3-e0c7-4f1c-bf75-ca2eb85fd0c7_1065x800.avif 848w, https://substackcdn.com/image/fetch/$s_!wh4a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe16b71b3-e0c7-4f1c-bf75-ca2eb85fd0c7_1065x800.avif 1272w, https://substackcdn.com/image/fetch/$s_!wh4a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe16b71b3-e0c7-4f1c-bf75-ca2eb85fd0c7_1065x800.avif 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Prism metrics UI (Source: Snap)</figcaption></figure></div><p>Importantly, this metrics work was done in parallel with the UI console, so both evolved together. The result is a unified experience where what the user sees and what the automation uses come from the same underlying system.</p><div><hr></div><h4><strong>How Prism spread across Snap</strong></h4><p>The clearest sign that a platform is working is usage. 
Prism&#8217;s daily job count has climbed from single digits to several thousand, with peaks above 10,000 jobs.</p><p>The pattern of adoption falls into two main buckets.</p><ol><li><p><strong>Direct use by advanced Spark teams: </strong>Teams with complex Spark needs use Prism directly. Their workloads often involve:</p></li></ol><ul><li><p>Large joins</p></li><li><p>Tight coupling with specific data models</p></li><li><p>Custom logic that does not fit into a narrow &#8220;standard pipeline&#8221; box</p></li></ul><p>These teams still get value from Prism&#8217;s abstractions and control plane, but they stay close to the underlying capabilities.</p><ol start="2"><li><p><strong>Integration into internal platforms: </strong>Other teams do not think in terms of Spark at all. They work with internal tools for:</p></li></ol><ul><li><p>ML data preparation</p></li><li><p>Feature engineering</p></li><li><p>Experimentation</p></li></ul><p>Those tools, in turn, embed Prism. The teams&#8217; users get domain-specific interfaces, while Prism quietly runs the heavy Spark work in the background.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WJDg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F146482cb-18cf-4df7-94aa-1868ed8aaae0_1390x800.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WJDg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F146482cb-18cf-4df7-94aa-1868ed8aaae0_1390x800.avif 424w, https://substackcdn.com/image/fetch/$s_!WJDg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F146482cb-18cf-4df7-94aa-1868ed8aaae0_1390x800.avif 848w, 
https://substackcdn.com/image/fetch/$s_!WJDg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F146482cb-18cf-4df7-94aa-1868ed8aaae0_1390x800.avif 1272w, https://substackcdn.com/image/fetch/$s_!WJDg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F146482cb-18cf-4df7-94aa-1868ed8aaae0_1390x800.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WJDg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F146482cb-18cf-4df7-94aa-1868ed8aaae0_1390x800.avif" width="1390" height="800" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/146482cb-18cf-4df7-94aa-1868ed8aaae0_1390x800.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:1390,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:21651,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/179211962?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F146482cb-18cf-4df7-94aa-1868ed8aaae0_1390x800.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WJDg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F146482cb-18cf-4df7-94aa-1868ed8aaae0_1390x800.avif 424w, 
https://substackcdn.com/image/fetch/$s_!WJDg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F146482cb-18cf-4df7-94aa-1868ed8aaae0_1390x800.avif 848w, https://substackcdn.com/image/fetch/$s_!WJDg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F146482cb-18cf-4df7-94aa-1868ed8aaae0_1390x800.avif 1272w, https://substackcdn.com/image/fetch/$s_!WJDg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F146482cb-18cf-4df7-94aa-1868ed8aaae0_1390x800.avif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Supporting both direct and embedded use is a crucial design choice. It lets Prism spread across Snap without forcing every user into the same interface or abstraction level.</p><div><hr></div><h3>The full scoop</h3><p>To learn more about this, check out <a href="https://eng.snap.com/prism">Snapchat's Engineering Blog</a> post on this topic.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">If you liked this post and don&#8217;t want to miss the next one, subscribe to Data Tinkerer!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>If you are already subscribed and enjoyed the article, please give it a like and/or share it with others. I really appreciate it &#128591;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/p/how-snap-rebuilt-its-ml-platform-to-handle-10000-daily-spark-jobs?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.datatinkerer.io/p/how-snap-rebuilt-its-ml-platform-to-handle-10000-daily-spark-jobs?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h3>Keep learning</h3><div 
class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;8b717c15-913f-4e54-91a7-fb3f26e15721&quot;,&quot;caption&quot;:&quot;How do you keep data fresh for millions of merchants when you&#8217;re streaming from 100+ MySQL shards?<br /><br />Shopify&#8217;s answer: a 400TB Change Data Capture platform that pushes up to 100k events a second.<br /><br />This post dives into the trade-offs, the challenges and the lessons learned from building CDC at scale.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How Shopify Uses Change Data Capture to Serve Millions of Merchants&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:291590442,&quot;name&quot;:&quot;Data Tinkerer&quot;,&quot;bio&quot;:&quot;Ex-head of analytics sharing deep dives and learnings about AI and all things data (science, engineering, analysis)&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83d3bbe0-5fb8-4f8d-9b74-036abbd6fec9_500x500.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-09-18T07:53:42.206Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1730818874996-dea4bddf5554?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzaG9waWZ5fGVufDB8fHx8MTc1ODE4MDY0NHww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.datatinkerer.io/p/how-shopify-uses-change-data-capture-to-serve-millions-of-merchants&quot;,&quot;section_name&quot;:&quot;Data Engineering&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:173822667,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:10,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3422740,&quot;publication_name&quot;:&quot;Data 
Tinkerer&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!JEdj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bea5ccd-f356-4154-a1db-16268300510e_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;885df221-54f0-4845-8a4f-af4f0d0e2648&quot;,&quot;caption&quot;:&quot;Cold starts, version drift and clunky notebooks, Grab hit all the classic headaches of streaming at scale.<br /><br />Here&#8217;s how they fixed it with FlinkSQL + Kafka.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How Grab Shrunk Real-Time Queries from 5 Minutes to 1 with FlinkSQL and Kafka&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:291590442,&quot;name&quot;:&quot;Data Tinkerer&quot;,&quot;bio&quot;:&quot;Ex-head of analytics sharing deep dives and learnings about AI and all things data (science, engineering, analysis)&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83d3bbe0-5fb8-4f8d-9b74-036abbd6fec9_500x500.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-08-21T06:45:35.215Z&quot;,&quot;cover_image&quot;:&quot;https://images.unsplash.com/photo-1587476351660-e9fa4bb8b26c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxncmFifGVufDB8fHx8MTc1NTQ5OTAzOHww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.datatinkerer.io/p/how-grab-shrunk-real-time-queries-from-five-minutes-to-one&quot;,&quot;section_name&quot;:&quot;Data 
Engineering&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:171226398,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:4,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data Tinkerer&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!JEdj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bea5ccd-f356-4154-a1db-16268300510e_500x500.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div>]]></content:encoded></item><item><title><![CDATA[From Marketing to Data Engineering: How I Made the Switch]]></title><description><![CDATA[How one marketer followed the trail of tracking pixels into pipelines and built a career turning messy data into usable systems.]]></description><link>https://www.datatinkerer.io/p/from-marketing-to-data-engineering-how-i-made-the-switch</link><guid isPermaLink="false">https://www.datatinkerer.io/p/from-marketing-to-data-engineering-how-i-made-the-switch</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Thu, 16 Oct 2025 04:01:28 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9f252334-7437-40b5-9f82-08c981de2f6d_761x764.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fellow Data Tinkerers,</p><p>Lately, I&#8217;ve been thinking about starting a new series where people working in data share how they got here, what they&#8217;ve learned along the way and what their day-to-day looks like.</p><p>So, I&#8217;m kicking it off today with <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Alejandro 
Aboy&quot;,&quot;id&quot;:22949723,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!u1Ao!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdca2c63d-9f5e-4cd3-99ac-7d8e71dc114b_1024x1024.jpeg&quot;,&quot;uuid&quot;:&quot;ee6ab692-9d69-4714-8986-9b599e2b5557&quot;}" data-component-name="MentionToDOM"></span>, Senior Data Engineer at Workpath and writer of <em>The Pipe and The Line</em> newsletter.</p><div class="embedded-publication-wrap" data-attrs="{&quot;id&quot;:1196229,&quot;name&quot;:&quot;The Pipe &amp; The Line&quot;,&quot;logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!vmrQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d5b2131-da28-4621-ad6f-9574cbc41a1e_500x500.png&quot;,&quot;base_url&quot;:&quot;https://thepipeandtheline.substack.com&quot;,&quot;hero_text&quot;:&quot;Hands-on guides, tools, and experiments to sharpen your Data &amp; AI Engineering skills from someone who learned it all in the wild.&quot;,&quot;author_name&quot;:&quot;Alejandro Aboy&quot;,&quot;show_subscribe&quot;:true,&quot;logo_bg_color&quot;:&quot;#131826&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPublicationToDOMWithSubscribe"><div class="embedded-publication show-subscribe"><a class="embedded-publication-link-part" native="true" href="https://thepipeandtheline.substack.com?utm_source=substack&amp;utm_campaign=publication_embed&amp;utm_medium=web"><img class="embedded-publication-logo" src="https://substackcdn.com/image/fetch/$s_!vmrQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d5b2131-da28-4621-ad6f-9574cbc41a1e_500x500.png" width="56" height="56" style="background-color: rgb(19, 24, 38);"><span class="embedded-publication-name">The Pipe &amp; The Line</span><div 
class="embedded-publication-hero-text">Hands-on guides, tools, and experiments to sharpen your Data &amp; AI Engineering skills from someone who learned it all in the wild.</div><div class="embedded-publication-author-name">By Alejandro Aboy</div></a><form class="embedded-publication-subscribe" method="GET" action="https://thepipeandtheline.substack.com/subscribe?"><input type="hidden" name="source" value="publication-embed"><input type="hidden" name="autoSubmit" value="true"><input type="email" class="email-input" name="email" placeholder="Type your email..."><input type="submit" class="button primary" value="Subscribe"></form></div></div><p>We talked about how he went from marketing to data engineering, what his workflow looks like, why he was called an <em>octopus</em> and why he thinks &#8220;big data&#8221; is a fool&#8217;s errand for most teams.</p><p>So without further ado, let&#8217;s get into it!</p>
      <p>
          <a href="https://www.datatinkerer.io/p/from-marketing-to-data-engineering-how-i-made-the-switch">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Shopify Uses Change Data Capture to Serve Millions of Merchants]]></title><description><![CDATA[From batch queries to streaming 100k records per second during peak load]]></description><link>https://www.datatinkerer.io/p/how-shopify-uses-change-data-capture-to-serve-millions-of-merchants</link><guid isPermaLink="false">https://www.datatinkerer.io/p/how-shopify-uses-change-data-capture-to-serve-millions-of-merchants</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Thu, 18 Sep 2025 07:53:42 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1730818874996-dea4bddf5554?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1fHxzaG9waWZ5fGVufDB8fHx8MTc1ODE4MDY0NHww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fellow Data Tinkerers!</p><p>Today we will look at how Shopify built a real-time data pipeline at 400TB scale.</p><p>But before that, I wanted to share with you what you could unlock if you share Data Tinkerer with just <strong>1 more person</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K9fn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ed777b-9a73-44d8-9095-d6c0df4aa853_800x402.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K9fn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ed777b-9a73-44d8-9095-d6c0df4aa853_800x402.gif 424w, 
https://substackcdn.com/image/fetch/$s_!K9fn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ed777b-9a73-44d8-9095-d6c0df4aa853_800x402.gif 848w, https://substackcdn.com/image/fetch/$s_!K9fn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ed777b-9a73-44d8-9095-d6c0df4aa853_800x402.gif 1272w, https://substackcdn.com/image/fetch/$s_!K9fn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ed777b-9a73-44d8-9095-d6c0df4aa853_800x402.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K9fn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ed777b-9a73-44d8-9095-d6c0df4aa853_800x402.gif" width="800" height="402" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1ed777b-9a73-44d8-9095-d6c0df4aa853_800x402.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:402,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3369150,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/173822667?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ed777b-9a73-44d8-9095-d6c0df4aa853_800x402.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K9fn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ed777b-9a73-44d8-9095-d6c0df4aa853_800x402.gif 424w, 
https://substackcdn.com/image/fetch/$s_!K9fn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ed777b-9a73-44d8-9095-d6c0df4aa853_800x402.gif 848w, https://substackcdn.com/image/fetch/$s_!K9fn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ed777b-9a73-44d8-9095-d6c0df4aa853_800x402.gif 1272w, https://substackcdn.com/image/fetch/$s_!K9fn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ed777b-9a73-44d8-9095-d6c0df4aa853_800x402.gif 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There are 100+ resources to learn all things data (science, engineering, analysis). They include videos, courses and projects, and can be filtered by tech stack (Python, SQL, Spark, etc.), skill level (Beginner, Intermediate and so on), provider name or free/paid. So if you know other people who like staying up to date on all things data, please share Data Tinkerer with them!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/leaderboard?&amp;utm_source=post&quot;,&quot;text&quot;:&quot;Refer a friend&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.datatinkerer.io/leaderboard?&amp;utm_source=post"><span>Refer a friend</span></a></p><p>Now, with that out of the way, let&#8217;s get to change data capture at Shopify!</p>
      <p>
          <a href="https://www.datatinkerer.io/p/how-shopify-uses-change-data-capture-to-serve-millions-of-merchants">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Grab Shrunk Real-Time Queries from 5 Minutes to 1 with FlinkSQL and Kafka]]></title><description><![CDATA[With SQL as the interface, analysts and engineers can now explore streams and deploy pipelines in under 10 minutes.]]></description><link>https://www.datatinkerer.io/p/how-grab-shrunk-real-time-queries-from-five-minutes-to-one</link><guid isPermaLink="false">https://www.datatinkerer.io/p/how-grab-shrunk-real-time-queries-from-five-minutes-to-one</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Thu, 21 Aug 2025 06:45:35 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1587476351660-e9fa4bb8b26c?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwyfHxncmFifGVufDB8fHx8MTc1NTQ5OTAzOHww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fellow Data Tinkerers!</p><p>Today we will look at how Grab made real-time processing faster for its users. 
</p><p>But before that, I wanted to share with you what you could unlock if you share Data Tinkerer with just <strong>1 more person</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4JBO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc192a623-a624-4963-a1d5-a293bf3f7e08_800x402.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4JBO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc192a623-a624-4963-a1d5-a293bf3f7e08_800x402.gif 424w, https://substackcdn.com/image/fetch/$s_!4JBO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc192a623-a624-4963-a1d5-a293bf3f7e08_800x402.gif 848w, https://substackcdn.com/image/fetch/$s_!4JBO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc192a623-a624-4963-a1d5-a293bf3f7e08_800x402.gif 1272w, https://substackcdn.com/image/fetch/$s_!4JBO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc192a623-a624-4963-a1d5-a293bf3f7e08_800x402.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4JBO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc192a623-a624-4963-a1d5-a293bf3f7e08_800x402.gif" width="800" height="402" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c192a623-a624-4963-a1d5-a293bf3f7e08_800x402.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:402,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3369150,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/171226398?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc192a623-a624-4963-a1d5-a293bf3f7e08_800x402.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4JBO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc192a623-a624-4963-a1d5-a293bf3f7e08_800x402.gif 424w, https://substackcdn.com/image/fetch/$s_!4JBO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc192a623-a624-4963-a1d5-a293bf3f7e08_800x402.gif 848w, https://substackcdn.com/image/fetch/$s_!4JBO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc192a623-a624-4963-a1d5-a293bf3f7e08_800x402.gif 1272w, https://substackcdn.com/image/fetch/$s_!4JBO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc192a623-a624-4963-a1d5-a293bf3f7e08_800x402.gif 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There are 100+ resources to learn all things data (science, engineering, analysis). They include videos, courses and projects, and can be filtered by tech stack (Python, SQL, Spark, etc.), skill level (Beginner, Intermediate and so on), provider name or free/paid. So if you know other people who like staying up to date on all things data, please share Data Tinkerer with them!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/leaderboard?&amp;utm_source=post&quot;,&quot;text&quot;:&quot;Refer a friend&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.datatinkerer.io/leaderboard?&amp;utm_source=post"><span>Refer a friend</span></a></p><p>Now, with that out of the way, let&#8217;s get to real-time processing at Grab!</p>
      <p>
          <a href="https://www.datatinkerer.io/p/how-grab-shrunk-real-time-queries-from-five-minutes-to-one">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Expedia Monitors 1000+ A/B Tests in Real Time with Flink and Kafka]]></title><description><![CDATA[A look inside the pipeline that spots underperforming experiments in minutes and not days]]></description><link>https://www.datatinkerer.io/p/how-expedia-monitors-1000-ab-tests-in-real-time-with-flink-and-kafka</link><guid isPermaLink="false">https://www.datatinkerer.io/p/how-expedia-monitors-1000-ab-tests-in-real-time-with-flink-and-kafka</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Thu, 24 Jul 2025 07:55:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jZ_b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F961833e9-d13b-4216-a74d-8aebaa3c9fc1_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fellow Data Tinkerers!</p><p>Today we will look at how Expedia Group monitors A/B tests at large scale.</p><p>But before that, I wanted to share an example of what you could unlock if you share Data Tinkerer with just <strong>2 other people </strong>and they subscribe to the newsletter.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b1ZD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d63d619-0614-4aba-ae5d-81e6863b4663_1650x1275.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b1ZD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d63d619-0614-4aba-ae5d-81e6863b4663_1650x1275.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!b1ZD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d63d619-0614-4aba-ae5d-81e6863b4663_1650x1275.jpeg 848w, https://substackcdn.com/image/fetch/$s_!b1ZD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d63d619-0614-4aba-ae5d-81e6863b4663_1650x1275.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!b1ZD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d63d619-0614-4aba-ae5d-81e6863b4663_1650x1275.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b1ZD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d63d619-0614-4aba-ae5d-81e6863b4663_1650x1275.jpeg" width="1456" height="1125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d63d619-0614-4aba-ae5d-81e6863b4663_1650x1275.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1125,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1291918,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/169094273?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d63d619-0614-4aba-ae5d-81e6863b4663_1650x1275.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!b1ZD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d63d619-0614-4aba-ae5d-81e6863b4663_1650x1275.jpeg 424w, https://substackcdn.com/image/fetch/$s_!b1ZD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d63d619-0614-4aba-ae5d-81e6863b4663_1650x1275.jpeg 848w, https://substackcdn.com/image/fetch/$s_!b1ZD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d63d619-0614-4aba-ae5d-81e6863b4663_1650x1275.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!b1ZD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d63d619-0614-4aba-ae5d-81e6863b4663_1650x1275.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There are 100+ more cheat sheets covering everything from Python, R, SQL and Spark to Power BI, Tableau, Git and many more. So if you know other people who like staying up to date on all things data, please share Data Tinkerer with them!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/leaderboard?&amp;referrer_token=4tlsmi&amp;utm_source=post&quot;,&quot;text&quot;:&quot;Refer a friend&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.datatinkerer.io/leaderboard?&amp;referrer_token=4tlsmi&amp;utm_source=post"><span>Refer a friend</span></a></p><p>Now, with that out of the way, let&#8217;s get to the real-time monitoring of A/B tests at Expedia!</p>
      <p>
          <a href="https://www.datatinkerer.io/p/how-expedia-monitors-1000-ab-tests-in-real-time-with-flink-and-kafka">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Bolt Reconciles €2B in Revenue Using Airflow, Spark and dbt]]></title><description><![CDATA[A look under the hood of a multi-country finance pipeline that ingests raw data, models discrepancies and reconciles cash flows at scale.]]></description><link>https://www.datatinkerer.io/p/how-bolt-reconciles-2b-in-revenue-using-airflow-spark-and-dbt</link><guid isPermaLink="false">https://www.datatinkerer.io/p/how-bolt-reconciles-2b-in-revenue-using-airflow-spark-and-dbt</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Thu, 03 Jul 2025 09:53:52 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1715351123666-6a9c4f180c54?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw1MXx8Ym9sdHxlbnwwfHx8fDE3NTE0MzU2NjN8MA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fellow Data Tinkerers!</p><p>Today we will look at how Bolt tracks payments at scale.</p><p>But before that, I wanted to share an example of what you could unlock if you share Data Tinkerer with just <strong>2 other people </strong>and they subscribe to the newsletter.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ufW-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea22c22d-33fa-419b-a1db-9e654a2510d6_1650x1275.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ufW-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea22c22d-33fa-419b-a1db-9e654a2510d6_1650x1275.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!ufW-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea22c22d-33fa-419b-a1db-9e654a2510d6_1650x1275.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ufW-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea22c22d-33fa-419b-a1db-9e654a2510d6_1650x1275.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ufW-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea22c22d-33fa-419b-a1db-9e654a2510d6_1650x1275.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ufW-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea22c22d-33fa-419b-a1db-9e654a2510d6_1650x1275.jpeg" width="1456" height="1125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea22c22d-33fa-419b-a1db-9e654a2510d6_1650x1275.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1125,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1357928,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/167406813?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea22c22d-33fa-419b-a1db-9e654a2510d6_1650x1275.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!ufW-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea22c22d-33fa-419b-a1db-9e654a2510d6_1650x1275.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ufW-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea22c22d-33fa-419b-a1db-9e654a2510d6_1650x1275.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ufW-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea22c22d-33fa-419b-a1db-9e654a2510d6_1650x1275.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ufW-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea22c22d-33fa-419b-a1db-9e654a2510d6_1650x1275.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>There are 100+ more cheat sheets covering everything from Python, R, SQL, Spark to Power BI, Tableau, Git and many more. So if you know other people who like staying up to date on all things data, please share Data Tinkerer with them!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/leaderboard?&amp;referrer_token=4tlsmi&amp;utm_source=post&quot;,&quot;text&quot;:&quot;Refer a friend&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.datatinkerer.io/leaderboard?&amp;referrer_token=4tlsmi&amp;utm_source=post"><span>Refer a friend</span></a></p><p>Now, with that out of the way, let&#8217;s get to how Bolt deals with processing payments from millions of customers</p>
      <p>
          <a href="https://www.datatinkerer.io/p/how-bolt-reconciles-2b-in-revenue-using-airflow-spark-and-dbt">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Flipkart Scaled Delivery Date Calculation 10x While Slashing Latency by 90%]]></title><description><![CDATA[Optimising for 100 items in 100ms without breaking the backend (or the bank)]]></description><link>https://www.datatinkerer.io/p/how-flipkart-scaled-delivery-date-calculation-10x-while-slashing-latency-by-90-percent</link><guid isPermaLink="false">https://www.datatinkerer.io/p/how-flipkart-scaled-delivery-date-calculation-10x-while-slashing-latency-by-90-percent</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Fri, 13 Jun 2025 09:10:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e325e31-1188-4171-a6d4-9d88b490de17_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fellow Data Tinkerers!</p><p>Today we will look at how Flipkart solved the problem of calculating delivery date.</p><p>But before that, I wanted to share an example of what you could unlock if you share Data Tinkerer with just <strong>2 other people</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-CMo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf294af-e15c-41c8-9dfd-c0541054526b_3300x2550.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-CMo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf294af-e15c-41c8-9dfd-c0541054526b_3300x2550.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!-CMo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf294af-e15c-41c8-9dfd-c0541054526b_3300x2550.jpeg 848w, https://substackcdn.com/image/fetch/$s_!-CMo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf294af-e15c-41c8-9dfd-c0541054526b_3300x2550.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!-CMo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf294af-e15c-41c8-9dfd-c0541054526b_3300x2550.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-CMo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf294af-e15c-41c8-9dfd-c0541054526b_3300x2550.jpeg" width="1456" height="1125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/adf294af-e15c-41c8-9dfd-c0541054526b_3300x2550.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1125,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2910331,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/165748507?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf294af-e15c-41c8-9dfd-c0541054526b_3300x2550.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!-CMo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf294af-e15c-41c8-9dfd-c0541054526b_3300x2550.jpeg 424w, https://substackcdn.com/image/fetch/$s_!-CMo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf294af-e15c-41c8-9dfd-c0541054526b_3300x2550.jpeg 848w, https://substackcdn.com/image/fetch/$s_!-CMo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf294af-e15c-41c8-9dfd-c0541054526b_3300x2550.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!-CMo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf294af-e15c-41c8-9dfd-c0541054526b_3300x2550.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>There are 100+ more cheat sheets covering everything from Python, R, SQL, Spark to Power BI, Tableau, Git and many more. So if you know other people who like staying up to date on all things data, please share Data Tinkerer with them!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/leaderboard?&amp;referrer_token=4tlsmi&amp;utm_source=post&quot;,&quot;text&quot;:&quot;Refer a friend&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.datatinkerer.io/leaderboard?&amp;referrer_token=4tlsmi&amp;utm_source=post"><span>Refer a friend</span></a></p><p>Now, with that out of the way, let&#8217;s get to Flipkart&#8217;s challenge of calculating delivery date at scale</p>
      <p>
          <a href="https://www.datatinkerer.io/p/how-flipkart-scaled-delivery-date-calculation-10x-while-slashing-latency-by-90-percent">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Notion Brought Order to Its Data Chaos (And Why Their First Catalog Failed)]]></title><description><![CDATA[A behind-the-scenes look at the real challenges, missed steps and what finally made their data catalog work.]]></description><link>https://www.datatinkerer.io/p/how-notion-brought-order-to-its-data-chaos</link><guid isPermaLink="false">https://www.datatinkerer.io/p/how-notion-brought-order-to-its-data-chaos</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Thu, 22 May 2025 09:07:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6QIL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F072d6737-aa00-471d-9ee2-b95adfcb8012_2520x1323.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fellow Data Tinkerers!</p><p>Quick note before this week&#8217;s deep dive. Thanks for reading and subscribing; I really do mean it. If you&#8217;ve got feedback, just hit reply. I read every response.</p><p>Data Tinkerer has always been about sharing what actually works in data, beyond just tools and tech. The deep dives will keep coming, but I want to start spotlighting the stuff we don&#8217;t talk about enough: the day-to-day challenges, the business outcomes and the learnings.</p><p>I want to feature stories from people in data roles: senior data engineers, lead analysts, heads of data, you name it. If you&#8217;ve got a story, lesson, recent technical win or even a battle scar from the data trenches, let&#8217;s get it in front of almost 1,000 smart and engaged peers.</p><p>You don&#8217;t need to be a &#8220;writer&#8221;; I&#8217;ll help your story shine. Plus, guest contributors get a shoutout in the newsletter and on LinkedIn (if you want).</p><p><strong>Keen to share your data story? 
Just reply to this email or message me on Substack and we&#8217;ll tee it up.</strong></p><p>Now &#8230;</p>
      <p>
          <a href="https://www.datatinkerer.io/p/how-notion-brought-order-to-its-data-chaos">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Canva Rebuilt Its Data Pipelines for Billions of Events per Month]]></title><description><![CDATA[What it takes to track usage, pay creators fairly and not drown in incident recovery hell.]]></description><link>https://www.datatinkerer.io/p/how-canva-rebuilt-its-data-pipelines-for-billions-of-events-per-month</link><guid isPermaLink="false">https://www.datatinkerer.io/p/how-canva-rebuilt-its-data-pipelines-for-billions-of-events-per-month</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Thu, 01 May 2025 07:39:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4515c9b-2ca5-4dec-a2b4-98c5358d205a_1920x1080.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fellow Data Tinkerers!</p><p>Today we will look at how Canva solved the surprisingly messy problem of counting at scale</p><p>But before that, I wanted to share an example of what you could unlock if you share Data Tinkerer with just <strong>2 other people</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!46-k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46a3f23-4582-459f-8820-6b06a7d85138_3300x2550.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!46-k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46a3f23-4582-459f-8820-6b06a7d85138_3300x2550.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!46-k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46a3f23-4582-459f-8820-6b06a7d85138_3300x2550.jpeg 848w, https://substackcdn.com/image/fetch/$s_!46-k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46a3f23-4582-459f-8820-6b06a7d85138_3300x2550.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!46-k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46a3f23-4582-459f-8820-6b06a7d85138_3300x2550.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!46-k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46a3f23-4582-459f-8820-6b06a7d85138_3300x2550.jpeg" width="1456" height="1125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d46a3f23-4582-459f-8820-6b06a7d85138_3300x2550.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1125,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2470943,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/162515057?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46a3f23-4582-459f-8820-6b06a7d85138_3300x2550.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!46-k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46a3f23-4582-459f-8820-6b06a7d85138_3300x2550.jpeg 424w, https://substackcdn.com/image/fetch/$s_!46-k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46a3f23-4582-459f-8820-6b06a7d85138_3300x2550.jpeg 848w, https://substackcdn.com/image/fetch/$s_!46-k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46a3f23-4582-459f-8820-6b06a7d85138_3300x2550.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!46-k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd46a3f23-4582-459f-8820-6b06a7d85138_3300x2550.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>There are 100+ more cheat sheets covering everything from Python, R, SQL, Spark to Power BI, Tableau, Git and many more. So if you know other people who like staying up to date on all things data, please share Data Tinkerer with them!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.datatinkerer.io/leaderboard?&amp;referrer_token=4tlsmi&amp;utm_source=post&quot;,&quot;text&quot;:&quot;Refer a friend&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.datatinkerer.io/leaderboard?&amp;referrer_token=4tlsmi&amp;utm_source=post"><span>Refer a friend</span></a></p><p>Now, with that out of the way, let&#8217;s get to Canva&#8217;s work to build a scalable data pipeline</p>
      <p>
          <a href="https://www.datatinkerer.io/p/how-canva-rebuilt-its-data-pipelines-for-billions-of-events-per-month">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Airtable Made Archive Validation Work at Petabyte Scale]]></title><description><![CDATA[They handled billions of rows using joins, hashes and a lot of buckets.]]></description><link>https://www.datatinkerer.io/p/how-airtable-made-archive-validation-work-at-petabyte-scale</link><guid isPermaLink="false">https://www.datatinkerer.io/p/how-airtable-made-archive-validation-work-at-petabyte-scale</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Thu, 10 Apr 2025 06:22:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!T_KN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682b198e-3ef1-40ee-b823-b52f80699458_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T_KN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682b198e-3ef1-40ee-b823-b52f80699458_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T_KN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682b198e-3ef1-40ee-b823-b52f80699458_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!T_KN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682b198e-3ef1-40ee-b823-b52f80699458_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!T_KN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682b198e-3ef1-40ee-b823-b52f80699458_1024x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!T_KN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682b198e-3ef1-40ee-b823-b52f80699458_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T_KN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682b198e-3ef1-40ee-b823-b52f80699458_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/682b198e-3ef1-40ee-b823-b52f80699458_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1708241,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.datatinkerer.io/i/160983286?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682b198e-3ef1-40ee-b823-b52f80699458_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T_KN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682b198e-3ef1-40ee-b823-b52f80699458_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!T_KN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682b198e-3ef1-40ee-b823-b52f80699458_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!T_KN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682b198e-3ef1-40ee-b823-b52f80699458_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!T_KN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F682b198e-3ef1-40ee-b823-b52f80699458_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h3>TL;DR</h3>
      <p>
          <a href="https://www.datatinkerer.io/p/how-airtable-made-archive-validation-work-at-petabyte-scale">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How HubSpot Optimized Logging to Save Millions]]></title><description><![CDATA[By refining log storage and retention, HubSpot reduced costs by 55.7% and improved query performance by 50x]]></description><link>https://www.datatinkerer.io/p/how-hubspot-optimized-logging-to-save-millions</link><guid isPermaLink="false">https://www.datatinkerer.io/p/how-hubspot-optimized-logging-to-save-millions</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Thu, 20 Mar 2025 03:55:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ozzx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd038ac92-6001-4b71-88f3-b1cf0fe98b56_1146x703.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>TL;DR</h3><div><hr></div><h4><strong>Situation</strong></h4><p>HubSpot's backend performance team identified that Amazon S3 storage costs accounted for approximately 45% to 50% of daily expenses, with the 'hubspot-live-logs-prod' bucket alone responsible for 20% of these costs.</p><h4><strong>Task</strong></h4><p>The team aimed to reduce storage costs by addressing the inefficiencies in their logging system, particularly focusing on the large volumes of raw JSON logs that were not being efficiently compacted.</p><h4><strong>Action</strong></h4><ul><li><p><strong>Log Retention Review</strong>: They discovered that raw JSON logs were retained for 730 days, while compressed ORC logs were kept for 460 days. 
Aligning the retention period to 460 days for both formats reduced unnecessary storage.</p></li><li><p><strong>Improved Compression</strong>: By improving their Spark compaction process, they converted a larger share of raw JSON logs to the more storage-efficient ORC format, in which logs were about 5% of the size of the original JSON logs.</p></li></ul><h4><strong>Result</strong></h4><p>These measures led to a 55.7% reduction in monthly JSON log storage costs, translating to annual savings in the seven-figure range. Engineers also saw faster log queries, with some reporting a drop from 30 minutes to just 36 seconds (a 50x speedup).</p><h4><strong>Use Cases</strong></h4><p>Cost monitoring, log retention, log volume reduction</p><h4><strong>Tech Stack/Framework</strong></h4><p>Amazon Athena, Amazon S3, Apache Spark, Apache Mesos, Redash</p><div><hr></div><h3>Explained Further</h3><div><hr></div><h4><strong>Saving Millions on Logging</strong></h4>
      <p>
          <a href="https://www.datatinkerer.io/p/how-hubspot-optimized-logging-to-save-millions">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Scaling Apache Flink: How Reddit Cut Memory Usage by 60%]]></title><description><![CDATA[Optimizing real-time ad validation with field filtering, tiered storage, and infrastructure enhancements.]]></description><link>https://www.datatinkerer.io/p/scaling-apache-flink-how-reddit-cut-memory-usage-by-60-percent</link><guid isPermaLink="false">https://www.datatinkerer.io/p/scaling-apache-flink-how-reddit-cut-memory-usage-by-60-percent</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Wed, 19 Feb 2025 06:18:34 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1616509091215-57bbece93654?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxyZWRkaXR8ZW58MHx8fHwxNzM5OTQ0NTgwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1616509091215-57bbece93654?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxyZWRkaXR8ZW58MHx8fHwxNzM5OTQ0NTgwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1616509091215-57bbece93654?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxyZWRkaXR8ZW58MHx8fHwxNzM5OTQ0NTgwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1616509091215-57bbece93654?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxyZWRkaXR8ZW58MHx8fHwxNzM5OTQ0NTgwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, 
https://images.unsplash.com/photo-1616509091215-57bbece93654?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxyZWRkaXR8ZW58MHx8fHwxNzM5OTQ0NTgwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1616509091215-57bbece93654?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxyZWRkaXR8ZW58MHx8fHwxNzM5OTQ0NTgwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1616509091215-57bbece93654?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxyZWRkaXR8ZW58MHx8fHwxNzM5OTQ0NTgwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="5184" height="3888" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1616509091215-57bbece93654?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxyZWRkaXR8ZW58MHx8fHwxNzM5OTQ0NTgwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3888,&quot;width&quot;:5184,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;red and white 8 logo&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="red and white 8 logo" title="red and white 8 logo" srcset="https://images.unsplash.com/photo-1616509091215-57bbece93654?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxyZWRkaXR8ZW58MHx8fHwxNzM5OTQ0NTgwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, 
https://images.unsplash.com/photo-1616509091215-57bbece93654?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxyZWRkaXR8ZW58MHx8fHwxNzM5OTQ0NTgwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1616509091215-57bbece93654?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxyZWRkaXR8ZW58MHx8fHwxNzM5OTQ0NTgwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1616509091215-57bbece93654?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxfHxyZWRkaXR8ZW58MHx8fHwxNzM5OTQ0NTgwfDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Photo by <a href="true">Brett Jordan</a> on <a href="https://unsplash.com">Unsplash</a></figcaption></figure></div><div><hr></div><h4><strong>Situation</strong></h4><p>Reddit's advertising platform processes thousands of ad engagement events per second, necessitating real-time validation and enrichment to ensure accurate reporting and prevent budget overdelivery.</p><h4><strong>Task</strong></h4><p>Develop a scalable, real-time ad event validation system capable of efficiently handling high event volumes while maintaining performance and reliability.</p><h4><strong>Action</strong></h4><p>The engineering team developed the Ad Events Validator (AEV) using Apache Flink to correlate ad server events with user engagement events. To overcome issues related to large state sizes and resource demands, they implemented:</p><ul><li><p><strong>Field Filtering:</strong> Analyzed downstream data consumption and established a field allowlist that cut the event payload size by 90%, reducing CPU and memory usage by 25% and 60%, respectively.</p></li><li><p><strong>Tiered State Storage:</strong> Integrated Apache Cassandra for external state storage, reducing in-memory state size and making checkpointing and system recovery more efficient.</p></li></ul><h4><strong>Result</strong></h4><p>These enhancements resulted in a more scalable and cost-efficient AEV system, improving overall performance and operational effectiveness.</p><h4><strong>Use Cases</strong></h4><p>Real-Time Event Validation, Data Enrichment, Resource Optimization</p><h4><strong>Tech Stack/Framework</strong></h4><p>Apache Flink, Apache Kafka, Apache Cassandra</p><div><hr></div><h3>Explained Further</h3><div><hr></div><h4><strong>Background</strong></h4><p>Reddit processes thousands of ad 
engagement events per second. These events require validation and enrichment before being sent to downstream systems. Key components of this validation process include applying a standardized look-back window and filtering out suspected invalid traffic.</p><p>In addition to a batch validation pipeline, a near real-time pipeline improves budget spend accuracy and provides advertisers with real-time insights into campaign performance. This real-time component, known as the <strong>Ad Events Validator (AEV)</strong>, is built using Apache Flink. AEV matches ad server events with engagement events and writes the validated results to a separate Kafka topic for downstream consumption. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6utL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441259f5-fa8b-48c2-a11e-6a8df3128807_1080x441.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6utL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441259f5-fa8b-48c2-a11e-6a8df3128807_1080x441.webp 424w, https://substackcdn.com/image/fetch/$s_!6utL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441259f5-fa8b-48c2-a11e-6a8df3128807_1080x441.webp 848w, https://substackcdn.com/image/fetch/$s_!6utL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441259f5-fa8b-48c2-a11e-6a8df3128807_1080x441.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!6utL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441259f5-fa8b-48c2-a11e-6a8df3128807_1080x441.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6utL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441259f5-fa8b-48c2-a11e-6a8df3128807_1080x441.webp" width="1080" height="441" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/441259f5-fa8b-48c2-a11e-6a8df3128807_1080x441.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:441,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:16750,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6utL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441259f5-fa8b-48c2-a11e-6a8df3128807_1080x441.webp 424w, https://substackcdn.com/image/fetch/$s_!6utL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441259f5-fa8b-48c2-a11e-6a8df3128807_1080x441.webp 848w, https://substackcdn.com/image/fetch/$s_!6utL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441259f5-fa8b-48c2-a11e-6a8df3128807_1080x441.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!6utL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441259f5-fa8b-48c2-a11e-6a8df3128807_1080x441.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Overview of the real-time ad engagement event validation system (Source: Reddit)</figcaption></figure></div><p>Building and maintaining AEV, however, presented several challenges for the Reddit team.</p><div><hr></div><h4><strong>1st Challenge: Addressing High State Size Issues</strong></h4>
      <p>
          <a href="https://www.datatinkerer.io/p/scaling-apache-flink-how-reddit-cut-memory-usage-by-60-percent">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[ML Training Too Slow? Yelp’s 1,400x Speed Boost Fixes That]]></title><description><![CDATA[Discover the data pipeline and GPU optimisations that made it happen]]></description><link>https://www.datatinkerer.io/p/ml-training-too-slow-yelps-1400x-speed-boost-fixes-that</link><guid isPermaLink="false">https://www.datatinkerer.io/p/ml-training-too-slow-yelps-1400x-speed-boost-fixes-that</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Wed, 12 Feb 2025 05:46:32 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1730818876892-7e7c6ddf3dc6?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx5ZWxwfGVufDB8fHx8MTczOTMzODU5MHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1730818876892-7e7c6ddf3dc6?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx5ZWxwfGVufDB8fHx8MTczOTMzODU5MHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1730818876892-7e7c6ddf3dc6?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx5ZWxwfGVufDB8fHx8MTczOTMzODU5MHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1730818876892-7e7c6ddf3dc6?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx5ZWxwfGVufDB8fHx8MTczOTMzODU5MHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1730818876892-7e7c6ddf3dc6?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx5ZWxwfGVufDB8fHx8MTczOTMzODU5MHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, 
https://images.unsplash.com/photo-1730818876892-7e7c6ddf3dc6?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx5ZWxwfGVufDB8fHx8MTczOTMzODU5MHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1730818876892-7e7c6ddf3dc6?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx5ZWxwfGVufDB8fHx8MTczOTMzODU5MHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="6240" height="4160" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1730818876892-7e7c6ddf3dc6?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx5ZWxwfGVufDB8fHx8MTczOTMzODU5MHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:4160,&quot;width&quot;:6240,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;A cell phone sitting on top of a wooden table&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A cell phone sitting on top of a wooden table" title="A cell phone sitting on top of a wooden table" srcset="https://images.unsplash.com/photo-1730818876892-7e7c6ddf3dc6?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx5ZWxwfGVufDB8fHx8MTczOTMzODU5MHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1730818876892-7e7c6ddf3dc6?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx5ZWxwfGVufDB8fHx8MTczOTMzODU5MHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, 
https://images.unsplash.com/photo-1730818876892-7e7c6ddf3dc6?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx5ZWxwfGVufDB8fHx8MTczOTMzODU5MHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1730818876892-7e7c6ddf3dc6?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHx5ZWxwfGVufDB8fHx8MTczOTMzODU5MHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Photo by <a href="true">appshunter.io</a> on <a 
href="https://unsplash.com">Unsplash</a></figcaption></figure></div><h3>TL;DR</h3><div><hr></div><h4><strong>Situation</strong></h4><p>Yelp's ad revenue relies on predicting which ads users are likely to click on, using a model called "Wide and Deep Neural Network." Initially, training this model on 450 million data samples took 75 hours per cycle, which was too slow. Yelp wanted to handle 2 billion samples and reduce training time to under an hour per cycle.</p><h4><strong>Task</strong></h4><p>The goal was to speed up the training process by improving how data is stored and read, and by using multiple GPUs to handle more data at once.</p><h4><strong>Action</strong></h4><ul><li><p><strong>Data Storage</strong>: Yelp stored the training data in Parquet format on Amazon's S3 storage, which works well with their data processing system, Spark. They found that a tool called Petastorm was too slow for their needs, so they developed their own system called ArrowStreamServer. This new system reads and sends data more efficiently, reducing the time to process 9 million samples from over 13 minutes to about 19 seconds.</p></li><li><p><strong>Distributed Training</strong>: Yelp initially used a method called MirroredStrategy to train the model on multiple GPUs but found it didn't work well as they added more GPUs. They switched to a tool called Horovod, which allowed them to efficiently use up to 8 GPUs at once, significantly speeding up the training process.</p></li></ul><h4><strong>Result</strong></h4><p>By implementing these changes, Yelp achieved a total speed increase of about 1,400 times in their model training. 
This means they can now train their ad prediction models much faster, allowing them to handle more data and improve their ad services.</p><h4><strong>Use Cases</strong></h4><p>Large-Scale ML Training, ML Training Optimisation, Enhancing Data Pipeline Efficiency</p><h4><strong>Tech Stack/Framework</strong></h4><p>TensorFlow, Horovod, Keras, PyArrow, Amazon S3, Apache Spark</p><div><hr></div><h3>Explained Further</h3><div><hr></div><h4><strong>The Challenge</strong></h4>
      <p>
          <a href="https://www.datatinkerer.io/p/ml-training-too-slow-yelps-1400x-speed-boost-fixes-that">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Inside Meta's Data Flow Discovery]]></title><description><![CDATA[Discover How Meta Tracks Data Journeys to Safeguard User Privacy at Scale]]></description><link>https://www.datatinkerer.io/p/inside-metas-data-flow-discovery</link><guid isPermaLink="false">https://www.datatinkerer.io/p/inside-metas-data-flow-discovery</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Tue, 04 Feb 2025 23:39:49 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1533895328642-8035bacd565a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMnx8ZmFjZWJvb2t8ZW58MHx8fHwxNzM4NjQyMDg4fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1533895328642-8035bacd565a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMnx8ZmFjZWJvb2t8ZW58MHx8fHwxNzM4NjQyMDg4fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1533895328642-8035bacd565a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMnx8ZmFjZWJvb2t8ZW58MHx8fHwxNzM4NjQyMDg4fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1533895328642-8035bacd565a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMnx8ZmFjZWJvb2t8ZW58MHx8fHwxNzM4NjQyMDg4fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1533895328642-8035bacd565a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMnx8ZmFjZWJvb2t8ZW58MHx8fHwxNzM4NjQyMDg4fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, 
https://images.unsplash.com/photo-1533895328642-8035bacd565a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMnx8ZmFjZWJvb2t8ZW58MHx8fHwxNzM4NjQyMDg4fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1533895328642-8035bacd565a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMnx8ZmFjZWJvb2t8ZW58MHx8fHwxNzM4NjQyMDg4fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="5472" height="3648" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1533895328642-8035bacd565a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMnx8ZmFjZWJvb2t8ZW58MHx8fHwxNzM4NjQyMDg4fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3648,&quot;width&quot;:5472,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;television showing man using binoculars&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="television showing man using binoculars" title="television showing man using binoculars" srcset="https://images.unsplash.com/photo-1533895328642-8035bacd565a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMnx8ZmFjZWJvb2t8ZW58MHx8fHwxNzM4NjQyMDg4fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1533895328642-8035bacd565a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMnx8ZmFjZWJvb2t8ZW58MHx8fHwxNzM4NjQyMDg4fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, 
https://images.unsplash.com/photo-1533895328642-8035bacd565a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMnx8ZmFjZWJvb2t8ZW58MHx8fHwxNzM4NjQyMDg4fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1533895328642-8035bacd565a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwxMnx8ZmFjZWJvb2t8ZW58MHx8fHwxNzM4NjQyMDg4fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Photo by <a href="true">Glen Carrie</a> on <a 
href="https://unsplash.com">Unsplash</a></figcaption></figure></div><h3>TL;DR</h3><div><hr></div><h4><strong>Situation</strong></h4><p>Meta handles vast amounts of user data across its platforms, requiring strong privacy controls to protect sensitive information. A critical component of this effort is <strong>data lineage</strong>, which helps trace how data moves across different systems, ensuring compliance with privacy policies like <strong>purpose limitation</strong>.</p><h4><strong>Task</strong></h4><p>Meta needed a scalable and automated way to <strong>track data lineage</strong> across millions of assets, including databases, web services, and AI systems. This required moving beyond <strong>manual data flow documentation</strong> to a more robust, automated discovery process.</p><h4><strong>Action</strong></h4><ul><li><p><strong>Data Flow Collection</strong> &#8211; Used <strong>static code analysis, runtime instrumentation, and input/output matching</strong> to track data across stacks (Hack, C++, Python, SQL).</p></li><li><p><strong>Privacy Probes</strong> &#8211; Captured real-time <strong>runtime signals</strong>, identifying how and where sensitive data is logged, stored, or transformed.</p></li><li><p><strong>Automated Lineage Graphs</strong> &#8211; Created <strong>scalable data flow visualizations</strong> to streamline privacy control implementation.</p></li><li><p><strong>AI &amp; Data Warehouse Integration</strong> &#8211; Ensured <strong>end-to-end traceability</strong> across AI models, databases, and batch-processing systems.</p></li><li><p><strong>Iterative Filtering Tool</strong> &#8211; Allowed developers to <strong>refine lineage graphs</strong>, isolating relevant data flows and removing noise.</p></li></ul><h4><strong>Result</strong></h4><p>Meta&#8217;s data lineage system <strong>reduced engineering time, improved compliance accuracy, and automated privacy enforcement</strong>. 
It enabled developers to quickly identify and secure sensitive data flows while ensuring continuous monitoring at scale. These innovations enhanced user data protection across Meta&#8217;s ecosystem.</p><h4><strong>Use Cases</strong></h4><p>Privacy Enforcement, Compliance Monitoring, Data Lineage</p><h4><strong>Tech Stack/Framework</strong></h4><p>Python, SQL, C++, PyTorch, Presto, Spark</p><div><hr></div><h3>Explained Further</h3><div><hr></div><p>Meta's Privacy Aware Infrastructure (PAI) is designed to embed privacy controls within its systems, ensuring user data is handled responsibly. A foundational element of PAI is data lineage, which traces the journey of data across various platforms, providing a comprehensive view of its flow from collection to processing and storage. This capability is crucial for implementing privacy measures like purpose limitation, which restricts data usage to specific, intended purposes.</p><div><hr></div><h4><strong>Understanding Data Lineage at Meta</strong></h4><p>Data lineage involves mapping out how data moves through Meta's vast ecosystem, connecting source assets (e.g., database tables where data originates) to sink assets (e.g., tables or systems where data is stored or processed). 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cqvn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68826e8b-0ab0-472a-8287-e441f86a68a6_1536x326.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cqvn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68826e8b-0ab0-472a-8287-e441f86a68a6_1536x326.webp 424w, https://substackcdn.com/image/fetch/$s_!cqvn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68826e8b-0ab0-472a-8287-e441f86a68a6_1536x326.webp 848w, https://substackcdn.com/image/fetch/$s_!cqvn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68826e8b-0ab0-472a-8287-e441f86a68a6_1536x326.webp 1272w, https://substackcdn.com/image/fetch/$s_!cqvn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68826e8b-0ab0-472a-8287-e441f86a68a6_1536x326.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cqvn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68826e8b-0ab0-472a-8287-e441f86a68a6_1536x326.webp" width="1456" height="309" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68826e8b-0ab0-472a-8287-e441f86a68a6_1536x326.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:309,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36606,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cqvn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68826e8b-0ab0-472a-8287-e441f86a68a6_1536x326.webp 424w, https://substackcdn.com/image/fetch/$s_!cqvn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68826e8b-0ab0-472a-8287-e441f86a68a6_1536x326.webp 848w, https://substackcdn.com/image/fetch/$s_!cqvn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68826e8b-0ab0-472a-8287-e441f86a68a6_1536x326.webp 1272w, https://substackcdn.com/image/fetch/$s_!cqvn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68826e8b-0ab0-472a-8287-e441f86a68a6_1536x326.webp 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">PAI Workflow (Source: Meta)</figcaption></figure></div><p>This mapping is essential for:</p>
      <p>
          <a href="https://www.datatinkerer.io/p/inside-metas-data-flow-discovery">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Scaling Real-Time Analytics: How Expedia Cut Costs by 40% While Supporting 450+ Concurrent Users]]></title><description><![CDATA[Learn how the Optics Framework enabled seamless data insights with <15-second latency for global teams]]></description><link>https://www.datatinkerer.io/p/scaling-real-time-analytics-how-expedia-cut-costs-by-40-percent</link><guid isPermaLink="false">https://www.datatinkerer.io/p/scaling-real-time-analytics-how-expedia-cut-costs-by-40-percent</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Tue, 28 Jan 2025 22:20:41 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1660991473393-3612a4e49127?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxleHBlZGlhfGVufDB8fHx8MTczODAzOTc0M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1660991473393-3612a4e49127?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxleHBlZGlhfGVufDB8fHx8MTczODAzOTc0M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1660991473393-3612a4e49127?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxleHBlZGlhfGVufDB8fHx8MTczODAzOTc0M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1660991473393-3612a4e49127?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxleHBlZGlhfGVufDB8fHx8MTczODAzOTc0M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, 
https://images.unsplash.com/photo-1660991473393-3612a4e49127?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxleHBlZGlhfGVufDB8fHx8MTczODAzOTc0M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1660991473393-3612a4e49127?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxleHBlZGlhfGVufDB8fHx8MTczODAzOTc0M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1660991473393-3612a4e49127?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxleHBlZGlhfGVufDB8fHx8MTczODAzOTc0M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" width="5635" height="3757" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1660991473393-3612a4e49127?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxleHBlZGlhfGVufDB8fHx8MTczODAzOTc0M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3757,&quot;width&quot;:5635,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;a room with a large window and a couch and potted plants&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="a room with a large window and a couch and potted plants" title="a room with a large window and a couch and potted plants" srcset="https://images.unsplash.com/photo-1660991473393-3612a4e49127?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxleHBlZGlhfGVufDB8fHx8MTczODAzOTc0M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 424w, 
https://images.unsplash.com/photo-1660991473393-3612a4e49127?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxleHBlZGlhfGVufDB8fHx8MTczODAzOTc0M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1660991473393-3612a4e49127?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxleHBlZGlhfGVufDB8fHx8MTczODAzOTc0M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1660991473393-3612a4e49127?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHwzfHxleHBlZGlhfGVufDB8fHx8MTczODAzOTc0M3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Photo by Hotel Lal Garh Fort and Palace on <a href="https://unsplash.com">Unsplash</a></figcaption></figure></div><h3>TL;DR</h3><div><hr></div><h4><strong>Situation</strong></h4><p>Expedia Group needed a scalable, cost-effective real-time analytics solution (under 15 seconds of latency) to process high-volume data (~4,500 events/sec) and help global service partners optimize operations and improve performance.</p><h4><strong>Task</strong></h4><p>Design a solution that processes and presents real-time data with fast query speeds while addressing the limitations of existing tools (e.g., Snowflake and Looker) in scalability, latency, and user experience.</p><h4><strong>Action</strong></h4><p>The team developed a new architecture using Apache Druid for real-time ingestion, optimized microservices for data processing, and built a custom modular UI library with a Data Resolver API to deliver analytics tailored to user roles.</p><h4><strong>Result</strong></h4><p>The solution achieved a 5x increase in user base, a 30-40% reduction in costs, 15-second data latency, and 99.9% SLA uptime. It supported 1,800 users with sub-1-second response times, improving decision-making and operational efficiency globally.</p><h4><strong>Use Cases</strong></h4><p>Real-Time Insights, Operational Efficiency, Scalability for Concurrent Users</p><h4><strong>Tech Stack/Framework</strong></h4><p>Python, Apache Druid, Apache Hive, Apache Kafka, Looker, Snowflake</p><div><hr></div><h3>Explained Further</h3><div><hr></div><h4><strong>Real-Time Challenges for Expedia</strong></h4>
      <p>
          <a href="https://www.datatinkerer.io/p/scaling-real-time-analytics-how-expedia-cut-costs-by-40-percent">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Datadog Achieved 99% Timeout Reduction with 20x Scalability Boost]]></title><description><![CDATA[Discover the architecture that cut costs by 50% and unlocked massive scalability]]></description><link>https://www.datatinkerer.io/p/how-datadog-achieved-99-percent-timeout-reduction</link><guid isPermaLink="false">https://www.datatinkerer.io/p/how-datadog-achieved-99-percent-timeout-reduction</guid><dc:creator><![CDATA[Data Tinkerer]]></dc:creator><pubDate>Wed, 22 Jan 2025 00:10:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6aZh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F138a603f-9b77-47e2-897a-82c96d0ea454_3800x1930.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6aZh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F138a603f-9b77-47e2-897a-82c96d0ea454_3800x1930.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6aZh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F138a603f-9b77-47e2-897a-82c96d0ea454_3800x1930.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6aZh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F138a603f-9b77-47e2-897a-82c96d0ea454_3800x1930.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6aZh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F138a603f-9b77-47e2-897a-82c96d0ea454_3800x1930.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!6aZh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F138a603f-9b77-47e2-897a-82c96d0ea454_3800x1930.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6aZh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F138a603f-9b77-47e2-897a-82c96d0ea454_3800x1930.jpeg" width="1456" height="739" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/138a603f-9b77-47e2-897a-82c96d0ea454_3800x1930.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:739,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:351614,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6aZh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F138a603f-9b77-47e2-897a-82c96d0ea454_3800x1930.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6aZh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F138a603f-9b77-47e2-897a-82c96d0ea454_3800x1930.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6aZh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F138a603f-9b77-47e2-897a-82c96d0ea454_3800x1930.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!6aZh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F138a603f-9b77-47e2-897a-82c96d0ea454_3800x1930.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">(Source: Datadog)</figcaption></figure></div><h3>TL;DR</h3><div><hr></div><h4><strong>Situation</strong></h4><p>Datadog's time-series database, designed in 2016, struggled to manage a 30x growth in data volume and rising query complexity, resulting in slower performance and higher maintenance overhead.</p><h4><strong>Task</strong></h4><p>Develop a scalable indexing system to efficiently process high-cardinality data while improving query speed and reducing operational costs.</p><h4><strong>Action</strong></h4><p>The team implemented an inverted index inspired by search engines, mapping tags to time-series IDs. Using RocksDB for storage, they ensured scalability, reliability, and efficient query filtering.</p><h4><strong>Result</strong></h4><p>Query performance improved by 99%, enabling support for 20x higher cardinality metrics, reducing query timeouts, and cutting operational costs by nearly 50%.</p><h4><strong>Use Cases</strong></h4><p>Real-Time Monitoring, Tag-Based Filtering, Dynamic Schema Handling, Query Execution</p><h4><strong>Tech Stack/Framework</strong></h4><p>RocksDB, Apache Kafka, SQLite, Time-Series Database, Rust</p><div><hr></div><h3>Explained Further</h3><div><hr></div><h4><strong>Understanding the Problem</strong></h4><p>Datadog&#8217;s time-series database faced significant challenges as data volumes grew 30x between 2017 and 2022. The increasing complexity of user queries and higher data cardinality strained the existing indexing system, introduced in 2016. The original architecture became a bottleneck for query performance and required substantial maintenance.</p><div><hr></div><h4><strong>Metrics Platform Overview</strong></h4>
      <p>
          <a href="https://www.datatinkerer.io/p/how-datadog-achieved-99-percent-timeout-reduction">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>