How Notion Brought Order to Its Data Chaos (And Why Their First Catalog Failed)
A behind-the-scenes look at the real challenges, missed steps and what finally made their data catalog work.
Fellow Data Tinkerers!
Quick note before this week’s deep dive. Thanks for reading and subscribing, I really do mean it. If you’ve got feedback, just hit reply. I read every response.
Data Tinkerer has always been about sharing what actually works in data, beyond just tools and tech. The deep dives will keep coming but I want to start spotlighting the stuff we don’t talk about enough: the day-to-day work, the business outcomes, the challenges and the learnings.
I want to feature stories from people in data roles: senior data engineers, lead analysts, heads of data, you name it. If you’ve got a story, lesson, recent technical win or even a battle scar from the data trenches, let’s get it in front of almost 1,000 smart and engaged peers.
You don’t need to be a “writer”; I’ll help your story shine. Plus, guest contributors get a shoutout in the newsletter and on LinkedIn (if you want).
Keen to share your data story? Just reply to this email or message me on Substack and we’ll tee it up.
Now with that out of the way, let’s get to this week’s article on Notion’s data journey!
TL;DR
Situation
Notion’s data exploded as the company grew. More products, more features, more teams and data everywhere. Everything lived in a wild mix of JSON, with no structure, no ownership and no consistent way to find or trust analytics. Tribal knowledge ruled, governance was shaky and onboarding was painful. Something had to give.
Task
Build a robust, user-friendly data catalog that could handle hundreds of analytics tables, tame unstructured data, keep documentation up to date and actually get used by engineers and analysts.
Action
Engineers define every analytics event as a TypeScript type with ownership and descriptions, and nightly automation syncs schemas to Acryl DataHub so the catalog always matches the codebase.
LLMs draft documentation for tables lacking descriptions, but every AI-generated doc goes through human review for accuracy.
A dedicated review dashboard keeps all documentation up to date, with feedback loops that improve quality over time.
Result
Notion turned its data mess into a living, reliable and actually-used data catalog. Engineers and analysts can now trust documentation, onboarding is faster, stale docs are rare and the whole data ecosystem is in sync with what’s actually running in production. Manual doc writing is now the exception, not the rule.
Use Cases
Data discovery, data cataloguing, data onboarding, data validation
Tech Stack/Framework
Acryl DataHub, TypeScript, Amazon S3, LLM
Explained Further
Introduction: from chaos to catalog
Over the past few years, Notion’s internal data landscape exploded. More products, more features and more teams generating a tidal wave of new data. With that surge came a burning need: a robust, user-friendly data catalog. This is the inside story of how the Notion team wrangled a mess of unstructured data, why some solutions fell flat and what actually worked as the system matured.
This journey unfolds in three acts:
Early chaos: where anything goes and very little makes sense
Building a foundation: introducing a catalog but realizing it’s not enough
Rethinking the system: designing for real engagement and sustainable growth (that’s the juicy part of the article)
Now, without further ado, let’s get to it.
Act one: life in the data wild west
Notion’s early approach was about speed. Data grew organically, mostly in unstructured formats like JSON, and with tools like Amplitude the team could quickly integrate and analyze it. No one worried much about naming conventions, event schemas or who actually owned which data. Developers, data scientists and product managers each wanted different things, so everyone optimized for getting what they needed, fast.
But those choices came with a cost:
Tribal knowledge ruled: If you knew, you knew. If you didn’t, you were out of luck.
No clear criticality: All data looked the same; nothing was flagged as mission-critical or just nice to have.
Unclear ownership: One small change could break things for someone else and no one would know until it was too late.
Poor discoverability: Teams were reinventing wheels because they couldn’t find what already existed.
A tangle of sources: Warehouses, streams, lakes and operational stores all spoke different dialects.
This chaos couldn’t last. As Notion scaled, they hit the limits of speed-over-order. Product teams couldn’t trust analytics, governance issues were piling up and onboarding new people was painful.
Notion needed order and they needed it fast.
Act two: building the first real foundation
The first shot at fixing this was to bring in a proper catalog: Acryl DataHub. Notion wired it up to the data warehouse to surface table names and schemas. They layered in event tiering and an ownership model so at least someone was responsible for high-priority assets.
But for all the technical progress, adoption lagged. Engineers and analysts weren’t flocking to the new catalog. Engagement stayed low.
What went wrong? Notion realized that technical integration doesn’t magically create user engagement. The next challenge was to make the catalog truly usable, not just technically sound.
Act three: rethinking for real engagement
After some digging, Notion’s engineers pinpointed three big reasons why the catalog wasn’t clicking:
Unstructured data everywhere. Too much source data still lived in those sprawling JSON blobs with no fixed schema, making it hard to present cleanly in the catalog.
Metadata was missing or stale. Many tables had little or no metadata, no descriptions and no context. Even where there were descriptions, business logic would evolve and leave them stale.
Descriptions didn’t propagate. Even if someone took the time to document a table, that information often didn’t follow the data to downstream tools or derived tables.
Clearly, fixing these issues would mean changing both how data was created and how it was described and surfaced to users.
Decisions that actually moved the needle
The team needed a single, definitive source of truth for what each analytics event or table looked like. That meant picking an Interface Definition Language (IDL) that could serve both code and catalog needs.
Design decision 1: TypeScript as the IDL
Instead of reaching for industry standards like Avro or JSON Schema, Notion went all-in on TypeScript for the IDL. Why?
Already embedded. Notion’s codebase had tons of TypeScript type definitions. No need to reinvent the wheel.
Type safety. Advanced TypeScript features made it easy to enforce strict model definitions.
Engineer-friendly. Most Notion engineers already used TypeScript, keeping learning curves to a minimum.
The bonus: TypeScript definitions could also be used to generate types for other languages (Swift, Kotlin, Python) that Notion uses across different platforms.
Design decision 2: JSON schema for catalog compatibility
Of course, most data catalog tools don’t speak TypeScript natively. So the team set up a translation pipeline: convert TypeScript types into JSON Schema, then import that into the data catalog. This approach checked three boxes:
Natural mapping. TypeScript object shapes closely match JSON Schema, making the translation straightforward.
Fast to implement. Existing libraries (like ts-json-schema-generator) made this a plug-and-play solution.
Future-proof. JSON is a lingua franca for downstream systems, not just TypeScript or JavaScript.
By combining TypeScript for development and JSON Schema for cataloging, Notion could finally bring unstructured data under control.
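To make that mapping concrete, here’s a small, hypothetical example. The event type and its properties are invented for illustration, and the JSON Schema in the comment is roughly what a tool like ts-json-schema-generator produces for a shape like this (trimmed for readability).

```typescript
// Hypothetical event type: a TypeScript object shape maps naturally onto JSON Schema.
type PageViewedEvent = {
  event_name: "page_viewed"; // string literal -> "const" in JSON Schema
  page_id: string;
  is_public: boolean;
  referrer?: string; // optional in TypeScript -> left out of "required"
};

/* Roughly the JSON Schema this compiles to (trimmed):
{
  "type": "object",
  "properties": {
    "event_name": { "type": "string", "const": "page_viewed" },
    "page_id": { "type": "string" },
    "is_public": { "type": "boolean" },
    "referrer": { "type": "string" }
  },
  "required": ["event_name", "page_id", "is_public"],
  "additionalProperties": false
}
*/
```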
But with several hundred analytics tables, manually writing and updating descriptions would never scale. Notion needed a system to automate description generation and ensure every table was discoverable and documented.
Design decision 3: AI-generated metadata, human-reviewed
Notion decided to use generative AI to draft table and column descriptions, but always with a human reviewer before anything got published.
Compile all the metadata. Pull together everything: table contents, context, even why this data exists.
Automate description generation. Use AI to draft initial descriptions, factoring in all available metadata and table lineage.
Human-in-the-loop review. Data owners review new drafts, flag any issues and feedback is automatically folded into future runs.
This closed the loop: less grunt work and fewer stale docs, but still plenty of human oversight. So how was this actually implemented?
Generating schemas from unstructured JSON data
Alright, let’s break down how they actually got from JSON to something engineers and analysts could trust. To make it concrete, let’s use the create block analytics event, a classic example that fires every time someone creates a new block in Notion.
Before they started cleaning things up, this was just a giant JSON blob dumped straight into the warehouse. If you wanted to know what the payload looked like, good luck.
So, how’d they fix it? Enter the three-phase schema process. Think of it as taking the messy event, running it through the transformation process and ending up with something you can actually use downstream.
Step 1: engineer creates types
The first step happens at the source. Whenever a product engineer wants to log a new event, they fire up an internal tool. This tool doesn’t just let them slap together whatever JSON they want. It actually forces them to enter the event name, a description of what it does and which team owns it. Out the other side pops a TypeScript type, scaffolded and ready to go.
But we’re not done yet. The engineer is then on the hook to fill in a real description for every property, because their future self or someone else downstream will not remember the context. This all gets checked into source control, so they have a running registry of all events, what they mean, who owns them and which ones are actually important (P0, P1, etc.).
This is a big step up from the old days, when all fields were optional and nobody enforced anything. Now, if you forget a required property or mix up your types, you’ll get flagged before anything hits production.
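As a rough illustration, here’s what a scaffolded type for the create block event might look like. The field names, owner and tier below are made up for the example; Notion hasn’t published its actual event schema.

```typescript
/**
 * Fired whenever a user creates a new block.
 * Hypothetical metadata for illustration:
 *   owner: blocks-team
 *   tier: P0 (mission-critical)
 */
export type CreateBlockEvent = {
  /** Stable event name used in the warehouse and the catalog. */
  event_name: "create_block";
  /** ID of the block that was created. */
  block_id: string;
  /** Kind of block, e.g. "text", "heading", "toggle". */
  block_type: string;
  /** Workspace the block was created in. */
  workspace_id: string;
  /** Client that produced the event. */
  platform: "web" | "desktop" | "ios" | "android";
  /** Creation time, in milliseconds since epoch. */
  created_at: number;
};
```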
Step 2: convert TypeScript to JSON
So now that they’ve got their nicely typed event, the next step is to make those schemas available to everything else and not just the codebase. Once a day, automation kicks in and reads all the TypeScript event types. They run these through ts-json-schema-generator (an npm library) and what comes out is a proper JSON Schema for each event.
These JSON Schemas get dumped into S3. Why S3? Version history. If something blows up or they need to track down when a property got added, it’s all there, neatly versioned.
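Here’s a minimal sketch of what that nightly job could look like, assuming the programmatic API of ts-json-schema-generator and the AWS SDK v3 S3 client; the file paths and bucket name are placeholders, not Notion’s actual setup.

```typescript
import { createGenerator } from "ts-json-schema-generator";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

async function publishEventSchemas(): Promise<void> {
  // Compile every exported analytics event type into JSON Schema.
  const config = {
    path: "src/analytics/events/*.ts", // placeholder: where the event types live
    tsconfig: "tsconfig.json",
    type: "*", // generate schemas for all exported types
  };
  const schema = createGenerator(config).createSchema(config.type);

  // Write the result to a versioned S3 bucket so schema changes stay auditable.
  const s3 = new S3Client({});
  await s3.send(
    new PutObjectCommand({
      Bucket: "analytics-event-schemas", // placeholder bucket with versioning enabled
      Key: "event-schemas.json",
      Body: JSON.stringify(schema, null, 2),
      ContentType: "application/json",
    })
  );
}

publishEventSchemas().catch((err) => {
  console.error("Failed to publish event schemas", err);
  process.exit(1);
});
```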
Step 3: uploading JSON schemas to the data catalog
The final leg of the relay is getting those JSON schemas into Acryl’s DataHub so analysts and engineers can actually find and use them. Another automated job scoops up the latest schemas from S3 and writes them into DataHub using the Acryl SDK. Everything in source control is what shows up in the catalog, every single day.
By piping TypeScript types straight through to the catalog, they actually keep the system in sync. No more catalog drift. Plus, engineers like it because it’s all native to their workflow. No learning some random new tool just to document an event. It’s still deeply tied to the codebase, so as the product evolves, so do the schemas and docs.
At this point, they set up schema hydration for anything coming out of engineering. But once the data team starts spinning up new tables or stripping down old ones (like the block table minus user content), you end up with missing descriptions. That’s where they let AI do the heavy lifting.
Filling the gaps with AI-generated descriptions
With recent jumps in LLM capabilities, they can now give the model a boatload of context and get something pretty decent back. Most of the metadata is already text-based and lives in code or SQL: structured enough for the model to make sense of, but flexible enough to describe weird edge cases.
The description generation process
Here is how the description generation process works in practice:
Round up the context. Start by collecting every scrap of relevant metadata, such as SQL models, macro code, any old table or column descriptions, the JSON Schemas, internal documentation and feedback from past reviews. If it sheds light on the table, it’s in. And they do this for the table itself and its direct upstream sources.
Hand it off to the LLM. All that metadata gets packed into a prompt and fired off to the model (see the sketch after this list). The LLM tries its best to draft a solid description with as much context as they can throw at it.
Send to the owner for review. Once there’s a draft, the table or data owner gets pinged to take a look. They review, suggest tweaks or kick it back if something’s off. Their feedback gets rolled back into the process, so every round the docs get a bit sharper.
Sync it everywhere. Once a description actually passes review, it gets published to the catalog and pushed out to every place people go for data: the warehouse, BI tools, the works.
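Here’s a hypothetical sketch of the first two steps above (rounding up the context and handing it to the LLM), assuming an OpenAI-style client; the metadata fields, prompt wording and model name are placeholders rather than Notion’s actual pipeline.

```typescript
import OpenAI from "openai";

// Placeholder shape for the context gathered in step 1.
type TableContext = {
  tableName: string;
  sqlModel: string;               // SQL that builds the table
  upstreamDescriptions: string[]; // docs from direct upstream sources
  jsonSchema: string;             // schema pulled from S3 / the catalog
  pastReviewFeedback: string[];   // reviewer comments from earlier rounds
};

async function draftTableDescription(ctx: TableContext): Promise<string> {
  const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

  // Step 1: pack every scrap of context into one prompt.
  const prompt = [
    `Table: ${ctx.tableName}`,
    `SQL model:\n${ctx.sqlModel}`,
    `Upstream descriptions:\n${ctx.upstreamDescriptions.join("\n")}`,
    `JSON Schema:\n${ctx.jsonSchema}`,
    `Feedback on previous drafts:\n${ctx.pastReviewFeedback.join("\n")}`,
  ].join("\n\n");

  // Step 2: ask the model for a draft description.
  const completion = await client.chat.completions.create({
    model: "gpt-4o", // placeholder model name
    messages: [
      {
        role: "system",
        content:
          "Draft a concise, accurate data catalog description for this analytics table. " +
          "Only state what the provided metadata supports.",
      },
      { role: "user", content: prompt },
    ],
  });

  // The draft still goes to the table owner for review before anything is published.
  return completion.choices[0].message.content ?? "";
}
```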
Human-in-the-loop review and audit log
The AI is good but it’s not perfect. Some descriptions are off or the model hallucinates. So they built a dashboard to flag every AI-generated description for a real human to check. The owner reviews, approves, or kicks it back with comments. Their feedback goes right back into the prompt for next time and every outcome is logged so they can improve over time.
Once a description passes review, it’s synced into DataHub and made available everywhere: SQL repo, warehouse docs, BI tools, you name it. This has taken a ton of manual work off their plate and made onboarding way less painful for new datasets.
Lessons learned
After all this effort, a few big takeaways stood out for the Notion team.
1. User-first, always
The best technical solution is worthless if users don’t engage. Notion met people where they already worked (using TypeScript, plugging into existing engineering workflows) and delivered metadata to the places they already checked (data science tools, dashboards).
2. Fix the source first
With data growth accelerating, it was essential to start structuring things right where they began, in the TypeScript codebase. Backfilling old tables could wait; the priority was making new data structured and discoverable from the get-go.
3. Don’t skip human oversight
No matter how slick the automation, there’s no substitute for a real person checking what goes into the catalog. Human review not only catches mistakes but also builds trust in the system’s accuracy.
The full scoop
To learn more about this, check out Notion’s engineering blog post on this topic.
Again, if you’re interested in doing a guest post, just reply to this email or message me on Substack and we’ll tee something up.
Keep learning
How Canva Rebuilt Its Data Pipelines for Billions of Events per Month
Canva had to track billions of events to pay creators fairly and their old system couldn’t keep up. Curious how they rebuilt it? This article is for you.
How Airtable Made Archive Validation Work at Petabyte Scale
Learn how Airtable validated 1PB of archived data using StarRocks, hashes and joins without hitting production or sacrificing accuracy.