How Shopify Scales Taxonomy Evolution Across 10,000+ Categories With Multi-Agent AI
From reactive manual curation to continuous taxonomy evolution grounded in merchant reality.
Fellow Data Tinkerers!
Today we will look at how Shopify scales its product categorisation using agentic AI.
TL;DR
Situation
Shopify’s product classification system makes tens of millions of predictions daily, across a taxonomy with 10,000+ categories and 2,000+ attributes. Commerce changes fast, so the taxonomy has to keep up or the whole stack starts drifting.
Task
Keep the taxonomy current at scale without relying on slow, reactive, manual curation. Fix volume, expertise and consistency problems before they hit merchants, customers and model quality.
Action
Built an AI multi-agent system: structural analysis + product-driven analysis, then intelligent synthesis. Added equivalence detection (category = broader category + attribute filters) plus automated QA via domain-specific AI judges.
Result
Taxonomy branches can be analyzed in parallel: hundreds of categories instead of a few per day. Quality improved via grounded merchant data + structural consistency, with judges filtering proposals (example: “MagSafe compatible” approved at 93% confidence).
Use Cases
Category discovery, attribute gap detection, taxonomy maintenance, search and filtering improvement
Tech Stack/Framework
AI agent, equivalence detection, multi-agent system
Explained further
Context
Last year, over 875 million people bought items from Shopify merchants. Shopify already runs a product classification system that makes tens of millions of predictions daily with a high degree of accuracy.
But classification is the easy part compared to the thing underneath it: taxonomy. Because the model doesn’t just need to be right, it also needs a clean, consistent set of labels to be right about.
That’s the challenge for Shopify: once you have 10,000+ categories and 2,000+ attributes, the taxonomy becomes its own product with its own failure modes. It can get stale. It can get inconsistent. It can drift away from how merchants actually describe products. And when that happens, the classifier quality takes the blame for what is basically a taxonomy debt problem.
So this post is about what Shopify did next: they built an AI multi-agent system that doesn’t just classify products, it actively improves the taxonomy labels themselves so the system stays agile as commerce changes.
The challenge: scaling taxonomy without losing accuracy
A taxonomy is a contract between three groups that rarely agree:
Merchants describing products the way they think about them
Customers searching and filtering with their own mental model
Platform systems trying to enforce structure so everything stays queryable and comparable
Now add the reality that commerce never sits still. New products appear. Old categories split. Entire verticals get reshaped by trends, tech and regulation. The taxonomy has to keep up or the platform drifts away from how people actually shop and sell.
Shopify frames the challenge as three problems.
The volume problem: manual updates can’t keep up
A global product taxonomy needs constant attention. Every new product type, emerging technology category and seasonal trend potentially triggers taxonomy updates.
Manual curation becomes a bottleneck because taxonomy work is not one change. It is usually a bundle: a category addition, a hierarchy decision, a set of attributes, naming alignment and a check for duplicates or conflicts.
For example, consider the emergence of areas like smart home devices or remote work equipment. Each one brings not just new categories but also entirely new attribute sets.
Smart home devices for instance need connectivity types, power requirements and compatibility. Those are specs that did not exist in the taxonomy before.
So the work isn’t a one-off. It’s continuous expansion and adjustment across a giant tree of concepts.
The expertise problem: every vertical has its own rules
Good taxonomy design is domain-heavy. You do not get it right by being generally smart. You get it right by knowing what matters in that product world. For example, it takes real domain knowledge to tell types of guitar pickups apart or to choose appropriate attributes for skincare products.
A taxonomy team can’t realistically maintain deep expertise across every vertical that merchants sell into. But if the taxonomy is inconsistent or poorly structured, merchants pay for it through reduced discoverability, suboptimal search results and ineffective filters for customers.
The consistency problem: one concept, five different labels
As the taxonomy grows organically, inconsistencies creep in:
similar concepts represented differently across categories
inconsistent naming conventions
discrepancies between merchant categorization and customer expectations
Those inconsistencies compound. Merchants get confused when listing. Customers get frustrated when filtering and comparing. And the classifier quality drops because labels stop being reliably meaningful across the tree.
This is the part most teams underestimate. In a taxonomy, small inconsistencies behave like small data quality issues: they don’t stay small.
From manual taxonomy work to agent-led evolution
Shopify’s taxonomy management evolved from a manual workflow into an AI-driven system.
The old way: Expert review, slow throughput
The traditional pattern is familiar:
domain experts analyze product data
identify gaps or inconsistencies
propose changes
implement changes via careful review
It ensures quality but it also creates bottlenecks.
The biggest problem was the reactive nature of it: Shopify would only recognize the need for new categories or attributes after merchants began listing products that didn’t fit. By then, the system had already missed chances to give merchants and customers a better experience.
So even when you do great manual work, you’re always late.
The breakthrough: Two lenses, one system
Advanced language models opened a door: not to replace human experts, but to augment them with scale and consistency.
The key insight was that taxonomy improvement comes from two different angles:
structural analysis: the logical structure of the taxonomy, gaps in hierarchies, missing relationships
product-driven analysis: what real product data says merchants actually sell and how they describe it
Each angle catches different issues. Shopify’s breakthrough was combining them into a system that can continuously propose improvements then filter them through quality checks before human review.
Inside the system: How the agents work
The new architecture rests on three principles:
specialized analysis
intelligent coordination
quality assurance
And the intent is clear: continuous evolution, not one-time taxonomy construction.
What’s different: continuous evolution, not one-time creation
AI’s been used for product categorisation and one-off taxonomy builds for a while. The difference here is that instead of building it once and hoping it holds, Shopify uses specialised AI agents to keep the taxonomy evolving continuously. There are 3 core components to this approach:
1- Real product grounding: The system integrates actual merchant product data so proposals reflect how merchants describe and categorize products. This keeps decisions grounded in commerce reality rather than only theory.
In other words: if merchants are consistently describing a differentiator, it probably belongs in the taxonomy, even if it offends someone’s idea of a “pure” category tree.
2- Multi-agent specialization: Multiple specialized agents run different analyses. One focuses on structural consistency. Another focuses on product-driven insights. Then those outputs are synthesized. The claim here is that the combination finds improvements that neither agent would find alone.
That makes sense structurally. Taxonomy is both a graph problem and a language problem.
3- Sophisticated equivalence discovery: This is the most interesting component: detecting equivalence relationships where a specific category equals a broader category filtered by attribute values.
This matters because merchants should be able to organize their catalogs however they want, while the platform still understands what products ‘mean’ underneath the merchant’s choices.
So instead of forcing everyone into one rigid structure, Shopify tries to learn mappings that preserve flexibility and still support search, recommendations, and analytics.
Architecture flow
The AI agent workflow works like this:
enable agents to explore the taxonomy
run multi-stage analysis (structural + product-driven)
synthesize and resolve conflicts
detect equivalences
run automated QA using judges
send refined proposals to humans
update the taxonomy in production
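The steps above can be sketched as an ordered list of stages that each transform a shared state. This is only an illustration: every function name, the state layout, the 0.8 threshold and the placeholder proposals are assumptions made for the example, and the real agents are LLM-backed rather than stubs.

```python
from typing import Callable

State = dict
Stage = Callable[[State], State]

APPROVAL_THRESHOLD = 0.8  # hypothetical cut-off, not a figure from Shopify


def explore(state: State) -> State:
    # Stand-in for agents reading the taxonomy around the target branch.
    state["context"] = f"taxonomy neighborhood of {state['branch']}"
    return state


def analyze(state: State) -> State:
    # Stand-in for the structural and product-driven agents.
    # "score" simulates the confidence a judge would later attach.
    state["proposals"] = [
        {"change": "add attribute 'MagSafe compatible'", "score": 0.93},
        {"change": "placeholder weak proposal", "score": 0.40},
    ]
    return state


def synthesize(state: State) -> State:
    # Drop exact duplicates while keeping first-seen order.
    unique = {}
    for proposal in state["proposals"]:
        unique.setdefault(proposal["change"], proposal)
    state["proposals"] = list(unique.values())
    return state


def detect_equivalences(state: State) -> State:
    state["equivalences"] = []  # filled by the equivalence agent in a real system
    return state


def judge(state: State) -> State:
    state["proposals"] = [
        p for p in state["proposals"] if p["score"] >= APPROVAL_THRESHOLD
    ]
    return state


def queue_for_humans(state: State) -> State:
    # Nothing reaches production here; humans make the final call.
    state["awaiting_human_review"] = list(state["proposals"])
    return state


STAGES: list[tuple[str, Stage]] = [
    ("explore", explore),
    ("analyze", analyze),
    ("synthesize", synthesize),
    ("detect_equivalences", detect_equivalences),
    ("judge", judge),
    ("queue_for_humans", queue_for_humans),
]


def run_pipeline(branch: str) -> State:
    state: State = {"branch": branch, "trace": []}
    for name, stage in STAGES:
        state = stage(state)
        state["trace"].append(name)
    return state
```

Calling run_pipeline("Electronics > Communications > Telephony") leaves only the proposal that clears the threshold in the human review queue. The production update is deliberately left out of the sketch because it happens after human sign-off.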
Enabling agent-taxonomy interaction
Before agents can improve anything, they need to ‘read’ the taxonomy like a human would.
Shopify implemented a system that allows agents to:
search for related categories
examine hierarchical relationships
verify whether proposed changes conflict with existing elements
A good example: an agent analyzing guitar-related categories can explore the full musical instruments hierarchy, inspect related attributes across instruments and look for patterns that suggest better structure.
In other words, the agent doesn’t just look at one node. It roams the neighborhood.
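A minimal sketch of what those three capabilities could look like as tools, assuming the taxonomy is stored as a simple child-to-parent map. The class, its method names and the sample categories are hypothetical; Shopify has not published the actual interface.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Taxonomy:
    # Maps each category name to its parent (None for a root). Assumed layout.
    parents: dict = field(default_factory=dict)

    def add(self, name: str, parent: Optional[str] = None) -> None:
        self.parents[name] = parent

    def search(self, term: str) -> list:
        """Find categories whose name contains the term (case-insensitive)."""
        term = term.lower()
        return sorted(n for n in self.parents if term in n.lower())

    def ancestors(self, name: str) -> list:
        """Walk up the hierarchy from a category to its root."""
        chain = []
        parent = self.parents[name]
        while parent is not None:
            chain.append(parent)
            parent = self.parents[parent]
        return chain

    def children(self, name: str) -> list:
        return sorted(n for n, p in self.parents.items() if p == name)

    def conflicts_with(self, proposed_name: str) -> list:
        """Return existing categories that clash with a proposed name."""
        key = proposed_name.strip().lower()
        return [n for n in self.parents if n.lower() == key]


# Illustrative data only.
taxonomy = Taxonomy()
taxonomy.add("Musical Instruments")
taxonomy.add("Guitars", "Musical Instruments")
taxonomy.add("Electric Guitars", "Guitars")
taxonomy.add("Guitar Pickups", "Guitars")
```

With this in place, an agent looking at "Electric Guitars" can call ancestors() to see the full path up to "Musical Instruments", children() to inspect siblings, and conflicts_with() before proposing a new node.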
The pipeline: specialised agents, staged decisions
For the system to work properly, different specialised agents each provide specific insights:
Structural analysis: This agent looks at the taxonomy itself for logical consistency, completeness, gaps in category hierarchies, naming convention inconsistencies and opportunities to reorganize related concepts.
It operates purely on the taxonomy structure and aims to keep the whole thing coherent.
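One structural check that needs no product data is spotting the same concept named in slightly different ways. The sketch below is a rule-based stand-in for what the agent does with an LLM; the normalization rules, the " > " path separator and the sample paths are assumptions for illustration.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Reduce a category name to a comparable form."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    return " ".join(name.split())


def find_naming_collisions(category_paths: list) -> dict:
    """Group category paths whose leaf names normalize to the same string."""
    groups = defaultdict(list)
    for path in category_paths:
        leaf = path.split(" > ")[-1]
        groups[normalize(leaf)].append(path)
    return {key: paths for key, paths in groups.items() if len(paths) > 1}


# Made-up paths for the example.
paths = [
    "Electronics > Cables & Adapters",
    "Computers > Cables and Adapters",
    "Electronics > Chargers",
]
collisions = find_naming_collisions(paths)
```

Here the two "Cables" categories land in the same group, which is the kind of candidate a structural agent would flag for alignment or merging.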
Product-driven analysis: This agent integrates real merchant data and examines how products are described and categorized on the platform.
Specifically, it looks at patterns in product titles, product descriptions and merchant-defined categories. The goal is to find gaps between how merchants think about products and how the taxonomy represents them.
This is an important distinction. A taxonomy can be structurally perfect and still be useless if it doesn’t match merchant reality.
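A rough sketch of the gap-finding idea: measure how often a phrase shows up in product titles within a category and flag it if no existing attribute covers it. The candidate phrases are passed in by hand here (in the real system an LLM would surface them), and the titles, attribute names and 30% threshold are all invented for the example.

```python
from collections import Counter


def attribute_gap_candidates(titles, candidate_phrases, existing_attributes,
                             min_share=0.3):
    """Return phrases that appear in at least min_share of titles
    and are not already covered by an existing attribute name."""
    existing = {a.lower() for a in existing_attributes}
    counts = Counter()
    for title in titles:
        lowered = title.lower()
        for phrase in candidate_phrases:
            if phrase.lower() in lowered:
                counts[phrase] += 1
    total = len(titles) or 1  # avoid dividing by zero on an empty sample
    return {
        phrase: count / total
        for phrase, count in counts.items()
        if phrase.lower() not in existing and count / total >= min_share
    }


# Made-up titles for illustration.
titles = [
    "MagSafe Wallet for iPhone",
    "Leather Case with MagSafe",
    "USB-C Cable 2m",
    "MagSafe Charger Stand",
]
gaps = attribute_gap_candidates(
    titles,
    candidate_phrases=["MagSafe", "USB-C"],
    existing_attributes=["Color", "Material"],
)
```

In this toy sample "MagSafe" appears in three of four titles and clears the threshold, while "USB-C" does not. Simple substring counts are only a proxy; they miss synonyms and misspellings that a language model would catch.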
Intelligent synthesis: Now we have two streams of recommendations:
structure-driven improvements
product-driven improvements
They can conflict. They can overlap. They can propose redundant changes.
The synthesis step merges insights, resolves conflicts, and eliminates redundancies. And sometimes the best answer is not to pick one but to combine both.
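The merging and de-duplication part can be sketched as follows, assuming each proposal is a dict with a "kind" and a "target". That proposal shape is hypothetical, and the sketch only handles overlap and ranking; genuine conflict resolution would need judgment that a few lines of code cannot capture.

```python
def synthesize(structural, product_driven):
    """Merge two proposal streams, collapsing duplicates and
    ranking proposals that both agents agree on first."""
    merged = {}
    streams = (("structural", structural), ("product", product_driven))
    for source, proposals in streams:
        for proposal in proposals:
            key = (proposal["kind"], proposal["target"].lower())
            entry = merged.setdefault(
                key,
                {"kind": proposal["kind"], "target": proposal["target"], "sources": []},
            )
            entry["sources"].append(source)
    # More supporting sources first, then alphabetical for stable output.
    return sorted(merged.values(), key=lambda e: (-len(e["sources"]), e["target"]))


# Illustrative inputs only.
structural = [
    {"kind": "add_attribute", "target": "Connectivity Type"},
    {"kind": "rename_category", "target": "Cables and Adapters"},
]
product_driven = [
    {"kind": "add_attribute", "target": "connectivity type"},
    {"kind": "add_attribute", "target": "MagSafe compatible"},
]
combined = synthesize(structural, product_driven)
```

The "Connectivity Type" proposal appears once in the output with both sources attached, which gives reviewers a quick signal that structure and merchant data point the same way.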
Equivalence detection: This agent solves a practical commerce problem: merchants want flexibility but platform systems need consistency.
Consider golf shoes:
Merchant A uses a specific ‘Golf Shoes’ category
Merchant B uses ‘Athletic Shoes’ with an ‘Activity Type = Golf’ attribute
Both are valid for the merchant. But search, recommendations and analytics benefit from understanding these represent the same product set.
So the system detects attribute-based equivalences of the form:
specific category = broader category + one or more attribute filters
This lets merchants organize however makes sense for their business while keeping platform intelligence consistent across different catalog structures.
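To make the rule concrete, here is a small sketch that rewrites a product's category and attributes into a canonical form using known equivalences. The Equivalence type and the canonical_form helper are made up for illustration; only the golf shoes example comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Equivalence:
    specific: str   # e.g. "Golf Shoes"
    broader: str    # e.g. "Athletic Shoes"
    filters: tuple  # ((attribute, value), ...) that narrow the broader category


def canonical_form(category, attributes, equivalences):
    """Express a listing as (broader category, attribute set) when an
    equivalence applies, so different catalog structures compare equal."""
    for eq in equivalences:
        if category == eq.specific:
            merged = dict(attributes)
            merged.update(dict(eq.filters))
            return eq.broader, frozenset(merged.items())
    return category, frozenset(attributes.items())


golf = Equivalence(
    specific="Golf Shoes",
    broader="Athletic Shoes",
    filters=(("Activity Type", "Golf"),),
)

merchant_a = canonical_form("Golf Shoes", {}, [golf])
merchant_b = canonical_form("Athletic Shoes", {"Activity Type": "Golf"}, [golf])
```

Both merchants end up with the same canonical representation, so downstream search or analytics can treat their listings as one product set without forcing either merchant to restructure their catalog.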
If you’ve ever tried to do cross-merchant analytics at scale, you can probably feel why Shopify cared enough to build an entire agent for this.
Automated QA: judges before humans
After proposals are generated, Shopify adds automated QA through specialized AI judges.
These judges evaluate proposed changes using reasoning capabilities and taxonomy design principles to filter and refine suggestions before human review.
The important detail is that evaluation differs by change type:
adding new attributes
creating category hierarchies
modifying existing structures
Different changes require different criteria, so one generic ‘judge prompt’ would be weak. Instead, they use domain-specific judges.
An electronics-focused judge applies electronics expertise. A musical instruments judge applies that domain’s patterns and rules. The goal is consistent domain-aware evaluation across verticals.
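One plausible way to wire this up is a small router that picks a judge from the top-level category and the change type, falling back to a generic judge when no domain match exists. Everything here is an assumption for the sake of the sketch: the change type identifiers, the guidance strings and the fallback behaviour are not described in Shopify's write-up.

```python
from dataclasses import dataclass

# Identifiers invented to mirror the three change types listed above.
CHANGE_TYPES = ("add_attribute", "create_hierarchy", "modify_structure")

# Hypothetical per-domain guidance that would be fed into a judge prompt.
DOMAIN_GUIDANCE = {
    "electronics": "Placeholder guidance for the electronics domain.",
    "musical_instruments": "Placeholder guidance for the musical instruments domain.",
}


@dataclass(frozen=True)
class Judge:
    domain: str
    change_type: str
    guidance: str


def select_judge(category_path: str, change_type: str) -> Judge:
    """Pick a judge based on the root of the category path and the change type."""
    if change_type not in CHANGE_TYPES:
        raise ValueError(f"unknown change type: {change_type}")
    root = category_path.split(" > ")[0].strip().lower().replace(" ", "_")
    guidance = DOMAIN_GUIDANCE.get(root)
    if guidance is None:
        return Judge("generic", change_type, "Apply general taxonomy design principles.")
    return Judge(root, change_type, guidance)
```

Keeping the change type on the judge means the same domain can apply different criteria to an attribute addition than to a hierarchy change.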
Results
The system can analyze taxonomy branches in parallel, identifying improvement opportunities that used to take weeks of manual work.
Where experts might analyze a few categories per day, the system can evaluate hundreds of categories, checking both:
structural consistency
alignment with real product data
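The parallelism itself is straightforward to sketch with the standard library, assuming each branch can be analyzed independently. The analyze_branch function below is a trivial stand-in that just counts categories; in practice it would wrap the agent calls, which are mostly I/O-bound and therefore a reasonable fit for threads.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_branch(item):
    """Stand-in for a full agent analysis of one taxonomy branch."""
    branch, categories = item
    return branch, len(categories)


def analyze_all(branches: dict, max_workers: int = 4) -> dict:
    """Analyze every branch concurrently and collect results by branch name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(analyze_branch, branches.items()))


# Illustrative branches only.
report = analyze_all({
    "Electronics": ["Phones", "Chargers"],
    "Musical Instruments": ["Guitars"],
})
```

Because branches do not share state in this sketch, adding more workers scales throughput until an external limit, such as model rate limits, becomes the bottleneck.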
This matters most for emerging product categories. When new product types become popular on the platform, the system can quickly identify taxonomy gaps and propose comprehensive solutions, instead of reactive patches that build up debt.
Quality improvements
The multi-agent design improves consistency and comprehensiveness because it combines two lenses:
structural analysis keeps hierarchy organization logical and consistent
product-driven analysis keeps categories and attributes aligned with merchant reality
The automated QA layer reduces iteration cycles by catching issues before human review and applying domain expertise consistently.
Example: mobile phone accessories and MagSafe compatibility
Product analysis identified that merchants frequently advertise “MagSafe support” for accessories such as chargers, cases and wallets.
So the agent proposed adding a boolean attribute: ‘MagSafe compatible.’
A specialized electronics judge evaluated the proposal and checked:
no duplicate attribute already exists
boolean type is appropriate
while brand-specific, MagSafe is treated as a legitimate technical standard similar to Bluetooth or Qi
The judge approved the attribute with 93% confidence, noting it would improve customer filtering for MagSafe-ready products.
This example matters because it demonstrates the full loop:
merchant reality creates a signal
the agent proposes a structured change
a domain judge validates it with rule checks and domain framing
humans get a higher quality proposal to review
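The first two checks the judge ran in the MagSafe example are mechanical enough to sketch as plain rules. The third (whether a brand-specific term counts as a legitimate standard) is a judgment call that stays with the LLM judge, so it is not modelled here. The proposal shape, the allowed type list and the existing attribute names are assumptions.

```python
ALLOWED_TYPES = {"boolean", "enum", "number", "text"}  # hypothetical type list


def precheck_attribute(proposal: dict, existing_attributes: list) -> list:
    """Return the list of rule failures for a proposed attribute.
    An empty list means the proposal can move on to the domain judge."""
    failures = []
    existing = {name.strip().lower() for name in existing_attributes}
    if proposal["name"].strip().lower() in existing:
        failures.append("duplicate attribute")
    if proposal["type"] not in ALLOWED_TYPES:
        failures.append("unsupported type")
    return failures


magsafe = {"name": "MagSafe compatible", "type": "boolean"}
ok = precheck_attribute(magsafe, ["Color", "Material"])
dup = precheck_attribute(magsafe, ["Color", "magsafe compatible"])
```

Running cheap deterministic checks before the model-based judge keeps obvious duplicates from consuming judge capacity or reviewer attention.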
Scaling development: from reactive fixes to proactive evolution
The biggest shift is strategic: taxonomy development becomes proactive, not reactive.
Instead of waiting for a merchant pain point or a platform limitation to trigger a change, the system can identify and address gaps earlier.
The system can also reason over the entire taxonomy structure, which supports cross-category consistency. That helps avoid the fragmentation you get when teams fix issues in isolation.
To validate the approach, they applied it to a specific area: Electronics > Communications > Telephony (called “Telephony AI” in their analysis) and compared it against their previous manual expansion method.
As you can see from the chart, the AI-assisted method can compress years of work into weeks for the taxonomy area if the agents are applied across all verticals.
The full scoop
To learn more about this, check Shopify's Engineering Blog post on this topic.