How Uber Cut Invoice Handling Time by 70% with GenAI (Without Ditching Humans)
What started as a messy mix of bots and spreadsheets is now a smart, scalable pipeline powered by LLMs and human-in-the-loop design.
Fellow Data Tinkerers!
This article was originally published as a guest post on another newsletter around two weeks ago. Its author suggested doing a guest post and I was more than happy to oblige. You should check out his newsletter if you want to stay up to date on all the latest developments in AI. He is a machine when it comes to tracking the latest news and publishing it quickly for his subscribers. So if staying updated on all things AI is your jam, give it a go!
Now, let’s get to today’s deep dive on how Uber used GenAI in their invoice processing workflow.
TL;DR
Situation
Uber was drowning in supplier invoices, many of which still required manual processing despite existing automations and rule-based systems. Errors, slow approvals and escalating costs were becoming a problem.
Task
Build a scalable, accurate and flexible system that could handle diverse invoice formats and languages without constant manual intervention or custom rule-writing.
Action
Uber developed a GenAI-powered document processing platform called TextSense. It combined OCR, GPT-4, business rule post-processing and a Human-in-the-Loop (HITL) review layer, all wrapped in a modular architecture.
Result
The system halved manual handling, improved invoice accuracy to 90%, reduced processing time by 70% and saved 25–30% in operational costs.
Use Cases
Document processing, invoice automation, workflow automation
Tech Stack/Framework
TextSense, Cadence, OpenAI GPT-4, OCR
Explained Further
Why invoices still suck in 2025
If you’ve ever been involved in invoice processing, you know it’s not exactly a glamorous job. And when you’re Uber, you are dealing with thousands of suppliers around the globe and a mountain of invoices showing up every day. It becomes more than a grind. It’s a bottleneck.
At Uber, invoices aren’t just a back-office task. They directly impact how fast they pay vendors and how well they avoid errors that later snowball into finance headaches.
But to understand how they got there, you first need to know what they were working with.
The pre-GenAI mess
Before GenAI came into the picture, invoice processing at Uber looked like a Frankenstein’s monster of automation and manual intervention. They had Robotic Process Automation (RPA). They had Excel uploads. They had rule-based systems duct-taped to self-serve supplier portals. But despite all that, a large chunk of invoices still needed a human in the loop just to parse the basics.
This wasn’t sustainable - not with the volume they were dealing with and definitely not if they wanted to improve accuracy and speed without throwing more people at the problem.
That’s where GenAI came in. Their goal wasn’t just to "automate more." It was to replace brittle systems with something adaptive. Something that could actually read invoices the way a human would across different formats, languages and layouts. And then extract clean, structured data with minimal help.
The next step? Mapping the actual flow from purchase to payment and spotting the choke points.
The invoice journey: from PO to payment
For context, Uber’s procurement flow is pretty standard:
A supplier gets onboarded (country, currency, PO requirements, etc.)
Uber employees raise a PO for goods/services
Supplier submits an invoice in PDF against the PO
The invoice goes through approvals
The supplier gets paid

But the real bottleneck was step 3, when suppliers submitted invoices in different formats and file types. So what did these formats actually look like? Time to follow the submission paths.
Two ways in, one big headache
Suppliers had two submission paths:
Self-Serve Portal: Suppliers log in, search for the correct PO and upload an invoice.
Email-based submission: Suppliers email an invoice to Uber. Those emails go into a ticketing system, get tagged based on rules and then they are processed using RPA or manually.

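For the email path, the ticket-parsing step boils down to pulling attachments out of raw messages. Here is a minimal sketch using Python's standard `email` library; the function name and the PDF-only filter are assumptions for illustration, not Uber's actual service:

```python
import email
from email import policy

def extract_pdf_attachments(raw_email: bytes):
    """Parse a raw ticket email and return (filename, bytes) for each PDF."""
    msg = email.message_from_bytes(raw_email, policy=policy.default)
    pdfs = []
    for part in msg.iter_attachments():
        # Keep only PDF attachments; everything else stays in the ticket
        if part.get_content_type() == "application/pdf":
            pdfs.append((part.get_filename() or "invoice.pdf", part.get_content()))
    return pdfs
```

Everything downstream of this step (tagging, RPA, manual handling) then has to cope with whatever layout is inside that PDF, which is where the trouble starts.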
That second path is where most of the pain was coming from. But what exactly were these pain points?
Where it all falls apart
The problems fell into two major categories: business headaches and technical nightmares.
Business Headaches
Manual overload: Despite having “automated” options, too many invoices still needed someone to manually extract data.
Costs ballooned: More manual work = more hours = more dollars.
Slow handling times: Processing a single invoice could take way too long, especially if it needed review.
Error-prone: Humans make mistakes. Those mistakes sometimes led to payment issues, vendor frustration and messy reconciliations.
Technical Nightmares
Too many formats: Thousands of suppliers. Each with their own invoice layout.
25+ languages: This isn’t just an OCR problem, it’s an NLP problem.
Scanned docs and handwriting: Yep, people still send in grainy scans with pen annotations.
High field count: Each invoice had 15–20 fields plus multi-line item details, all of which needed to be right.
Multipage documents: Even more room for things to go wrong.
But wait, aren’t RPA bots supposed to take care of this?
Why RPA and rules weren’t enough
Adding one more rule for each new invoice format was like plugging holes in a sinking ship.
Let’s say you get a new invoice format from a supplier in Italy. With a rule-based system, someone has to sit down and write logic to handle the new layout. That logic has to be tested, deployed and maintained. Multiply that by every supplier, every country and every tweak and you’ve got a system that scales linearly with chaos.
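To make that concrete, here is a hypothetical sketch of what a rule-based extractor looks like: one hand-written pattern per supplier layout. The supplier keys and regex patterns below are invented for illustration:

```python
import re
from typing import Optional

# Hypothetical rule table: one hand-written pattern per supplier layout.
# Every new format means another rule to write, test, deploy and maintain.
SUPPLIER_RULES = {
    "acme_us": re.compile(r"Invoice No\.\s*(?P<number>\d+)"),
    "rossi_it": re.compile(r"Fattura n\.\s*(?P<number>\d+)"),  # the new Italian layout
}

def extract_invoice_number(supplier: str, text: str) -> Optional[str]:
    rule = SUPPLIER_RULES.get(supplier)
    if rule is None:
        return None  # unknown layout -> falls back to manual review
    match = rule.search(text)
    return match.group("number") if match else None
```

The rule table grows linearly with suppliers and layouts, which is exactly the maintenance burden Uber wanted to escape.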
RPA made sense in a simpler world. But Uber’s invoice world isn’t simple anymore. They needed to design a system that could learn patterns, generalize across formats and improve over time.
So what does a system like that actually look like? Uber laid down a few key principles.
Designing something that could actually scale
Instead of duct-taping more logic, Uber stepped back and asked: what should this system actually do? They designed the new invoice automation system with four key principles:
Accuracy: Extract clean data from messy formats using trained ML and GenAI models.
Scalability: Handle high volumes without an issue.
Flexibility: New formats? No problem. No need to rewrite a hundred if-else rules.
User experience: The system should not frustrate people doing the final review.
That’s why Uber didn’t stop at the backend. They designed a UI specifically for the humans reviewing GenAI’s output.
The interface lets users view the original PDF side-by-side with the extracted data. You can get through a review with just your eyes, not a dozen clicks and tab switches.
Design was only half the battle. The real lift came with turning it into a working platform.
Meet TextSense: Uber’s document intelligence engine
TextSense is Uber’s in-house document processing engine and is the real hero of this story. It’s modular, reusable and not invoice-specific. This means Uber can extend it to other document types down the line (contracts, receipts, etc.).
Here’s how it works, step-by-step:
1. Ingestion: Documents come in via emails, uploads or tickets. TextSense stores them in object storage for downstream processing.
2. Preprocessing: This step does OCR prep like resolution enhancement, format conversion, page separation and layout standardization.
3. OCR and CV: Uber’s Vision Gateway extracts the raw text even from blurry or handwritten scans.
4. AI/ML models: This is where LLMs step in. TextSense uses open-source and proprietary models to extract key fields. The goal isn’t just to extract text but to also understand the structure and meaning of invoices.
5. Post-Processing: Business logic kicks in. Validation, enrichment, PO cross-checking. This is where the messy real-world constraints are applied.
6. Human-In-The-Loop (HITL): Reviewers get a side-by-side UI with soft alerts and all the data in one place. The design favors eye movements over clicks, speeding up the review process.
7. Monitoring: Metrics like latency, accuracy by field and cost per invoice are all tracked. This feeds back into retraining decisions and model tuning.
The whole thing is tracked with KPIs like processing speed, accuracy and cost so it can be measured and improved.
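The staged design above can be sketched as a chain of interchangeable steps. This toy pipeline (all stage bodies are invented stand-ins, not TextSense internals) shows why the modular layout extends to other document types:

```python
# Toy sketch of a staged document pipeline. Each stage is a plain function
# over a shared doc dict, so stages can be swapped per document type
# (invoices today, contracts or receipts tomorrow).
def ingest(doc):
    doc["stored"] = True                     # stand-in: persist raw bytes to object storage
    return doc

def preprocess(doc):
    doc["pages"] = doc["raw"].split("\f")    # stand-in: page separation
    return doc

def run_ocr(doc):
    doc["text"] = " ".join(doc["pages"])     # stand-in for Vision Gateway OCR
    return doc

def extract_fields(doc):
    doc["fields"] = {"invoice_number": doc["text"].split()[0]}  # stand-in for the LLM step
    return doc

def post_process(doc):
    doc["needs_review"] = not doc["fields"]["invoice_number"].isdigit()  # stand-in rule
    return doc

STAGES = [ingest, preprocess, run_ocr, extract_fields, post_process]

def process(raw_text: str) -> dict:
    doc = {"raw": raw_text}
    for stage in STAGES:
        doc = stage(doc)
    return doc
```

Because each stage only reads and writes the shared document, adding a new document type mostly means swapping the extraction and post-processing stages.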
So how does this system actually work in production? Let’s follow an invoice through the pipeline
How it all comes together
Now let’s walk through what actually happens when an invoice hits the system:
If an invoice is manually uploaded via UI (by an employee), it hits the backend which sends it to TextSense.
If it comes through email (sent by a supplier), a service reads the ticket, parses the email, grabs the attachment and passes everything to TextSense.
TextSense processes the file, validates fields, applies enrichment and prepares it for human review.
Once reviewed and approved, it’s pushed into the ERP for payment.
Cool workflow but what model should actually power it? It was time to test some.
Which model did the best job?
Uber used past invoice data and their corresponding attachments as ground truth. There were two main datasets:
Structured data: The fields they want to extract (like invoice number, total amount, etc.) that are already in their systems
Unstructured data: the raw text pulled out of PDF invoices
They used a year’s worth of invoices, split 90/10 between training and test.
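As a quick sketch, a deterministic 90/10 split over the historical invoices might look like this; the fixed seed and shuffling are assumptions, since Uber's post doesn't describe how the split was made:

```python
import random

def split_dataset(invoices, test_fraction=0.1, seed=42):
    """Shuffle and split paired (invoice, ground-truth) examples 90/10."""
    rng = random.Random(seed)     # fixed seed so the split is reproducible
    shuffled = list(invoices)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]
```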
Once the data was ready, they started training. They fine-tuned a few open-source models, including seq2seq architectures like Flan-T5 as well as Llama 2, to figure out what worked best for invoice extraction. Flan-T5 looked decent at first and nailed the invoice headers (overall invoice information) with over 90% accuracy but couldn't keep up with the line items.
It turned out fine-tuning helped the models mimic some business rules but it also made them hallucinate when they didn’t know what to do. Not ideal when you're dealing with financial data.
Then they tested GPT-4. It didn’t pick up Uber-specific patterns (not surprising) but it was great at pulling out what was actually in the document. So instead of trying to train the model to follow all their internal rules, Uber flipped the script: let GenAI extract everything it can from the raw doc, then pass the results through a post-processing layer that applies all the business logic before showing it to a human for review.
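That "extract everything, validate afterwards" split can be sketched like this: treat the model output as raw candidates, then let a deterministic layer enforce business rules. The PO list, field names and checks below are invented for illustration:

```python
KNOWN_POS = {"PO-1001", "PO-1002"}  # invented stand-in for PO cross-checking

def post_process(extracted: dict) -> dict:
    """Apply business rules to raw model output before human review."""
    result = dict(extracted)
    result["po_valid"] = extracted.get("po_number") in KNOWN_POS
    # Normalization: strip currency formatting before numeric validation
    raw_total = str(extracted.get("total", "")).replace("$", "").replace(",", "")
    try:
        result["total"] = float(raw_total)
    except ValueError:
        result["total"] = None
    # Anything that fails a rule is routed to the human reviewer
    result["needs_review"] = not result["po_valid"] or result["total"] is None
    return result
```

The key design choice: the model never has to learn internal rules that change over time; the rules live in ordinary code where they can be tested and updated independently.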
They compared the fine-tuned open-source model with GPT-4 and laid it out in a cost-benefit table:
Looking at the table, GenAI models (e.g. GPT-4) were the clear winners. While the open-source model slightly edged out GPT-4 on header accuracy, GPT-4 did a much better job on line items.
With the winner picked, the question became: will it actually work under real-world pressure? Time to look at the results.
The payoff: accuracy, speed and savings
Accuracy was tracked at two levels:
Header fields: Invoice number, PO number, date, total
Line items: Description, quantity, unit price, line total
Some fields required exact matches (e.g. invoice number), others allowed fuzzy matching (invoice description). This granularity helps Uber track where models are strong and where they need improvement.
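A minimal sketch of that two-tier scoring: exact comparison for identifiers, fuzzy similarity for free-text fields. The field sets and the 0.85 threshold are assumptions, not Uber's published values:

```python
from difflib import SequenceMatcher

EXACT_FIELDS = {"invoice_number", "po_number", "total"}
FUZZY_THRESHOLD = 0.85  # assumed cutoff for "close enough" text fields

def field_correct(field: str, predicted: str, truth: str) -> bool:
    """Exact match for identifiers, fuzzy ratio for descriptions."""
    if field in EXACT_FIELDS:
        return predicted == truth
    ratio = SequenceMatcher(None, predicted.lower(), truth.lower()).ratio()
    return ratio >= FUZZY_THRESHOLD
```

Scoring each field separately is what lets the team see, for example, that headers are nearly solved while line items still need work.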
And what were the results when it was all said and done:
2x reduction in manual invoice handling
90% overall accuracy, with 35% of invoices hitting 99.5% accuracy and 65% above 80%
70% reduction in average handling time
25–30% cost savings compared to the old process
This wasn’t just a tech win. Reviewers weren’t spending hours digging through PDFs. The ERP integration became faster. Supplier trust improved.
So what did Uber learn from all this? Three takeaways stand out.
Lessons learned
Don’t fight complexity with rules: Rule-based systems might get you part of the way but they fall apart fast when formats, languages and layouts start to multiply.
Humans still have a role: Instead of aiming for a mythical 100% automation rate, Uber leaned into smart human review that kept the quality high
GenAI needs guardrails: GPT-4 performed well but post-processing logic, accuracy monitoring and feedback loops made it usable in production. The magic isn’t just in the model. It’s in the system around it.
The full scoop
To learn more about the implementation, check Uber’s post on this topic
Keep learning
How Reddit Scans 1M+ Images a Day to Flag NSFW Content Using Deep Learning
Reddit needed to flag NSFW images the second they were uploaded. They built a deep learning system that does exactly that: fast, scalable and battle-tested in prod. Here's how it works.
How Walmart Automated 400+ Forecasts and Cut Runtime by Half
Learn how their Autotuning Framework slashed errors, halved processing time and scaled across hundreds of time series without manual tuning.