How Grab Shrunk Real-Time Queries from 5 Minutes to 1 with FlinkSQL and Kafka

With SQL as the interface, analysts and engineers can now explore streams and deploy pipelines in under 10 minutes.

Aug 21, 2025

Fellow Data Tinkerers!

Today we will look at how Grab made real-time processing faster for its users.

But before that, I wanted to share with you what you could unlock if you share Data Tinkerer with just 1 more person.

There are 100+ resources to learn all things data (science, engineering, analysis). It includes videos, courses, projects and can be filtered by tech stack (Python, SQL, Spark and etc), skill level (Beginner, Intermediate and so on) provider name or free/paid. So if you know other people who like staying up to date on all things data, please share Data Tinkerer with them!

Refer a friend

Now, with that out of the way, let’s get to real-time processing at Grab!

man in green and black jacket wearing black helmet — Photo by Kseniia Ilinykh on Unsplash

TL;DR

Situation

Grab’s early attempt at stream exploration used Zeppelin notebooks on Flink. It worked for single users but had major issues: version drift, 5-minute cold starts and poor integration with other platforms. Adoption stalled.

Task

Build a shared, low-latency, and production-ready way for analysts, engineers, and internal platforms to query and deploy Flink pipelines in real time without the friction of isolated notebooks.

Action

Replaced per-user Zeppelin clusters with a shared FlinkSQL gateway.
Added an integration layer: custom control plane + REST APIs handling auth, orchestration and B2B headless queries.
Built a query layer: SQL editor + Hive Metastore mapping Kafka topics to tables for easy querying.
Enabled productionisation: users express stream logic as SQL, configure connectors, set resources, and deploy pipelines in ~10 minutes.

Result

Adoption rose: more interactive queries, more production pipelines created.
Real-time use cases unlocked across fraud detection, model validation, and event debugging.
Lower barrier to entry for analysts and developers.

Use Cases

Fraud detection, model validation, event streaming, message validation

Tech Stack/Framework

Apache Flink, Hive Metastore, Apache Kafka, SQL

Explained further

About Grab

Grab is often called the Uber of Southeast Asia but that might be selling it short. What started as a ride-hailing app now powers food delivery, groceries, payments and even insurance all bundled into one super app. They run across over 800 cities in 8 Southeast Asian countries. Behind the rides, meals, and payments lies an enormous stream of events flowing through Grab’s systems.

Background

For a company with Grab’s scale, real-time processing is table-stakes. Apache Flink earned its reputation by handling streams with state, time and scale. The Grab team leaned into that and built an interactive FlinkSQL setup that covers the whole path from exploration to production. The goals were clear: speed up the first result, make upgrades less painful and let more people touch streaming data without needing to learn the guts of Flink.

This article walks through that journey. It starts with a notebook-first world, runs into the wall you’d expect, swaps in a shared FlinkSQL gateway then layers on an API, an interactive UI and a configuration-driven way to ship streaming SQL to production. The net effect is lower time-to-insight, fewer moving parts for users and less friction when the Flink team bumps versions.

The Zeppelin days

Last year the team introduced Zeppelin notebooks for Flink. That work was aimed at downstream users who wanted to explore data quickly. But as use cases matured, a few issues started showing up:

Version drift headaches

Zeppelin is maintained by a community that isn’t the Flink community. At the time of writing, Zeppelin supported Flink 1.17 while Flink itself was already at 1.20. That mismatch made upgrades awkward. Every version gap turns into glue code, testing and the sort of workaround nobody enjoys maintaining.

The cold start problem

The design spun up a Zeppelin cluster per user on demand. Cold starts took around five minutes before a notebook was usable. That’s a non-starter for teams who need quick reads on production streams. Unsurprisingly, uptake lagged.

Integration issues

Zeppelin worked fine for a solo developer. It didn’t play as nicely with other internal platforms. Dashboards and automated pipelines needed a way to pull aggregates out of Kafka without copy-pasting code or exporting data manually. The notebook lived off to the side which meant people needed a workaround every time they wanted to reuse results or plug the flow into existing tools. Even something as basic as sending real-time metrics into a monitoring system wasn’t seamless.

Enter FlinkSQL interactive

The team replaced the per-user Zeppelin clusters with a shared FlinkSQL gateway cluster. They trimmed features they didn’t need and kept what mattered for data access and collaboration. The architecture is split into three layers:

Compute layer
Integration layer
Query layer

Shared FlinkSQL gateway architecture (Source: Grab)

The flow is relatively straightforward:

A user opens the platform portal and submits a SQL query against data in the Kafka online store
The backend orchestrator creates a session
and submits the SQL to the FlinkSQL gateway through its REST API
The gateway packages the SQL into a Flink job and submits it to a Flink session cluster. Results are polled and streamed back to the query layer.

Now let’s get to each of the layers.

Compute layer

With the FlinkSQL gateway as the compute engine for ad-hoc queries, upgrades got easier. The gateway ships with the main Flink distribution, so the team doesn’t have to maintain shims between a Zeppelin cluster and the Flink cluster. One fewer custom adapter is one fewer source of drift.

Cold starts also dropped. Instead of booting a fresh Notebook world per user, everyone shares the same FlinkSQL cluster. No cluster boot during session init means the time to first results fell from roughly five minutes to about one minute. There’s still a small delay since task managers are provisioned on demand to balance availability with cost but the order-of-magnitude improvement is what users feel.

Integration layer

The integration layer is the glue. It sits between the user-facing query layer and the compute layer, handling session lifecycle, authentication, authorisation and orchestration across internal platforms.

The FlinkSQL gateway has a REST API which is useful for basic submission but the pattern required multiple POSTs to do anything nontrivial, including fetching results. To smooth that out, the team added a custom control plane with its own REST APIs on top of the gateway.

This control plane integrates with in-house auth systems. Each query is authenticated, a lightweight session is created and the plane manages communication with the Flink Session Cluster.

Example API request for running a FlinkSQL query

The integration layer also supports B2B needs with Headless APIs. Programs can POST a SQL query, receive an operation ID then use GET requests to receive paginated results for an unbounded query. That pattern lets internal platforms tap Kafka programmatically without learning Flink’s native interface. It’s a small thing that removes a lot of friction.

Query layer

APIs are great. Humans still like a screen. The team paired those APIs with an interactive UI that serves human workflows end to end.

Users land in a clean SQL editor inside the platform portal. A Hive Metastore catalog maps Kafka topics to tables, so analysts don’t need to decode stream internals. Pick a table, write SQL, hit run. The integration layer routes the query through the control plane to the gateway. Results show up in the UI within about a minute which is a very different experience from waiting out a five-minute cold start.

The flow is straightforward. Here’s an example query that pulls a simple sanity check on a stream most folks recognise:

Getting Titanic data from a Kafka stream with a short command

That setup unlocked common cases across teams:

Fraud analysts debug and spot patterns in suspect transactions in real time
Data scientists pull live signals to validate model behavior
Engineers confirm messages are structured correctly and delivered as expected

The key is that all three groups share the same mental model. Kafka topics look like tables. SQL does the heavy lifting. The UI stays out of the way.

From exploration to production

Exploration is good. You still need a way to ship. As more teams built on the online store and used the interactive tools to design pipelines expressed as SQL, the last mile became the next focus. The team built a way to create configuration-based stream processing pipelines with business logic expressed in a SQL statement.

Portal for FlinkSQL pipeline creation (Source: Grab)

There’s a portal for pipeline creation. Connectors are hosted for internal platforms like Kafka and feature stores. Users can pick what they need off the shelf, set configuration and deploy their streaming pipeline. The logic itself is a SQL statement. The example in the diagram is a simple filter on a Kafka stream with the filtered events written to a separate Kafka stream.

Users can define the parallelism and the resources for their Flink jobs. On submission, resources are provisioned and the pipeline is deployed. Behind the scenes, the platform manages the application JAR that parses the configuration and translates it into a Flink job graph which gets submitted to the cluster. The moving parts are hidden, which is the kindest thing you can do for your colleagues.

From start to finish, a user can deploy a stream processing pipeline to production in roughly ten minutes.

Why the approach works

Nothing here depends on a silver bullet. It’s the sum of cleaner defaults.

Shared gateway reduces cold starts and cuts version drift
Control plane wraps the gateway in APIs that match real workflows
Headless endpoints let other platforms query streams without glue code
Interactive UI makes SQL on Kafka feel like SQL on any other table
Config-driven deploys turn a working query into a production job without hand-rolling jars per team

Every step makes the system easier to touch, which draws in more users. Once the path from idea to production is short enough, people walk it.

The impact after release

After the release, the team saw an uptick in pipelines created and more interactive queries. That’s the outcome you want. More pipelines mean more use cases moved from we should to we did. More interactive queries mean people are actually using the thing when they need real-time answers.

The tradeoffs

The team removed some of what notebooks offered and focused on features that push data access to more people. That means a shared control plane and a gateway, not a personal compute island per user. In return, teams get faster first results, easier upgrades, and a route to production that matches how they already think about SQL.

Key takeaways

With a shared FlinkSQL gateway, a control plane that fits real workflows, an interactive UI and a configuration-driven deployment path, the team simplified how people work with streaming data.

The changes cut time to first result from about five minutes to about one minute, reduced the pain of version upgrades, and gave both humans and systems a clean way to query Kafka.

For users who just need an answer, the UI makes it feel like querying any other table. For developers and internal platforms, the headless APIs keep the loop tight. When a query deserves to live in production, the pipeline portal turns that SQL into a running job within ten minutes.

The full scoop

To learn more about this, check Grab's Engineering Blog post on this topic

If you are already subscribed and enjoyed the article, please give it a like and/or share it others, really appreciate it 🙏

Keep learning

Data Engineering

How Expedia Monitors 1000+ A/B Tests in Real Time with Flink and Kafka

Data Tinkerer

Jul 24

How Expedia Monitors 1000+ A/B Tests in Real Time with Flink and Kafka

Expedia had a problem: bad A/B tests were slipping through the cracks in the first 24 hours and hurting revenue. So they built a real-time circuit breaker using to monitor 1,000+ experiments.

If you want to learn how it works and what makes it tricky, check out this article!

Read full story

Data Engineering

How Bolt Reconciles €2B in Revenue Using Airflow, Spark and dbt

Data Tinkerer

Jul 3

How Bolt Reconciles €2B in Revenue Using Airflow, Spark and dbt

Bolt processes billions in payments across 45+ countries, but their reconciliation system doesn’t rely on magic (or prayers). It runs on batch jobs, Delta tables and smart modelling choices.
If you're into financial data engineering, this one’s worth your time.

Read full story

Data Tinkerer

How Grab Shrunk Real-Time Queries from 5 Minutes to 1 with FlinkSQL and Kafka

With SQL as the interface, analysts and engineers can now explore streams and deploy pipelines in under 10 minutes.

TL;DR

Situation

Task

Action

Result

Use Cases

Tech Stack/Framework

Explained further

About Grab

Background

The Zeppelin days

Enter FlinkSQL interactive

From exploration to production

Why the approach works

The impact after release

The tradeoffs

Key takeaways

The full scoop

Keep learning

How Expedia Monitors 1000+ A/B Tests in Real Time with Flink and Kafka

How Bolt Reconciles €2B in Revenue Using Airflow, Spark and dbt

Discussion about this post