What the Data Crowd Was Reading in August 2025
Tools, techniques and deep dives worth reading that I came across in August 2025.
Fellow Data Tinkerers
It’s time for another round-up on all things data!
But before that, I wanted to share with you what you could unlock if you share Data Tinkerer with just 1 more person.
There are 100+ resources to learn all things data (science, engineering, analysis). They include videos, courses and projects, and can be filtered by tech stack (Python, SQL, Spark, etc.), skill level (Beginner, Intermediate and so on), provider name or free/paid. So if you know other people who like staying up to date on all things data, please share Data Tinkerer with them!
Without further ado, let’s get to the round-up for August.
Data science & AI
The Circuits Research Landscape: Results and Perspectives (39 minute read)
Researchers from Anthropic, Google DeepMind and other companies tested easy-to-use ‘attribution graphs’ that show the paths models take to reach an output. They found real wins (seeing multi-step reasoning) plus gaps (fragile, local views) and shared tools, tips and next steps to make the analysis broader and more reliable.
How a Tiny AI Startup is beating OpenAI by Redefining Intelligence (31 minute read)
This piece does a great job showing how Verses AI’s Axiom redefines intelligence by building causal world models. It runs 140× faster and 5,000× cheaper than GPT and works on edge hardware.
Breaking Down Context Engineering (15 minute read)
This piece breaks down context engineering as the craft of feeding AI agents just enough of the right info. Too much or messy context and they derail.
Achieving 10,000x training data reduction with high-fidelity labels (9 minute read)
The team at Google show how they cut LLM fine-tuning from 100K to a few hundred curated samples, boosting expert alignment by up to 65% and slashing training costs.
Building Agents for Small Language Models: A Deep Dive into Lightweight AI (18 minute read)
A hands-on guide to building reliable local AI agents with small open models that favors simplicity, safety layers and code-driven logic over fancy chain-of-thought reasoning.
Why Stacking Sliding Windows Can't See Very Far (20 minute read)
Sliding-window attention looks like it should scale to huge contexts but in practice information fades fast and residuals lock models into a short memory. Most models only really recall 1–2× the window size.
LLM Evaluation: Practical Tips at Booking.com (11 minute read)
Booking.com’s team shares how they built a Judge-LLM framework: start with a carefully labeled golden dataset, train a strong model to mimic human judgments, then use it to evaluate other LLMs at scale.
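For a rough feel of the "golden dataset" step, here is a minimal sketch of checking how closely a judge model's labels track human labels before trusting it at scale. This is not Booking.com's code: the function name `judge_agreement`, the toy labels and the choice of Cohen's kappa as the agreement metric are all assumptions made for illustration.

```python
from collections import Counter


def judge_agreement(golden, judged):
    """Return (agreement_rate, cohen_kappa) between human and judge labels.

    Hypothetical helper: `golden` holds human labels from the golden dataset,
    `judged` holds the judge model's labels for the same items.
    """
    if len(golden) != len(judged) or not golden:
        raise ValueError("label lists must be non-empty and the same length")

    n = len(golden)
    # Share of items where the judge matches the human label.
    observed = sum(g == j for g, j in zip(golden, judged)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    golden_counts = Counter(golden)
    judged_counts = Counter(judged)
    expected = sum(
        golden_counts[label] * judged_counts[label] for label in golden_counts
    ) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so report perfect agreement by convention in this sketch.
        return observed, 1.0

    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Toy data, invented for the example.
golden_labels = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
judge_labels = ["good", "good", "bad", "bad", "bad", "bad", "good", "good"]

accuracy, kappa = judge_agreement(golden_labels, judge_labels)
print(f"agreement={accuracy:.2f}, kappa={kappa:.2f}")
```

Raw agreement can look flattering when one label dominates, which is why the sketch also reports a chance-corrected score. Which metric and threshold make sense will depend on the task.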
From GPT-2 to gpt-oss: Analyzing the Architectural Advances (27 minute read)
This piece unpacks OpenAI’s gpt-oss release, tracing its journey from GPT-2 roots to today’s more efficient, open-weight models that rival Qwen3.
Ranking the Chinese Open Model Builders (11 minute read)
The authors rank 19 Chinese AI labs, from DeepSeek and Qwen to up-and-comers like Kimi and Zhipu, mapping who’s shaping the country’s open-model race.
The "Duolingo Model" for Retention (7 minute read)
This piece explains that the “Duolingo Model” treats retention as state transitions rather than a single churn metric, giving teams earlier signals, richer insights and scenario planning.
How Uber Built an AI Agent That Answers Financial Questions in Slack (12 minute read)
Everyone’s talking about AI agents but most examples are still in prototypes or slide decks. Here’s one running in production at Uber: Finch, a Slack-based AI agent that turns plain-language finance questions into governed real-time answers.
Data engineering
5 Things in Data Engineering That Still Hold True After 10 Years (10 minute read)
This piece argues that tech stacks change but the fundamentals don’t. Quality data, good modeling and business alignment still decide who actually wins.
How to Succeed in Data Engineering Interviews (9 minute read)
The authors dive into how to succeed in data engineering interviews (hint: it’s not name-dropping buzzwords).
Where does your task run in Apache Airflow? (11 minute read)
This piece breaks down Airflow executors, from Sequential to Kubernetes, showing how each trades simplicity, scalability and isolation in running your tasks.
Why Semantic Layers Matter and How to Build One with DuckDB (21 minute read)
Simon shows how to build a tiny but mighty semantic layer, here with simple YAML + Python on DuckDB over 20M NYC taxi rows.
Creating AI agent solutions for warehouse data access and security (13 minute read)
Meta’s data warehouse now has agents on both sides: one helps users find and request data, the other enforces policies.
OpenConnect: LinkedIn’s next-generation AI pipeline ecosystem (15 minute read)
LinkedIn unveils OpenConnect, a new AI pipeline ecosystem that cuts model launch times from 14 minutes to 30 seconds and now powers 100% of its AI workloads across 100k+ monthly executions.
Full vs Incremental Pipelines (7 minute read)
This piece makes clear that no connector or SaaS magic solves pipeline design: you’re always trading off simplicity, reliability and expense.
Spotify Data Tech Stack (5 minute read)
This piece shows how Spotify wrangles trillions of daily events with PubSub, Beam, Flyte and BigQuery to keep 38K pipelines and 5K dashboards running.
How Grab Shrunk Real-Time Queries from 5 Minutes to 1 with FlinkSQL and Kafka (11 minute read)
Cold starts, version drift and clunky notebooks: Grab hit all the classic headaches of streaming at scale. Here’s how they fixed it with FlinkSQL + Kafka.
Data analysis and visualisation
Why Did the Metric Drop? (6 minute read)
The authors do a great job showing how a simple, well-designed dashboard turns a 6-step metric framework into clear insights.
Vibe Analysis (12 minute read)
This piece sketches a future where vibe analysis cuts grunt work in half but leaves the real leverage in judgment, intuition and clear storytelling.
Can analysis ever be automated? (16 minute read)
This piece points out the catch-22 of AI analysis: unlike code or apps, you can’t “test” a chart. You either trust the math and the maker or you don’t.
A different take from
Miscellaneous
Mass Intelligence (12 minute read)
This piece argues the real shift is happening quietly: AI has gone from rare to everywhere and that scale is the disruption.
Why AI's IMO gold medal is less informative than you think (13 minute read)
This piece points out that AI’s IMO gold medal is misleading because the problems were easier than usual, so it shows reliability rather than new reasoning skills.
Evolving from Data to AI Governance (11 minute read)
This piece explains that AI governance can’t just copy data governance because AI moves faster, shifts dynamically and carries higher ethical and regulatory stakes.
Quick favor - need your take
Was there any standout article or topic from August I missed? Feel free to drop a comment or hit reply, even a quick line helps.
If you are already subscribed and enjoyed the article, please give it a like and/or share it with others. Really appreciate it 🙏
Keep learning
What the Data Crowd Was Reading in July 2025
Here are the highlights from July 👇
Data Science & AI – LLMs dissected, Kimi K2, context engineering and Google’s MLE-STAR.
Data Engineering – LinkedIn replaces Kafka, Agentic AI explained, survival tips and semantic layers.
Data Analysis & BI – Metric Trees flop, BI metrics redefined, MCP reshapes tools, analysts show impact.
Miscellaneous – AI adoption stats, CDO game, and a simulator for job jugglers.