What the Data Crowd Was Reading in September 2025
Tools, techniques and deep dives worth reading that I came across in September 2025.
Fellow Data Tinkerers
It’s time for another round-up on all things data!
But before that, I wanted to show you what you could unlock by sharing Data Tinkerer with just one more person.
There are 100+ resources for learning all things data (science, engineering, analysis). The collection includes videos, courses and projects, and can be filtered by tech stack (Python, SQL, Spark, etc.), skill level (Beginner, Intermediate and so on), provider, or free/paid. So if you know other people who like staying up to date on all things data, please share Data Tinkerer with them!
Without further ado, let’s get to the round up for September.
Data science & AI
Meta’s Data Scientist’s Framework for Navigating Product Strategy as Data Leaders (10 minute read)
Learn how Meta data scientists act as product leaders, adapting strategy across four quadrants of data availability and problem clarity.
23 RAG Pitfalls and How to Fix Them (18 minute read)
This piece maps out 23 common RAG pitfalls, from bad chunking to hallucinations, and shows how to fix them for more reliable apps.
Post-training 101 (40 minute read)
This is a hitchhiker’s guide to turning base LLMs into instruction-following models, covering SFT, RLHF, RLVR and how to actually evaluate them.
The Kaggle Grandmasters Playbook: 7 Battle-Tested Modeling Techniques for Tabular Data (14 minute read)
Kaggle Grandmasters share 7 tricks for tabular data, from smarter EDA to stacking and pseudo-labeling, distilled from years of leaderboard wins.
Writing effective tools for agents — with agents (16 minute read)
Anthropic shows how to design, evaluate and optimise tools with agents, turning Claude into both the user and the co-developer for better agent performance.
Why language models hallucinate (8 minute read)
OpenAI folks argue hallucinations persist because benchmarks punish humility and reward blind guessing, urging evals that give credit for uncertainty instead of pushing models to confidently invent facts.
Getting AI to Work in Complex Codebases (18 minute read)
Dex Horthy argues that AI coding can succeed in complex codebases today by using context engineering techniques like frequent intentional compaction and human-guided workflows.
How OpenAI uses Codex (10 minute read)
Check out the use cases OpenAI has developed for Codex and see how you can get more out of it.
Coding as the epicenter of AI progress and the path to general agents (15 minute read)
The author highlights how GPT-5-Codex and coding agents mark a turning point, moving AI from flashy benchmarks to everyday software building that actually changes how we code.
How Netflix Used Deep Learning to Slash Video Quality Control Time by 90% (9 minute read)
Learn how Netflix built a neural net trained on synthetic and real footage to spot hot pixels, cutting video quality control time by 90% and freeing creatives from tedious frame checks.
Data engineering
Revisiting Medallion Architecture (9 minute read)
This one revisits the Medallion Architecture, adding a Platinum layer for real-time, ML-driven use cases and warning against common pitfalls in misusing Bronze, Silver and Gold.
Past years data engineering and current trends (10 minute read)
A tour of 2025 data engineering trends, from caching layers and GenAI automations to SaaS stacks, GPU/ARM platforms, open table formats and AI-native warehouses.
Understanding Apache Fluss (32 minute read)
Jack Vanlightly breaks down Apache Fluss, a new Flink table store that tackles low-latency changelogs and stitches real-time data with lakehouse tiers like Paimon and Iceberg.
Streaming and the RAD Stack (10 minute read)
Lessons from building streaming systems with the RAD stack (Rust, Arrow, DataFusion), showing big performance gains but also challenges in connectors, plugins and streaming semantics.
Honest review of MotherDuck (8 minute read)
A no-BS review of MotherDuck, praising its seamless UX, easy Airflow integration and how it makes DuckDB in the cloud feel idiot-proof.
Data is Political (5 minute read)
A good reminder that succeeding in data roles takes more than data modeling; practitioners need to act as builders, salespeople or servants depending on the context.
Implementing IAM as a Data Engineer: A Practical Example (7 minute read)
Robert Long’s piece shows how data engineers can design secure, least-privilege IAM for Azure Storage by defining personas, mapping roles and implementing everything cleanly with Terraform.
The SELECT FOR UPDATE Trap Everyone Falls Into (8 minute read)
Anton Borisov shows why SELECT FOR UPDATE kills concurrency in Postgres and argues most cases should use FOR NO KEY UPDATE instead.
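To make the distinction concrete, here is a minimal sketch of my own (not taken from the article), assuming a hypothetical orders table referenced by an order_items foreign key. In Postgres, FOR UPDATE takes the strongest row lock, which blocks concurrent inserts of child rows because those inserts need a FOR KEY SHARE lock on the parent row; FOR NO KEY UPDATE does not conflict with that lock, so the inserts go through.

```sql
-- Hypothetical schema for illustration: orders referenced by order_items.
CREATE TABLE orders      (id bigint PRIMARY KEY, status text);
CREATE TABLE order_items (id bigint PRIMARY KEY,
                          order_id bigint REFERENCES orders (id));
INSERT INTO orders VALUES (1, 'new');

-- Session A: lock the order row before updating a non-key column.
BEGIN;
-- FOR UPDATE takes the strongest row lock, so a concurrent INSERT into
-- order_items (which needs FOR KEY SHARE on the parent row) will block.
SELECT * FROM orders WHERE id = 1 FOR UPDATE;

-- FOR NO KEY UPDATE is enough when key columns stay untouched,
-- and it does not block those child-row inserts:
-- SELECT * FROM orders WHERE id = 1 FOR NO KEY UPDATE;

UPDATE orders SET status = 'paid' WHERE id = 1;
COMMIT;

-- Session B, running concurrently: waits under FOR UPDATE,
-- proceeds immediately under FOR NO KEY UPDATE.
INSERT INTO order_items (id, order_id) VALUES (100, 1);
```

The rule of thumb from this: reserve FOR UPDATE for transactions that actually change or delete key columns; for everything else, the weaker lock preserves concurrency.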
How Shopify Uses Change Data Capture to Serve Millions of Merchants (15 minute read)
Find out how Shopify rebuilt CDC with Debezium + Kafka to stream 100k events/sec at 400TB scale, cutting merchant data freshness from 24h to ~1h.
Data analysis and visualisation
Lessons on building an AI data analyst (13 minute read)
Pedro shares lessons from building an AI data analyst: text-to-SQL isn’t enough. Success needs a semantic layer, multi-agent planning, precise retrieval and hybrid model routing for real-world BI.
One Perspective That Separates Good BI Developers from Great Ones (7 minute read)
The author argues that great BI isn’t about cramming dashboards with metrics; it’s about designing them to drive decisions.
The Data Analyst’s Dilemma: Accuracy vs Speed (7 minute read)
The analyst’s dilemma: knowing when to aim for perfect accuracy and when “good enough” is all the business really needs.
Miscellaneous
Flooding the AI Frontier (7 minute read)
This piece argues that China’s flood of free open-weight LLMs mirrors its manufacturing playbook: boost domestic capability while undercutting US labs’ ability to profit from frontier AI.
How I got the highest score on ARC-AGI again swapping Python for English (9 minute read)
The author explains how swapping Python for plain English in his evolutionary test-time compute system set a new SoTA on ARC-AGI, edging models closer to true reasoning and generalisation.
How can we get enough data to train a robot GPT? (11 minute read)
An exploration of how scaling robot fleets, simulation and human video could close the data gap and make training a “Robot GPT” on trillions of tokens feasible.
Quick favor - need your take
Was there any standout article or topic from September I missed? Feel free to drop a comment or hit reply, even a quick line helps.
If you are already subscribed and enjoyed the article, please give it a like and/or share it with others, I really appreciate it 🙏
Keep learning
What the Data Crowd Was Reading in August 2025
Here are the highlights from August 👇
Data Science & AI – Circuits research gaps, causal world models vs OpenAI, Google’s label trick for 10,000× less data and context engineering for LLMs.
Data Engineering – 10-year truths of DE, Airflow executors explained, Meta’s dual warehouse agents and LinkedIn’s OpenConnect speeding model launches.
Data Analysis & BI – A 6-step “why metric dropped” framework, vibe analysis for storytelling and the catch-22 of automating analysis.
Miscellaneous – AI governance ≠ data governance, scale as the real disruptor and why AI’s IMO gold medal might not mean much.
What the Data Crowd Was Reading in July 2025
Here are the highlights from July 👇
Data Science & AI – LLMs dissected, Kimi K2, context engineering and Google’s MLE-STAR.
Data Engineering – LinkedIn replaces Kafka, Agentic AI explained, survival tips and semantic layers.
Data Analysis & BI – Metric Trees flop, BI metrics redefined, MCP reshapes tools, analysts show impact.
Miscellaneous – AI adoption stats, CDO game and a simulator for job jugglers.