Benchmarking AI Models is an episode from Linear Digressions by Ben Jaffe and Katie Malone.
Published Mar 30, 2026, 00:29:55 long, audio available.
How do you know if a new AI model is actually better than the last one? It turns out answering that question is a lot messier than it sounds. This week we dig into the world of LLM benchmarks — the standardized tests used to compare models — exploring two canonical examples: MMLU, a 14,000-question multiple choice gauntlet spanning medicine, law, and philosophy, and SWE-bench, which throws real GitHub bugs at models to see if they can fix them. Along the way: Goodhart's Law, data contamination, canary strings, and why acing a test isn't always the same as being smart.