
Are AI Benchmarks Telling The Full Story? [SPONSORED] (Andrew Gordon and Nora Petrova - Prolific)

Machine Learning Street Talk by Machine Learning Street Talk (MLST)

Dec 20, 2025 | 00:16:04 | Technology



Episode Details

Published Dec 20, 2025, 00:16:04 long, audio available.

Questions About This Episode

What is Are AI Benchmarks Telling The Full Story? [SPONSORED] (Andrew Gordon and Nora Petrova - Prolific) about?

Is a car that wins a Formula 1 race the best choice for your morning commute? Probably not. In this sponsored deep dive with Prolific, we explore why the same logic applies to Artificial Intelligence. While models are currently shattering records on technical exams, they often fail the most important test of all: **the human experience.**

Why High Benchmark Scores Don’t Mean Better AI

Joining us are **Andrew Gordon** (Staff Researcher in Behavioral Science) and **Nora Petrova** (AI Researcher) from **Prolific**. They reveal the hidden flaws in how we currently rank AI and introduce a more rigorous, "humane" way to measure whether these models are actually helpful, safe, and relatable for real people.

---

Key Insights in This Episode:

* *The F1 Car Analogy:* Andrew explains why a model that excels at "Humanity's Last Exam" might be a nightmare for daily use. Technical benchmarks often ignore the nuances of human communication and adaptability.
* *The "Wild West" of AI Safety:* As users turn to AI for sensitive topics like mental health, Nora highlights the alarming lack of oversight and the "thin veneer" of safety training, citing recent controversial incidents like Grok-3’s "Mecha Hitler."
* *Fixing the "Leaderboard Illusion":* The team critiques popular rankings like Chatbot Arena, discussing how anonymous, unstratified voting can lead to biased results and how companies can "game" the system.
* *The Xbox Secret to AI Ranking:* Discover how Prolific uses *TrueSkill*, the same algorithm Microsoft developed for Xbox Live matchmaking, to create a fairer, more statistically sound leaderboard for LLMs.
* *The Personality Gap:* Early data from the **HUMAINE Leaderboard** suggests that while AI is getting smarter, it is actually performing *worse* on metrics like personality, culture, and "sycophancy" (the tendency for models to become annoying "people-pleasers").

---

About the HUMAINE Leaderboard

Moving beyond simple "A vs. B" testing, the researchers discuss their new framework that samples participants based on *census data* (Age, Ethnicity, Political Alignment). By using a representative sample of the general public rather than just tech enthusiasts, they are building a standard that reflects the values of the real world.

*Are we building models for benchmarks, or are we building them for humans? It’s time to change the scoreboard.*

---

TIMESTAMPS:

00:00:00 Introduction & The Benchmarking Problem
00:01:58 The Fractured State of AI Evaluation
00:03:54 AI Safety & Interpretability
00:05:45 Bias in Chatbot Arena
00:06:45 Prolific's Three Pillars Approach
00:09:01 TrueSkill Ranking & Efficient Sampling
00:12:04 Census-Based Representative Sampling
00:13:00 Key Findings: Culture, Personality & Sycophancy

---

REFERENCES:

Paper:
[00:00:15] MMLU
[00:05:10] Constitutional AI
[00:06:45] The Leaderboard Illusion
[00:09:41] HUMAINE Framework Paper

Company:
[00:00:30] Prolific
[00:01:45] Chatbot Arena

Person:
[00:00:35] Andrew Gordon
[00:00:45] Nora Petrova

Algorithm:
[00:09:01] Microsoft TrueSkill

Leaderboard:
[00:09:21] Prolific HUMAINE Leaderboard
[00:09:31] HUMAINE HuggingFace Space
[00:10:21] Prolific AI Leaderboard Portal

Dataset:
[00:09:51] Prolific Social Reasoning RLHF Dataset

Organization:
[00:10:31] MLCommons
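
For readers curious about what a TrueSkill-style ranking involves, the sketch below shows the standard two-player, no-draw TrueSkill update applied to a single head-to-head comparison between two models. It is an illustration only, not Prolific's implementation: the names `Rating` and `update`, the model labels, and the choice of the commonly cited default constants (mean 25, standard deviation 25/3) are assumptions made for this example.

```python
import math
from dataclasses import dataclass

# Commonly cited TrueSkill defaults (assumed here, not taken from the episode).
MU0 = 25.0            # prior mean skill
SIGMA0 = MU0 / 3      # prior uncertainty
BETA = SIGMA0 / 2     # performance noise per comparison
TAU = SIGMA0 / 100    # small drift added before each update


@dataclass
class Rating:
    mu: float = MU0
    sigma: float = SIGMA0


def _pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def _cdf(x: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def update(winner: Rating, loser: Rating) -> tuple[Rating, Rating]:
    """Two-player TrueSkill update with no draws (hypothetical helper)."""
    var_w = winner.sigma ** 2 + TAU ** 2
    var_l = loser.sigma ** 2 + TAU ** 2
    c = math.sqrt(2 * BETA ** 2 + var_w + var_l)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)      # how surprising the outcome was
    w = v * (v + t)            # how much uncertainty to remove
    new_winner = Rating(winner.mu + var_w / c * v,
                        math.sqrt(var_w * (1 - var_w / c ** 2 * w)))
    new_loser = Rating(loser.mu - var_l / c * v,
                       math.sqrt(var_l * (1 - var_l / c ** 2 * w)))
    return new_winner, new_loser


# One participant prefers hypothetical "model_a" over "model_b".
model_a, model_b = update(Rating(), Rating())
print(model_a, model_b)
```

Because each rating carries an uncertainty term as well as a mean, a leaderboard built this way can direct further comparisons toward the pairs of models whose relative order is still unclear, which is one plausible reading of the "efficient sampling" mentioned in the timestamps.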

