
Recurrence and Attention for Long-Context Transformers with Jacob Buckman - #750
An episode of The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) by TWIML.
Published Oct 7, 2025. Running time: 57:23.
Today, we're joined by Jacob Buckman, co-founder and CEO of Manifest AI, to discuss achieving long context in transformers. We discuss the bottlenecks of scaling context length and recent techniques to overcome them, including windowed attention, grouped query attention, and latent space attention. We explore the idea of weight-state balance and the weight-state FLOP ratio as a way of reasoning about the optimality of compute architectures, and we dig into the Power Retention architecture, which blends the parallelization of attention with the linear scaling of recurrence and promises speedups of more than 10x during training and more than 100x during inference. We also review Manifest AI’s recent open-source projects: Vidrial, a custom CUDA framework for building highly optimized GPU kernels in Python, and PowerCoder, a 3B-parameter coding model fine-tuned from StarCoder to use power retention. Our chat also covers the use of metrics like in-context learning curves and negative log likelihood to measure context utility, the implications of scaling laws, and the future of long context lengths in AI applications. The complete show notes for this episode can be found at
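The idea of blending "the parallelization of attention with the linear scaling of recurrence" can be illustrated with a toy sketch. It assumes a degree-2 power kernel (q·k)^2 with no normalization, gating, or chunking. These are simplifications for illustration, not details confirmed in the episode, and this is not Manifest AI's implementation. Under those assumptions, the same causal outputs can be computed in two ways: attention-style, by weighting every past position, or as a recurrence over a fixed-size state built from the tensor square of each key. All function names are hypothetical.

```python
import math
from itertools import product


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def tensor_square(x: list[float]) -> list[float]:
    # Degree-2 tensor power: all pairwise products x_i * x_j (d*d features),
    # so that dot(tensor_square(q), tensor_square(k)) == dot(q, k) ** 2.
    return [xi * xj for xi, xj in product(x, x)]


def parallel_power_attention(qs, ks, vs):
    # Attention-style form: output t sums v_s over all s <= t, weighted by
    # (q_t . k_s)^2. Total work grows quadratically with sequence length.
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * len(vs[0])
        for s in range(t + 1):
            w = dot(q, ks[s]) ** 2
            out = [o + w * v for o, v in zip(out, vs[s])]
        outputs.append(out)
    return outputs


def recurrent_power_attention(qs, ks, vs):
    # Recurrent form: a fixed-size state (d*d rows by d_v columns) accumulates
    # the outer product tensor_square(k_s) v_s^T; each step reads it with
    # tensor_square(q_t). Work per token does not depend on sequence length.
    d_v = len(vs[0]) if vs else 0
    state = None
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        phi_k = tensor_square(k)
        if state is None:
            state = [[0.0] * d_v for _ in phi_k]
        for i, f in enumerate(phi_k):
            for j in range(d_v):
                state[i][j] += f * v[j]
        phi_q = tensor_square(q)
        outputs.append(
            [sum(phi_q[i] * state[i][j] for i in range(len(phi_q))) for j in range(d_v)]
        )
    return outputs


if __name__ == "__main__":
    qs = [[1.0, 0.5], [0.2, -1.0], [0.3, 0.3]]
    ks = [[0.5, 1.0], [-1.0, 0.4], [0.7, 0.1]]
    vs = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
    par = parallel_power_attention(qs, ks, vs)
    rec = recurrent_power_attention(qs, ks, vs)
    # Both forms should agree up to floating-point rounding.
    agree = all(
        math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
        for pr, rr in zip(par, rec)
        for a, b in zip(pr, rr)
    )
    print("forms agree:", agree)
```

The sketch shows the trade-off. For a sequence of length T, the parallel form does O(T²) work. The recurrent form does a fixed amount of work per token, proportional to the state size d²·d_v. For a degree-p kernel, the naive tensor-power state has d^p rows, so the degree directly sets the size of the recurrent state.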