Benedikt Stroebl

I'm the CTO and cofounder of a company in the YC S25 batch.

Before that, I was a PhD student at Princeton University, advised by Arvind Narayanan.

My research focuses on AI agents, with an emphasis on making them more useful and reliable in the real world. Part of that is developing rigorous evaluation frameworks and studying the limitations of inference scaling techniques. I have also worked on making language models multilingual and faithful to the cultures behind each language.

[Google Scholar] [GitHub] [X]

Selected Publications

AI Agents That Matter Sayash Kapoor*, Benedikt Stroebl*, Zachary S. Siegel, Nitya Nadgir, Arvind Narayanan
Transactions on Machine Learning Research (2025)
Safety devolution in AI agents Cheng Yu, Benedikt Stroebl, Diyi Yang, Orestis Papakyriakopoulos
Conference on Neural Information Processing Systems (NeurIPS) (2025)
Dynamic Risk Assessments for Offensive Cybersecurity Agents Boyi Wei*, Benedikt Stroebl*, Jiacen Xu, Joie Zhang, Zhou Li, Peter Henderson
Conference on Neural Information Processing Systems (NeurIPS) (2025)
HAL: The Holistic Agent Leaderboard for Centralized and Reproducible Agent Evaluation Benedikt Stroebl*, Sayash Kapoor*, Arvind Narayanan
Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers Benedikt Stroebl*, Sayash Kapoor, Arvind Narayanan
arXiv preprint 2411.17501 (2024)
CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan
Transactions on Machine Learning Research (2025)
Localized Cultural Knowledge is Conserved and Controllable in Large Language Models Veniamin Veselovsky*, Berke Argin*, Benedikt Stroebl*, Chris Wendler, Robert West, James Evans, Thomas L. Griffiths, Arvind Narayanan
arXiv preprint 2504.10191 (2025)
Hindsight Merging: Diverse Data Generation with Language Models Veniamin Veselovsky*, Benedikt Stroebl*, Gianluca Bencomo*, Dilip Arumugam, Lisa Schut
The 41st Conference on Uncertainty in Artificial Intelligence (2025)
The Book of Life approach: Enabling richness and scale for life course research Mark D. Verhagen, Benedikt Stroebl, Tiffany Liu, Lydia T. Liu, Matthew J. Salganik
arXiv preprint 2507.03027 (2025)
Investigating vulnerabilities of GPS trip data to trajectory-user linking attacks Benedikt Stroebl*, Alexandra Kapp
Journal of Location Based Services (2025)

(* indicates equal contribution)

Blog Posts

Is AI progress slowing down? Making sense of recent technology trends and claims. Arvind Narayanan, Benedikt Stroebl, Sayash Kapoor
AI Snake Oil (2024)
AI leaderboards are no longer useful. It's time to switch to Pareto curves. Sayash Kapoor*, Benedikt Stroebl*, Arvind Narayanan
AI Snake Oil (2024)

Workshops

Workshop on Useful and Reliable AI Agents Princeton University. 600+ attendees. Virtual Workshop. August 2024.

Talks & Press

With Imperfect Verifiers, Scale Fails. Nnamdi Iregbulem from Lightspeed Venture Partners. Invited podcast. August 2025.
AI Agent Benchmarks Paper Club. AI Tinkerers. Invited talk. May 2025.
Building and evaluating AI agents that matter. AWS Applied Scientists. Invited talk. March 2025.
AI agents that matter. Weaviate Podcast. Podcast. September 2024.
AI agent benchmarks are misleading, study warns. VentureBeat. News article. June 2024.
The perils of evaluating AI agents. Meta (Core Applied Sciences). Invited talk. May 2024.