I'm a CS PhD student at Princeton University, advised by Arvind Narayanan. I co-founded a startup in the YC S25 batch, where we trained long-horizon agents for forecasting. My research focuses on agent evaluations and inference scaling across domains including AI for science and cybersecurity. I have also published on agent safety. These days, I am interested in long-horizon agents and continual learning.
AI Agents That Matter. Sayash Kapoor*, Benedikt Stroebl*, Zachary S. Siegel, Nitya Nadgir, Arvind Narayanan. TMLR 2025.
Safety Devolution in AI Agents. Cheng Yu, Benedikt Stroebl, Diyi Yang, Orestis Papakyriakopoulos. NeurIPS 2025.
Dynamic Risk Assessments for Offensive Cybersecurity Agents. Boyi Wei*, Benedikt Stroebl*, Jiacen Xu, Joie Zhang, Zhou Li, Peter Henderson. NeurIPS 2025.
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation. Sayash Kapoor*, Benedikt Stroebl*, + 29 more authors. ICLR 2026.
The Limits of Inference Scaling Through Resampling. Benedikt Stroebl*, Sayash Kapoor, Arvind Narayanan. ICLR 2026.
CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark. Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan. TMLR 2025.
Localized Cultural Knowledge is Conserved and Controllable in Large Language Models. Veniamin Veselovsky*, Berke Argin*, Benedikt Stroebl*, Chris Wendler, Robert West, James Evans, Thomas L. Griffiths, Arvind Narayanan. arXiv:2504.10191 (2025).
Hindsight Merging: Diverse Data Generation with Language Models. Veniamin Veselovsky*, Benedikt Stroebl*, Gianluca Bencomo*, Dilip Arumugam, Lisa Schut. UAI 2025.
The Book of Life Approach: Enabling Richness and Scale for Life Course Research. Mark D. Verhagen, Benedikt Stroebl, Tiffany Liu, Lydia T. Liu, Matthew J. Salganik. arXiv:2507.03027 (2025).
Investigating Vulnerabilities of GPS Trip Data to Trajectory-User Linking Attacks. Benedikt Stroebl*, Alexandra Kapp. Journal of Location Based Services (2025).
(* indicates equal contribution)
HAL Harness. A standardized evaluation harness for reproducible agent evaluations across benchmarks. Powered more than 21,730 agent rollouts across 9 models and 9 benchmarks spanning coding, web navigation, science, and customer service (~$40,000 in compute) for the official HAL paper.
Is AI progress slowing down? Making sense of recent technology trends and claims. Arvind Narayanan, Benedikt Stroebl, Sayash Kapoor. AI Snake Oil (2024).
AI leaderboards are no longer useful. It's time to switch to Pareto curves. Sayash Kapoor*, Benedikt Stroebl*, Arvind Narayanan. AI Snake Oil (2024).
Workshop on Useful and Reliable AI Agents. Princeton University. 600+ attendees. Virtual workshop. August 2024.
With Imperfect Verifiers, Scale Fails. Nnamdi Iregbulem from Lightspeed Venture Partners. Invited podcast. August 2025.
AI Agent Benchmarks Paper Club. AI Tinkerers. Invited talk. May 2025.
Building and evaluating AI agents that matter. AWS Applied Scientists. Invited talk. March 2025.
AI agents that matter. Weaviate Podcast. Podcast. September 2024.
AI agent benchmarks are misleading, study warns. VentureBeat. News article. June 2024.
The perils of evaluating AI agents. Meta (Core Applied Sciences). Invited talk. May 2024.