I'm the CTO and cofounder of a company that is part of YC S25. Before that, I was a PhD student at Princeton University, advised by Arvind Narayanan. My research focuses on making AI agents more useful and reliable in the real world. Part of that is developing rigorous evaluation frameworks and studying the limitations of inference scaling techniques. I have also worked on making models speak multiple languages and stay faithful to the underlying culture.
AI Agents That Matter
Sayash Kapoor*, Benedikt Stroebl*, Zachary S. Siegel, Nitya Nadgir, Arvind Narayanan
Transactions on Machine Learning Research (2025)
Safety devolution in AI agents
Cheng Yu, Benedikt Stroebl, Diyi Yang, Orestis Papakyriakopoulos
Conference on Neural Information Processing Systems (NeurIPS) (2025)
Dynamic Risk Assessments for Offensive Cybersecurity Agents
Boyi Wei*, Benedikt Stroebl*, Jiacen Xu, Joie Zhang, Zhou Li, Peter Henderson
Conference on Neural Information Processing Systems (NeurIPS) (2025)
HAL: The Holistic Agent Leaderboard for Centralized and Reproducible Agent Evaluation
Benedikt Stroebl*, Sayash Kapoor*, Arvind Narayanan
Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers
Benedikt Stroebl*, Sayash Kapoor, Arvind Narayanan
arXiv preprint 2411.17501 (2024)
CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark
Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan
Transactions on Machine Learning Research (2025)
Localized Cultural Knowledge is Conserved and Controllable in Large Language Models
Veniamin Veselovsky*, Berke Argin*, Benedikt Stroebl*, Chris Wendler, Robert West, James Evans, Thomas L. Griffiths, Arvind Narayanan
arXiv:2504.10191 (2025)
Hindsight Merging: Diverse Data Generation with Language Models
Veniamin Veselovsky*, Benedikt Stroebl*, Gianluca Bencomo*, Dilip Arumugam, Lisa Schut
The 41st Conference on Uncertainty in Artificial Intelligence (2025)
The Book of Life approach: Enabling richness and scale for life course research
Mark D. Verhagen, Benedikt Stroebl, Tiffany Liu, Lydia T. Liu, Matthew J. Salganik
arXiv preprint 2507.03027 (2025)
Investigating vulnerabilities of GPS trip data to trajectory-user linking attacks
Benedikt Stroebl*, Alexandra Kapp
Journal of Location Based Services (2025)
(* indicates equal contribution)
Is AI progress slowing down? Making sense of recent technology trends and claims.
Arvind Narayanan, Benedikt Stroebl, Sayash Kapoor
AI Snake Oil (2024)
AI leaderboards are no longer useful. It's time to switch to Pareto curves.
Sayash Kapoor*, Benedikt Stroebl*, Arvind Narayanan
AI Snake Oil (2024)
Workshop on Useful and Reliable AI Agents. Princeton University. 600+ attendees. Virtual workshop. August 2024.
With Imperfect Verifiers, Scale Fails. Nnamdi Iregbulem from Lightspeed Venture Partners. Invited Podcast. August 2025.
AI Agent Benchmarks Paper Club. AI Tinkerers. Invited talk. May 2025.
Building and evaluating AI agents that matter. AWS Applied Scientists. Invited talk. March 2025.
AI agents that matter. Weaviate Podcast. Podcast. September 2024.
AI agent benchmarks are misleading, study warns. VentureBeat. News article. June 2024.
The perils of evaluating AI agents. Meta (Core Applied Sciences). Invited talk. May 2024.