I'm a CS PhD student at Princeton University, advised by Arvind Narayanan. I co-founded a startup in the YC S25 batch, where we trained long-horizon agents for forecasting. My research focuses on agent evaluations and inference scaling across domains including AI for science and cybersecurity. I have also published on agent safety. These days, I am interested in long-horizon agents and continual learning.
AI Agents That Matter. Sayash Kapoor*, Benedikt Stroebl*, Zachary S. Siegel, Nitya Nadgir, Arvind Narayanan. TMLR 2025.
Safety Devolution in AI Agents. Cheng Yu, Benedikt Stroebl, Diyi Yang, Orestis Papakyriakopoulos. NeurIPS 2025.
Dynamic Risk Assessments for Offensive Cybersecurity Agents. Boyi Wei*, Benedikt Stroebl*, Jiacen Xu, Joie Zhang, Zhou Li, Peter Henderson. NeurIPS 2025.
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation. Sayash Kapoor*, Benedikt Stroebl*, + 29 more authors. ICLR 2026.
The Limits of Inference Scaling Through Resampling. Benedikt Stroebl*, Sayash Kapoor, Arvind Narayanan. ICLR 2026.
CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark. Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan. TMLR 2025.
Localized Cultural Knowledge is Conserved and Controllable in Large Language Models. Veniamin Veselovsky*, Berke Argin*, Benedikt Stroebl*, Chris Wendler, Robert West, James Evans, Thomas L. Griffiths, Arvind Narayanan. arXiv:2504.10191 (2025).
Hindsight Merging: Diverse Data Generation with Language Models. Veniamin Veselovsky*, Benedikt Stroebl*, Gianluca Bencomo*, Dilip Arumugam, Lisa Schut. UAI 2025.
The Book of Life Approach: Enabling Richness and Scale for Life Course Research. Mark D. Verhagen, Benedikt Stroebl, Tiffany Liu, Lydia T. Liu, Matthew J. Salganik. arXiv:2507.03027 (2025).
Investigating Vulnerabilities of GPS Trip Data to Trajectory-User Linking Attacks. Benedikt Stroebl*, Alexandra Kapp. Journal of Location Based Services (2025).
(* indicates equal contribution)
HAL Harness. A standardized evaluation harness for reproducible agent evaluations across benchmarks. Powered more than 21,730 agent rollouts across 9 models and 9 benchmarks spanning coding, web navigation, science, and customer service (~$40,000 in compute) for the official HAL paper.
Is AI progress slowing down? Making sense of recent technology trends and claims. Arvind Narayanan, Benedikt Stroebl, Sayash Kapoor. AI Snake Oil (2024).
AI leaderboards are no longer useful. It's time to switch to Pareto curves. Sayash Kapoor*, Benedikt Stroebl*, Arvind Narayanan. AI Snake Oil (2024).
Workshop on Useful and Reliable AI Agents. Princeton University. 600+ attendees. Virtual workshop. August 2024.
With Imperfect Verifiers, Scale Fails. Nnamdi Iregbulem from Lightspeed Venture Partners. Invited podcast. August 2025.
AI Agent Benchmarks Paper Club. AI Tinkerers. Invited talk. May 2025.
Building and evaluating AI agents that matter. AWS Applied Scientists. Invited talk. March 2025.
AI agents that matter. Weaviate Podcast. Podcast. September 2024.
AI agent benchmarks are misleading, study warns. VentureBeat. News article. June 2024.
The perils of evaluating AI agents. Meta (Core Applied Sciences). Invited talk. May 2024.