A pluralistic research environment where AI agent scientists actually work — so their performance is measured, not claimed.
Evaluating an AI scientist is not like evaluating an LLM. The work is not multiple-choice; it is multi-day, tool-using, long-horizon, and often partially observable. Scientific Evaluation combines a benchmark with an environment: curated tasks run inside a pluralistic research world (real arXiv, real code sandboxes, simulated wet-lab, simulated physics), testing AI agent scientists the way SWE-bench tests coding agents.
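To make the contrast with multiple-choice evaluation concrete, here is a minimal sketch of an outcome-checked, tool-using episode. It is an illustration only: `Task`, `EpisodeResult`, `run_episode`, and the toy `measure` tool are hypothetical names invented for this example, not the API of Scientific Evaluation or any other benchmark. The agent sees only what its tool calls return (partial observability), has a step budget (long horizon), and is scored on whether its final submission passes a check, loosely in the spirit of how SWE-bench scores a patch by running tests.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration; not a real benchmark API.
Step = Tuple[str, str, str]  # (action, argument, observation)
Agent = Callable[[str, List[Step]], Tuple[str, str]]


@dataclass
class Task:
    prompt: str
    tools: Dict[str, Callable[[str], str]]  # tool name -> callable
    check: Callable[[str], bool]            # outcome check on the final answer
    max_steps: int = 20                     # step budget for the episode


@dataclass
class EpisodeResult:
    solved: bool
    steps: int
    trace: List[Step] = field(default_factory=list)


def run_episode(task: Task, agent: Agent) -> EpisodeResult:
    """Run one episode: the agent acts until it submits or the budget runs out."""
    trace: List[Step] = []
    for step in range(1, task.max_steps + 1):
        # The agent only sees the prompt and its own past observations.
        action, arg = agent(task.prompt, trace)
        if action == "submit":
            return EpisodeResult(task.check(arg), step, trace)
        tool = task.tools.get(action)
        observation = tool(arg) if tool else f"unknown tool: {action}"
        trace.append((action, arg, observation))
    # Budget exhausted without a submission counts as a failure.
    return EpisodeResult(False, task.max_steps, trace)


def scripted_agent(prompt: str, trace: List[Step]) -> Tuple[str, str]:
    """Toy agent: measure once, then submit whatever was observed."""
    if not trace:
        return ("measure", "sample-1")
    return ("submit", trace[-1][2])


task = Task(
    prompt="Report the pH of sample-1.",
    tools={"measure": lambda sample: "7.4"},  # stand-in for a simulated instrument
    check=lambda answer: answer == "7.4",
)
result = run_episode(task, scripted_agent)
print(result.solved, result.steps)  # True 2
```

The score here depends on the outcome of the whole trajectory rather than on any single response, which is the property that separates this style of evaluation from a static question set. A real environment would replace the lambda tools with sandboxes or simulators and would likely log far richer traces.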
SciAgentGym provides a scalable scientific tool-use environment with 1,780 domain-specific tools, a tiered benchmark for long-horizon agent evaluation, and SciForge for synthesizing logic-aware training trajectories to advance autonomous scientific agents.
LabUtopia is the first comprehensive laboratory-scale embodied intelligence platform that unifies multi-physics simulation, chemically meaningful interactions, procedural scientific scene generation, and hierarchical long-horizon benchmarks to push scientific agents from simple manipulation toward generalizable experimental reasoning.