What is REAL Bench?

REAL Bench (REAL) is a public, high‑fidelity benchmark and testbed that provides a mini‑Internet of replicated popular websites for standardized evaluation of web AI agents, together with a public leaderboard.

What scope and scale does REAL Bench include?

REAL Bench provides sandboxed replicas of 11 popular websites and defines 112 standardized tasks for evaluating web agents against them.

What metric and evaluation outputs does REAL Bench provide?

REAL Bench reports a REAL Score (overall task success rate), scores individual tasks with reward functions that check retrieval and action outcomes, including state‑diff checks against the replica sites, and publishes submitted agent results on a public leaderboard.
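
As a rough illustration (not the official scoring code), the REAL Score reduces to a task success rate: the fraction of the 112 tasks whose reward function reports success. The sketch below assumes a hypothetical `TaskResult` record and `real_score` helper; the names and structure are illustrative, not part of the AGI SDK.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    success: bool  # outcome of the task's reward function (retrieval/action/state-diff check)

def real_score(results: list[TaskResult]) -> float:
    """Compute a REAL-style score as the overall task success rate (0-100%)."""
    if not results:
        return 0.0
    return 100.0 * sum(r.success for r in results) / len(results)

# Example: 46 of 112 tasks pass, giving a score of ~41.1%.
results = [TaskResult(f"task-{i}", i < 46) for i in range(112)]
print(f"REAL Score: {real_score(results):.1f}%")
```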

How can researchers reproduce REAL Bench evaluations?

Researchers can reproduce REAL Bench evaluations using the evaluation harness available via the AGI SDK and by submitting results to the realevals.xyz leaderboard.
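
A minimal sketch of a reproduction run with the AGI SDK's harness is shown below; the package name (`agisdk`), the `REAL.harness(...)` entry point, and the parameter names are assumptions based on the SDK's public materials, so consult the current documentation before relying on them.

```python
# Hypothetical reproduction script -- names and parameters are assumptions,
# not a verified transcription of the AGI SDK's API.
from agisdk import REAL

# Configure the harness with the agent/model to evaluate; a leaderboard flag
# (if supported) would register results for submission to realevals.xyz.
harness = REAL.harness(
    model="gpt-4o",      # illustrative model identifier
    headless=True,       # run the replica sites without a visible browser
    leaderboard=False,   # flip to True to submit scores to the public leaderboard
)

# Run the standardized tasks and collect per-task outcomes.
results = harness.run()
print(results)
```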

What published REAL Bench model scores are mentioned?

The REAL Bench blog reports example REAL Scores including Claude‑3.7 Sonnet at 41.1%, Gemini‑2.5‑Pro at 38.4%, and AGI’s internal Agent‑0 at approximately 45%.

What AGI‑0 product benchmark results are presented as a case study?

AGI, Inc. presents AGI‑0 benchmark results showing browser control at 99.8% (tested on 1,000 shopping tasks via the REAL evals clone), mobile control at 97.4%, and multi‑OS/computer control at approximately 76.2%.

Does AGI, Inc. publish an open benchmarking leaderboard online?

Yes, REAL Bench has a public leaderboard hosted at realevals.xyz where submitted agent results and scores are published.

What public materials accompany REAL Bench for reproducibility?

REAL Bench provides an evaluation harness through the AGI SDK, a public leaderboard at realevals.xyz, and links to a paper and reproducible artifacts referenced from the blog.

Does AGI, Inc. publish benchmark task counts or site replicas for REAL Bench?

Yes; REAL Bench is described as including 11 replica sites and 112 standardized tasks in its benchmark suite.

What audiences is REAL Bench intended for?

REAL Bench is intended for AI researchers, developer teams, benchmarkers, and organizations evaluating agent performance on real‑web tasks.