Agent-native evaluation

Benchmarks for real AI agents, not just static prompts.

Starting with OpenClaw's Agent Memory benchmark suites: recall, privacy, lifecycle, scale, retrieval, runtime hooks, and live prompt-context injection.
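To make the suite areas concrete, here is a rough sketch of what a single recall case might look like. The field names are illustrative assumptions, not the published OpenClaw fixture schema.

```ts
// Hypothetical shape of a single recall case; these field names are
// illustrative, not the actual OpenClaw fixture schema.
interface RecallCase {
  id: string;                 // stable case identifier
  setup: string[];            // facts written to agent memory before the probe
  probe: string;              // the question the agent is asked afterwards
  expected: string[];         // substrings a correct answer must contain
  mustNotContain?: string[];  // e.g. credentials that must never surface
}

const exampleCase: RecallCase = {
  id: "recall-example",
  setup: ["The user's preferred editor is Helix."],
  probe: "Which editor does the user prefer?",
  expected: ["Helix"],
};
```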

Catalog

Benchmark suites

ID | Suite | Focus | Tags | Cases | Score | Status
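As a rough guide to the columns above, one catalog row could be represented by a record like the following. The field names and status values are assumptions, not the published schema.

```ts
// Illustrative shape for one catalog row, mirroring the column headers above.
// Field names and status values are assumptions, not the published schema.
interface SuiteRecord {
  id: string;                              // e.g. "memory-recall"
  suite: string;                           // human-readable suite name
  focus: string;                           // one-line summary of what is measured
  tags: string[];                          // e.g. ["memory", "privacy"]
  cases: number;                           // number of cases in the suite
  score: number | null;                    // latest aggregate score, null before any run
  status: "draft" | "active" | "archived"; // lifecycle state of the suite
}
```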

Methodology

Local-first and evidence-backed

  • Every suite links back to JSON fixtures and report artifacts.
  • Deterministic checks are preferred unless an external judge is explicitly configured (see the sketch after this list).
  • Privacy, source isolation, and credential exclusion are first-class gates.
  • Live runtime checks are separated from controlled harness checks.
Open full methodology →
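A minimal sketch of that deterministic-check path, assuming fixtures shaped like the recall case above and an agent invocation stubbed out as a hypothetical runAgent function; the file paths and names are placeholders, not the real harness API.

```ts
import { readFileSync, writeFileSync } from "node:fs";

// Re-declare the recall-case shape from the earlier sketch so this file stands alone.
type RecallCase = {
  id: string;
  setup: string[];
  probe: string;
  expected: string[];
  mustNotContain?: string[];
};

type Verdict = { id: string; pass: boolean; reasons: string[] };

// Placeholder for the agent under test; a real harness would drive the agent here.
function runAgent(setup: string[], probe: string): string {
  void setup;
  void probe;
  return "";
}

// Deterministic pass/fail: no judge model, just substring containment and exclusion.
function checkCase(answer: string, c: RecallCase): Verdict {
  const reasons: string[] = [];
  for (const s of c.expected) {
    if (!answer.includes(s)) reasons.push(`missing expected content: ${s}`);
  }
  for (const s of c.mustNotContain ?? []) {
    if (answer.includes(s)) reasons.push(`leaked excluded content: ${s}`);
  }
  return { id: c.id, pass: reasons.length === 0, reasons };
}

// Load JSON fixtures, evaluate each case, and write a report artifact.
const cases: RecallCase[] = JSON.parse(readFileSync("fixtures/recall.json", "utf8"));
const verdicts = cases.map((c) => checkCase(runAgent(c.setup, c.probe), c));
writeFileSync("reports/recall.json", JSON.stringify(verdicts, null, 2));
```

A live runtime check would replace the stubbed runAgent call with a real agent session, which is why those results are reported separately from this controlled-harness path.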

Roadmap

From catalog to leaderboard

  • Add artifact checksums and verified-run badges (see the checksum sketch below).
  • Add model/agent leaderboard records.
  • Add submission flow for externally generated reports.
  • Add historical trends and run comparisons.
Open leaderboard preview →
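One plausible way the planned artifact checksums could be computed: hash the report artifact so a verified-run badge can later confirm it was not modified. The file path and output format here are assumptions, not a committed design.

```ts
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Hash a report artifact so a verified-run badge can later confirm
// the file has not been modified since the run was recorded.
const report = readFileSync("reports/recall.json");
const checksum = createHash("sha256").update(report).digest("hex");
console.log(`sha256:${checksum}`);
```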