Agent-native evaluation

Benchmarks for real AI agents, not just static prompts.

Starting with OpenClaw's Agent Memory benchmark suites: recall, privacy, lifecycle, scale, retrieval, runtime hooks, and live prompt-context injection.
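To make the suite areas concrete, here is a rough sketch of what a single recall case might look like. The field names are illustrative assumptions, not the published OpenClaw fixture schema.

```ts
// Hypothetical shape of a single recall case; these field names are
// illustrative, not the actual OpenClaw fixture schema.
interface RecallCase {
  id: string;                 // stable case identifier
  setup: string[];            // facts written to agent memory before the probe
  probe: string;              // the question the agent is asked afterwards
  expected: string[];         // substrings a correct answer must contain
  mustNotContain?: string[];  // e.g. credentials that must never surface
}

const exampleCase: RecallCase = {
  id: "recall-example",
  setup: ["The user's preferred editor is Helix."],
  probe: "Which editor does the user prefer?",
  expected: ["Helix"],
};
```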

Catalog

Benchmark suites

ID | Suite | Focus | Tags | Cases | Score | Status
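As a rough guide to the columns above, one catalog row could be represented by a record like the following. The field names and status values are assumptions, not the published schema.

```ts
// Illustrative shape for one catalog row, mirroring the column headers above.
// Field names and status values are assumptions, not the published schema.
interface SuiteRecord {
  id: string;                              // e.g. "memory-recall"
  suite: string;                           // human-readable suite name
  focus: string;                           // one-line summary of what is measured
  tags: string[];                          // e.g. ["memory", "privacy"]
  cases: number;                           // number of cases in the suite
  score: number | null;                    // latest aggregate score, null before any run
  status: "draft" | "active" | "archived"; // lifecycle state of the suite
}
```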

Methodology

Local-first and evidence-backed

  • Every suite links back to JSON fixtures and report artifacts.
  • Deterministic checks are preferred unless an external judge is explicitly configured (see the sketch after this list).
  • Privacy, source isolation, and credential exclusion are first-class gates.
  • Live runtime checks are separated from controlled harness checks.
Open full methodology →
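A minimal sketch of that deterministic-check path, assuming fixtures shaped like the recall case above and an agent invocation stubbed out as a hypothetical runAgent function; the file paths and names are placeholders, not the real harness API.

```ts
import { readFileSync, writeFileSync } from "node:fs";

// Re-declare the recall-case shape from the earlier sketch so this file stands alone.
type RecallCase = {
  id: string;
  setup: string[];
  probe: string;
  expected: string[];
  mustNotContain?: string[];
};

type Verdict = { id: string; pass: boolean; reasons: string[] };

// Placeholder for the agent under test; a real harness would drive the agent here.
function runAgent(setup: string[], probe: string): string {
  void setup;
  void probe;
  return "";
}

// Deterministic pass/fail: no judge model, just substring containment and exclusion.
function checkCase(answer: string, c: RecallCase): Verdict {
  const reasons: string[] = [];
  for (const s of c.expected) {
    if (!answer.includes(s)) reasons.push(`missing expected content: ${s}`);
  }
  for (const s of c.mustNotContain ?? []) {
    if (answer.includes(s)) reasons.push(`leaked excluded content: ${s}`);
  }
  return { id: c.id, pass: reasons.length === 0, reasons };
}

// Load JSON fixtures, evaluate each case, and write a report artifact.
const cases: RecallCase[] = JSON.parse(readFileSync("fixtures/recall.json", "utf8"));
const verdicts = cases.map((c) => checkCase(runAgent(c.setup, c.probe), c));
writeFileSync("reports/recall.json", JSON.stringify(verdicts, null, 2));
```

A live runtime check would replace the stubbed runAgent call with a real agent session, which is why those results are reported separately from this controlled-harness path.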

Roadmap

From catalog to leaderboard

  • Add artifact checksums and verified-run badges (see the checksum sketch below).
  • Add model/agent leaderboard records.
  • Add submission flow for externally generated reports.
  • Add historical trends and run comparisons.
Open leaderboard preview →
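One plausible way the planned artifact checksums could be computed: hash the report artifact so a verified-run badge can later confirm it was not modified. The file path and output format here are assumptions, not a committed design.

```ts
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Hash a report artifact so a verified-run badge can later confirm
// the file has not been modified since the run was recorded.
const report = readFileSync("reports/recall.json");
const checksum = createHash("sha256").update(report).digest("hex");
console.log(`sha256:${checksum}`);
```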