A CLI, three SDKs, one canonical run format. Ship an attested score in about three minutes.
Posting a test is a single HTTP request. No CLI required, no uploads, no dashboard to open. You call POST /v1/run with a (service, model, benchmark) tuple; we run it on a staked attestor, commit the Merkle root, submit the proof to Aligned Layer, and settle on Ethereum L1. The response includes a verify_url that’s live within ~3 minutes of proof verification.
curl -X POST https://api.benchlist.ai/v1/run \
  -H "Authorization: Bearer $BENCHLIST_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "service": "anthropic-claude",
    "model": "claude-opus-4-7",
    "benchmark": "mbpp",
    "runs": 3
  }'
# → 202 Accepted
# {
#   "run_id": "run-8f3a...",
#   "status": "queued",
#   "est_seconds": 180,
#   "charge": { "credits": 1, "usd": 5.00 },
#   "verify_url": "https://benchlist.ai/verify/run-8f3a..."
# }
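For scripts that cannot shell out to curl, the same request can be assembled with Python's standard library. This is a minimal sketch, not the official SDK: the helper names `build_run_request` and `parse_accepted` are hypothetical, and the only response fields it touches are the ones shown in the 202 example above.

```python
import json
import os
import urllib.request

API_URL = "https://api.benchlist.ai/v1/run"  # endpoint taken from the curl example


def build_run_request(api_key, service, model, benchmark, runs):
    """Build (but do not send) the POST /v1/run request."""
    body = json.dumps({
        "service": service,
        "model": model,
        "benchmark": benchmark,
        "runs": runs,
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def parse_accepted(raw_body):
    """Extract run_id, status and verify_url from a 202 response body."""
    payload = json.loads(raw_body)
    return payload["run_id"], payload["status"], payload["verify_url"]


if __name__ == "__main__":
    req = build_run_request(
        os.environ.get("BENCHLIST_KEY", "bl_live_placeholder"),
        "anthropic-claude", "claude-opus-4-7", "mbpp", 3,
    )
    print(req.get_method(), req.full_url)
    # Sending the request is billed, so it is left commented out:
    # with urllib.request.urlopen(req) as resp:
    #     print(parse_accepted(resp.read()))
```

Keeping request construction separate from sending makes the payload easy to inspect in a dry run before any credits are spent.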
That’s the whole flow. The run’s status advances through queued → running → committed → proving → verified. Subscribe to a webhook on run.verified to ship badges or trigger downstream jobs. Each run deducts $5 from your credit balance, and that price includes the Ethereum mainnet gas used to settle your proof.
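Client code that tracks a run usually needs to know where a status sits in that sequence. The sketch below encodes the five statuses in the order listed above. It assumes statuses only ever move forward and models no failure states, since none are described here; the function names are hypothetical.

```python
# Status order as listed in the text above.
STATUS_ORDER = ("queued", "running", "committed", "proving", "verified")


def stage_index(status):
    """Return the position of a status in the pipeline; raise on unknown values."""
    try:
        return STATUS_ORDER.index(status)
    except ValueError:
        raise ValueError(f"unknown status: {status!r}") from None


def is_forward_step(previous, current):
    """True if `current` is the same stage as `previous` or a later one.

    Assumption: a run never moves backwards through the stages.
    """
    return stage_index(current) >= stage_index(previous)


def is_done(status):
    """A run is finished once it reaches the last listed stage."""
    return status == STATUS_ORDER[-1]
```

A poller or webhook consumer can use `is_forward_step` to discard stale or out-of-order updates before acting on them.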
Prefer a typed SDK? We ship pip install benchlist, npm i @benchlist/sdk, and a Go client — see /sdk. Prefer a CLI for CI wiring? Keep reading.
Free to sign up. Email verification only — no card on signup, no activation fee, no subscription. Drop your email at /submit and we mail you a bl_live_… Bearer key. Your first attested test is on us.
$5 per attested test. Top up at any time with a credit pack (up to 33% off at volume). Two payment paths, same outcome:
# Export the key and re-use anywhere
export BENCHLIST_KEY=bl_live_...
Rotate keys with POST /v1/keys/rotate. Issue scoped sub-keys per environment. Full auth reference: /api#auth.
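In application code, it helps to fail early when the key is missing or malformed. The helper below is a sketch under assumptions: `auth_headers` is a hypothetical name, and the `bl_live_` prefix check reflects only the key format mentioned above. Scoped sub-keys may use a different prefix, so the check can be disabled by passing an empty `expected_prefix`.

```python
import os


def auth_headers(env=None, expected_prefix="bl_live_"):
    """Build the Authorization header from the BENCHLIST_KEY variable.

    `env` defaults to os.environ; pass a plain dict in tests.
    """
    env = os.environ if env is None else env
    key = env.get("BENCHLIST_KEY", "").strip()
    if not key:
        raise RuntimeError("BENCHLIST_KEY is not set")
    if expected_prefix and not key.startswith(expected_prefix):
        # Assumption: live keys start with this prefix, as in the example above.
        raise RuntimeError(f"BENCHLIST_KEY does not start with {expected_prefix!r}")
    return {"Authorization": f"Bearer {key}"}
```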
The reference runner is a single pipx-installable Python package. It wraps the benchmark runner, the committer (Merkle/hash), and the Aligned submitter.
pipx install benchlist-runner
# or install the CLI globally with npm
npm i -g @benchlist/cli
Verify:
benchlist --version
# benchlist-runner 1.0.2 (sp1 v4.2.3, aligned-sdk v2.1.0)
Say you want to benchmark your LLM provider on MBPP.
export ANTHROPIC_API_KEY=sk-ant-...
benchlist run mbpp \
  --service anthropic-claude \
  --model claude-opus-4-7 \
  --runs 3 \
  --out claude-mbpp.json
The runner will:
~/.benchlist/datasets/)
run.json
Two options. CLI publishes directly; web lets you paste JSON.
benchlist commit claude-mbpp.json
benchlist prove claude-mbpp.json --system sp1
benchlist submit claude-mbpp.json --network ethereum
# → batch_id: 0x3c5d...9a1b (waiting for verification...)
# → verified at block 22184921
benchlist publish claude-mbpp.json
# → https://benchlist.ai/verify/run-claude-mbpp-001
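For CI wiring, the four CLI steps above can be chained from a small script that stops at the first failing step. This sketch uses only the subcommands and flags shown above. It assumes the CLI exits non-zero on failure, which is not stated here, and the function names are hypothetical.

```python
import subprocess


def pipeline_commands(run_file, system="sp1", network="ethereum"):
    """Return the commit, prove, submit, publish commands as argv lists."""
    return [
        ["benchlist", "commit", run_file],
        ["benchlist", "prove", run_file, "--system", system],
        ["benchlist", "submit", run_file, "--network", network],
        ["benchlist", "publish", run_file],
    ]


def run_pipeline(run_file, runner=subprocess.run):
    """Run each step in order.

    With the default runner, check=True raises CalledProcessError on a
    non-zero exit code, so later steps never run after a failure.
    """
    for argv in pipeline_commands(run_file):
        runner(argv, check=True)


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for argv in pipeline_commands("claude-mbpp.json"):
        print(" ".join(argv))
```

Passing `runner` as a parameter lets a test substitute a fake and check the command order without invoking the real CLI.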
Paste the output of benchlist prove into /submit. We verify the proof against Aligned's batch explorer and publish within 2 minutes.
A service is an AI-adjacent product: an LLM API, a memory substrate, a code agent, a vector DB, etc. Each service has a stable ID (slug), a category, metadata, and a JSON schema.
Services don't host benchmark runs directly — runs reference the service by ID. This lets you update the service description or URL without invalidating historical scores.
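The reference-by-ID design can be illustrated with two small records. This is a sketch, not the actual data model: the field names, the `llm-api` category value, and the example URLs are hypothetical. The point is that a run stores only the service slug, so editing the service's metadata leaves the run untouched.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Service:
    slug: str                                     # stable ID
    category: str
    metadata: dict = field(default_factory=dict)
    schema: dict = field(default_factory=dict)    # JSON schema


@dataclass(frozen=True)
class Run:
    run_id: str
    service_slug: str   # a reference by ID, not an embedded copy of the service
    model: str
    benchmark: str


def update_metadata(registry, slug, **changes):
    """Return a new registry with one service's metadata updated."""
    service = registry[slug]
    updated = replace(service, metadata={**service.metadata, **changes})
    return {**registry, slug: updated}


registry = {
    "anthropic-claude": Service("anthropic-claude", "llm-api", {"url": "https://old.example"}),
}
run = Run("run-001", "anthropic-claude", "claude-opus-4-7", "mbpp")

# Changing the URL does not touch the run; it still resolves through the slug.
registry = update_metadata(registry, "anthropic-claude", url="https://new.example")
```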
A benchmark suite is defined by two hashes:
datasetHash: SHA-256 of the canonical evaluation set
methodologyHash: SHA-256 of the runner repo at a specific commit
Change either, and you've created a new version of the benchmark. Old runs don't transfer. This prevents silent benchmark drift.
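The two-hash versioning rule can be sketched as follows. The serialization of the evaluation set and the way a repo at a given commit is reduced to bytes are not specified here, so this example simply takes byte strings as input. `BenchmarkVersion`, `make_version`, and `comparable` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class BenchmarkVersion:
    dataset_hash: str       # corresponds to datasetHash
    methodology_hash: str   # corresponds to methodologyHash


def make_version(dataset_bytes: bytes, methodology_bytes: bytes) -> BenchmarkVersion:
    """Derive a version from the two inputs (assumed already canonicalized)."""
    return BenchmarkVersion(sha256_hex(dataset_bytes), sha256_hex(methodology_bytes))


def comparable(a: BenchmarkVersion, b: BenchmarkVersion) -> bool:
    """Runs are comparable only when both hashes match."""
    return a == b
```

Changing a single byte in either input yields a different `BenchmarkVersion`, which is what makes drift visible instead of silent.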
A run is a specific (service, model, config) executed against a specific benchmark suite. Every run produces:
The commitment is what actually gets signed and submitted to Aligned.
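To show how a single commitment can cover many per-item results, here is a minimal Merkle root sketch. The construction is an assumption, not the runner's documented scheme: it hashes leaves with SHA-256, pairs nodes level by level, and duplicates the last node when a level has an odd count. A production scheme would typically also add domain separation between leaf and internal hashes.

```python
import hashlib


def _h(data: bytes) -> bytes:
    """Raw SHA-256 digest."""
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Return the hex Merkle root of a non-empty list of byte strings."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        # Hash each adjacent pair to form the next level up.
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()
```

Because every leaf feeds into the root, altering any single result after the fact produces a different root than the one that was committed.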
An attestor is a runner that executes benchmarks and signs results. The reference attestor (benchlist-runner-0) is operated by Benchlist itself, but anyone can join the registry by:
benchlist attestor init: generates an Ed25519 keypair
PUT /attestors request with their pubkey + metadata
Misconduct (upheld disputes) slashes the stake.
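A registration payload for the PUT /attestors request might be assembled as in the sketch below. The body shape is an assumption: the `pubkey` and `metadata` field names mirror the wording above, the hex encoding is a guess, and the integration spec remains the authority on the wire format. The one firm fact used is that an Ed25519 public key is 32 bytes long.

```python
import json

ED25519_PUBKEY_LEN = 32  # Ed25519 public keys are 32 bytes


def attestor_registration_body(pubkey: bytes, metadata: dict) -> str:
    """Serialize a hypothetical PUT /attestors body as JSON."""
    if len(pubkey) != ED25519_PUBKEY_LEN:
        raise ValueError(f"expected {ED25519_PUBKEY_LEN} bytes, got {len(pubkey)}")
    # Hex encoding is an assumption; the real API may expect another encoding.
    return json.dumps({"pubkey": pubkey.hex(), "metadata": metadata}, sort_keys=True)
```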
Aligned is a proof aggregation network that settles on Ethereum L1. Every commitment produced by a runner is packaged as a proof, submitted to Aligned's operator set, and verified on-chain. Once verified, the batch ID becomes the listing's credential.
See the integration spec for wire format.