About

Why this exists.

Every AI vendor claims to be state of the art. Most never show their work. Benchlist makes "trust me" obsolete.

The problem

In 2026 the AI tooling space is saturated. Thirty memory providers, twenty code agents, ten vector databases, half a dozen frontier LLMs — each with polished marketing and claimed benchmark numbers. Buyers can't tell who's right.

Self-reported numbers are a race to the bottom. Pick a favorable subset, tune to the eval, publish a blog post. When someone else's run contradicts yours, you each accuse the other of running it wrong.

The fix

Give every benchmark score a cryptographic paper trail (minimal sketches of the pinning and Merkle steps follow the list):

  1. Pin the dataset. SHA-256. Anyone can re-hash.
  2. Pin the runner. Git commit. Anyone can re-run.
  3. Merkleize every transcript. Change one character, the root changes.
  4. Prove the scoring function over the commitment. ZK, on Aligned Layer.
  5. Batch-verify on Ethereum L1. Now the score is tamper-evident forever.
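
To make steps 1 and 2 concrete, here is a minimal Python sketch of the checks an independent re-run might perform before executing anything. The file name dataset.jsonl and both pinned values are hypothetical placeholders for illustration, not values published by Benchlist.

    import hashlib
    import subprocess

    # Hypothetical pins. Real values would come from the Benchlist
    # entry for the service under test.
    PINNED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"
    PINNED_COMMIT = "0123456789abcdef0123456789abcdef01234567"

    def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
        """Stream a file through SHA-256 and return its hex digest."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def head_commit() -> str:
        """Return the checked-out commit of the runner repository."""
        out = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()

    if sha256_file("dataset.jsonl") != PINNED_SHA256:
        raise SystemExit("dataset does not match the pinned SHA-256")
    if head_commit() != PINNED_COMMIT:
        raise SystemExit("runner is not at the pinned commit")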
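Step 3 can be sketched the same way. The pairing convention below (hash each transcript, duplicate the last node on odd-sized levels) is one common construction chosen for illustration; Benchlist's actual leaf encoding and domain separation may differ. The two toy transcripts stand in for full run transcripts.

    import hashlib

    def _h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_root(leaves: list[bytes]) -> bytes:
        """Binary Merkle root; duplicates the last node on odd-sized levels."""
        if not leaves:
            return _h(b"")
        level = [_h(leaf) for leaf in leaves]
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])
            level = [_h(level[i] + level[i + 1])
                     for i in range(0, len(level), 2)]
        return level[0]

    transcripts = [b"Q: capital of France? A: Paris", b"Q: 2+2? A: 4"]
    root = merkle_root(transcripts)

    # Change one character ("Paris" -> "Parts") and the root changes.
    tampered = [b"Q: capital of France? A: Parts", b"Q: 2+2? A: 4"]
    assert merkle_root(tampered) != root

Anyone holding the transcripts can recompute the root and compare it against the committed one; the assert at the end is the "change one character, the root changes" property from step 3.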

What we're not

Not a payment rail — we don't take a cut of your revenue. Not an LLM-judging service — we use pinned upstream judges. Not a content moderation system — we moderate listings for spam, not for opinion.

Who built it

Benchlist is an independent project. We got tired of comparing LongMemEval numbers without knowing which judge each lab used, so we built the verification layer we wanted to see.

What's next

Expand to 200 services. Add three more benchmark suites per quarter. Ship the Aligned-native attestor runner. Open the dispute protocol. See the changelog.

Contribute

The runner, the attestor code, the web frontend — all MIT-licensed on GitHub. Issues, PRs, new benchmark proposals welcome.