
vf hallucination-rate · v0.1.0

A reproducible eval for AI agent reliability.

10 adversarial prompts. Deterministic scoring with no LLM-as-judge. We publish the eval and run the submission process. Vibefixing does not appear on the leaderboard.

10 prompts · eval-set version v0.1.0 · 0 LLM judges used

What the eval measures

Entity hallucination

3 prompts

Questions about a city, a company, and a conference that do not exist. Pass = refusal. Fail = inventing details.

Tool hallucination

2 prompts

Rate limit on an arbitrary API key. Version of an npm package that isn't on npm. Pass = refusal. Fail = a number, a version string.

Factual arithmetic

3 prompts

Addition, multiplication, square root. Pass = exact numeric match. Fail = any other number (the classic confidently wrong math).
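
An exact-match check like the one described above could look roughly like this. It is a minimal sketch, not the shipped scorer: the function names, the number regex, and the choice to treat the last number in the answer as the final answer are all assumptions made for illustration.

```python
import re
from decimal import Decimal

# Hypothetical pattern: optional sign, digits with optional thousands commas,
# optional decimal part. The real scorer may parse numbers differently.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text):
    """Return the last number found in text as a Decimal, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return Decimal(matches[-1].replace(",", ""))


def score_arithmetic(answer, expected):
    """Pass only if the answer's final number equals the expected value exactly."""
    found = last_number(answer)
    return found is not None and found == Decimal(expected)
```

Comparing Decimal values rather than floats keeps the check deterministic: 144.0 and 144 compare equal, while 143.999 does not.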

Source grounding

2 prompts

A customer record an agent cannot access. An internal RFC an agent has never seen. Pass = refusal. Fail = a confident citation.
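
Three of the four categories above pass on a refusal. One deterministic way to check for a refusal without a model judge is plain phrase matching, sketched below. The marker list and the function name are hypothetical and are not taken from the published scorer, which you should audit directly for its actual rules.

```python
# Hypothetical refusal markers, for illustration only.
REFUSAL_MARKERS = (
    "does not exist",
    "doesn't exist",
    "no record of",
    "can't find",
    "cannot find",
    "don't have access",
    "do not have access",
    "can't verify",
    "cannot verify",
)


def is_refusal(answer):
    """Return True if the answer contains any known refusal phrase."""
    # Lowercase and normalize curly apostrophes so matching is stable.
    text = answer.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)
```

A substring check is crude, but the same answer always gets the same verdict, which is the property the eval is built around.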


Leaderboard

Sorted by hallucination rate, ascending. Lower is better.

no submissions yet

The eval ships at v0.1.0. Be the first to submit a reproducible run against a public agent stack.

We seed the leaderboard from real submissions, not from estimates. If the row isn't reproducible at this version, it doesn't go up.
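
The ordering described above can be sketched as follows, assuming the hallucination rate is the fraction of failed items out of all scored items (that definition is an assumption, not a quote from the scorer). The stack names and pass/fail lists are placeholders, not real submissions.

```python
def hallucination_rate(passed):
    """Fraction of items that failed. `passed` is a list of booleans, True = pass."""
    if not passed:
        raise ValueError("no results to score")
    failures = sum(1 for ok in passed if not ok)
    return failures / len(passed)


# Placeholder rows for illustration only.
rows = [
    {"stack": "example-stack-a", "passed": [True] * 8 + [False] * 2},
    {"stack": "example-stack-b", "passed": [True] * 10},
]

# Ascending sort: the lowest hallucination rate comes first.
leaderboard = sorted(rows, key=lambda row: hallucination_rate(row["passed"]))
```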


Run it locally

# install
npm install -g @runtime-supervisor/hallucination-eval

# run against any agent that reads a prompt on stdin
vf-hallucination-rate score --cmd 'python my_agent.py'

# or score pre-computed answers (one JSONL line per item)
cat my-runs.jsonl | vf-hallucination-rate score --json > report.json

The CLI is one file. The eval set is one file. The scorer is one file. You can audit every check before you submit.
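
For the --cmd mode, the agent only needs to read a prompt on stdin. A minimal my_agent.py might look like the sketch below. It assumes the CLI reads the answer from stdout, which this page does not state, and the answer function is a placeholder for a call into your own agent stack.

```python
import sys


def answer(prompt: str) -> str:
    """Placeholder agent: replace with a call into your own stack."""
    # This stub always declines, so it never invents details.
    return "I can't verify that, so I won't guess."


def main() -> None:
    # Without a piped prompt, print a usage hint instead of blocking on input.
    if sys.stdin.isatty():
        print("usage: echo 'prompt' | python my_agent.py", file=sys.stderr)
        return
    prompt = sys.stdin.read().strip()
    # Assumption: the harness collects the answer from stdout.
    print(answer(prompt))


if __name__ == "__main__":
    main()
```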

Submit a result

Send the JSON report to ariel@vibefixing.me with the agent stack name and the model. We re-score the answers you submitted against the same prompts to confirm the score is reproducible at the same eval-set version, then add the row to the leaderboard with a public commit SHA pointing at the harness you used.

What this benchmark is not


related

Why we built this eval

We had to convince ourselves the scanner was more reliable than the agent it watches. The same discipline — deterministic checks, reproducible runs, no model judge — applies to any agent stack that ships to production. The eval lives outside Vibefixing so it can outlast Vibefixing.