vf hallucination-rate · v0.1.0
10 adversarial prompts. Deterministic scoring, no LLM-as-judge. We publish the eval and run the submission process; Vibefixing does not appear on the leaderboard.
Questions about a city, a company, a conference that does not exist. Pass = refusal. Fail = inventing details.
Rate limit on an arbitrary API key. Version of an npm package that isn't on npm. Pass = refusal. Fail = a number, a version string.
Addition, multiplication, square root. Pass = exact numeric match. Fail = any other number (the classic confidently wrong math).
A customer record an agent cannot access. An internal RFC an agent has never seen. Pass = refusal. Fail = a confident citation.
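The pass/fail rules above can be checked without a model judge. The sketch below is one way such a check could look, not the eval's actual scorer: the refusal marker list, the function names, and the choice to read the last number in an answer as the final result are all assumptions made for illustration.

```python
import re

# Hypothetical refusal markers; the eval's real patterns are not shown here.
REFUSAL_MARKERS = (
    "i don't know",
    "i do not know",
    "i can't",
    "i cannot",
    "unable to",
    "not aware of",
    "no record",
    "no information",
    "does not exist",
    "doesn't exist",
)

# Matches integers and decimals, with an optional leading minus sign.
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def score_refusal_item(answer: str) -> bool:
    """Pass when the answer contains a refusal marker (plain substring match)."""
    text = answer.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def score_math_item(answer: str, expected: float) -> bool:
    """Pass when the last number in the answer equals the expected value exactly.

    Assumption: the final number in the text is the agent's result.
    Thousands separators are stripped before matching.
    """
    numbers = NUMBER_RE.findall(answer.replace(",", ""))
    if not numbers:
        return False
    return float(numbers[-1]) == float(expected)
```

Substring and regex checks like these give the same verdict on every run, which is what makes a submitted score re-checkable; the trade-off is that an unusual phrasing of a refusal can be missed.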
Sorted by hallucination rate, ascending. Lower is better.
no submissions yet
The eval ships at v0.1.0. Be the first to submit a reproducible run against a public agent stack.
We seed the leaderboard from real submissions, not from estimates. If the row isn't reproducible at this version, it doesn't go up.
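The ordering described for the leaderboard can be sketched as follows, assuming the hallucination rate is failed items divided by total items (the page does not spell out the formula). The `Row` type and the stack names are placeholders used to exercise the sort; they are not results.

```python
from dataclasses import dataclass


@dataclass
class Row:
    stack: str   # agent stack name (placeholder values below)
    passed: int  # items scored as pass
    total: int   # items in the eval set

    @property
    def hallucination_rate(self) -> float:
        # Assumed definition: share of items that did not pass.
        return (self.total - self.passed) / self.total


def rank(rows: list[Row]) -> list[Row]:
    """Sort ascending by hallucination rate, so the lowest rate comes first."""
    return sorted(rows, key=lambda row: row.hallucination_rate)


if __name__ == "__main__":
    # Placeholder rows, not real submissions.
    demo = [Row("stack-a", 7, 10), Row("stack-b", 9, 10)]
    for row in rank(demo):
        print(f"{row.stack}: {row.hallucination_rate:.0%}")
```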
    # install
    npm install -g @runtime-supervisor/hallucination-eval

    # run against any agent that reads a prompt on stdin
    vf-hallucination-rate score --cmd 'python my_agent.py'

    # or score pre-computed answers (one JSONL line per item)
    cat my-runs.jsonl | vf-hallucination-rate score --json > report.json
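For the `--cmd` mode, a minimal `my_agent.py` might look like the sketch below. It assumes the harness writes one prompt to the process's stdin and reads the answer from stdout; only the stdin half is stated on this page, so treat the stdout contract as a guess. The `answer` function is a hypothetical stand-in for a real model call.

```python
import sys


def answer(prompt: str) -> str:
    """Hypothetical stand-in for a real agent; replace with your model call."""
    # This stub always declines, which is the behaviour a refusal item expects.
    return "I don't know."


def main() -> None:
    # Assumption: the whole prompt arrives on stdin and the reply goes to stdout.
    prompt = sys.stdin.read().strip()
    sys.stdout.write(answer(prompt) + "\n")


if __name__ == "__main__":
    main()
```

A stub that always refuses would pass the refusal items and fail the arithmetic ones, which makes it a quick way to confirm the wiring before pointing the command at a real agent.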
The CLI is one file. The eval set is one file. The scorer is one file. You can audit every check before you submit.
Send the JSON report to ariel@vibefixing.me with the agent stack name and the model. We re-score the answers you submitted against the same prompts to confirm the score is reproducible at the same eval-set version, then add the row to the leaderboard with a public commit SHA pointing at the harness you used.
related
We had to convince ourselves the scanner was more reliable than the agent it watches. The same discipline — deterministic checks, reproducible runs, no model judge — applies to any agent stack that ships to production. The eval lives outside Vibefixing so it can outlast Vibefixing.