Open-source benchmark

TRASHFIRE

The AI code review benchmark your LLM doesn't want to see.

42 Projects
4,200+ Planted Issues
30+ Languages
- Runs
GitHub · Results

How to run

Three steps. No scripts. You talk to your AI, it reviews the code, you score the result.

STEP 1 - SETUP (once)
$ git clone https://github.com/ktdmax/trashfire.git
$ cd trashfire/_scoring && npm install && cd ..
STEP 2 - REVIEW

Open your AI tool in the repo folder. Load your skills if you have any. Pick a vault:

Standard (3 vaults) · Focused (1 vault) · Ultimate (42 vaults)
1. Clone https://github.com/ktdmax/trashfire.git and read BENCHMARK_RUNNER.md for the full rules.
2. Run: cd _scoring && npm install && cd .. && bash _scoring/create-blind-copy.sh
3. IMPORTANT: Do not grep for BUG markers.
4. The review prompt is in _prompts/base-review.md (it defines the JSON output format, the 6 categories SEC/LOGIC/PERF/BP/SMELL/TRICKY, severity levels, and the rules). Use it as Layer 0.
5. Review these 3 vaults one by one: _blind/grog-shop/, _blind/tentacle-labs/, _blind/lechuck-crypt/. Read each file individually from top to bottom.
6. Cover ALL categories: security, logic, performance, best practices, code smells, and tricky cross-module bugs.
7. Report each finding separately as JSON.
8. Upload results at trashfire.io/#score-section to score against ground truth.
3 vaults (Next.js + Python + C), ~1.5-3 hours, standard benchmark
STEP 3 - SCORE
$ curl -s -X POST https://trashfire.io/api/score \
  -H "Content-Type: application/json" \
  -d '{"project":"grog-shop","review":...}' | jq .

That's it. Or drag-and-drop your review.json below.
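What a minimal review.json might look like: the top-level { findings: [...] } shape comes from this page, but the per-finding field names below are illustrative assumptions; the authoritative schema (JSON format, the 6 categories, severity levels) is defined in _prompts/base-review.md.

```shell
# Hypothetical minimal review.json. Only the top-level { findings: [...] }
# shape is documented on this page; the per-finding fields, the file path,
# and the severity value are assumptions for illustration.
cat > review.json <<'EOF'
{
  "findings": [
    {
      "file": "grog-shop/app/api/checkout/route.ts",
      "line": 42,
      "category": "SEC",
      "severity": "high",
      "description": "User-supplied price is trusted when creating the order"
    }
  ]
}
EOF

# Then upload it for scoring (same endpoint as the curl above):
#   curl -s -X POST https://trashfire.io/api/score \
#     -H "Content-Type: application/json" \
#     -d @review.json | jq .
```

Using -d @review.json keeps the payload in a file instead of inlining it on the command line.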

Works with any AI tool that can read files: Claude Code, Cursor, Windsurf, Codex, Gemini, Aider, ... The AI reads the code itself using its own tools. No subprocess, no API proxy, no file pasting. That's what makes it a real test.

What we test

Six dimensions of code review ability, weighted by importance.

35% Security: OWASP Top 10, CWE-mapped vulnerabilities
25% Tricky: cross-module bugs, edge cases, timing
20% Logic: off-by-one, race conditions, async
10% Performance: N+1 queries, memory leaks, blocking I/O
5% Best Practice: hardcoded values, swallowed errors
5% Code Smell: dead code, duplication, god functions
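A sketch of how those weights could combine per-category scores into a composite, assuming each category is scored on a 0-100 scale (the example scores are made up, and the official scorer at trashfire.io/api/score is authoritative):

```shell
# Example per-category scores (0-100, invented for illustration),
# weighted by the published percentages.
awk 'BEGIN {
  sec = 80; tricky = 50; logic = 70; perf = 60; bp = 90; smell = 80
  composite = 0.35*sec + 0.25*tricky + 0.20*logic + 0.10*perf + 0.05*bp + 0.05*smell
  printf "composite: %.1f\n", composite
}'
# prints: composite: 69.0
```

Note how Security and Tricky together carry 60% of the weight: missing a cross-module bug costs five times as much as missing a code smell.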

Results

42 Vaults

Three benchmark tiers. Pick your level.

STANDARD BENCHMARK
3 Vaults - ~3 hours
grog-shop Next.js/TypeScript - Web
tentacle-labs Python/Flask - API
lechuck-crypt C/OpenSSL - Systems
The official benchmark. Three stacks, three worlds. Your composite score is the average of all three.
FOCUSED
1 Vault - ~1 hour
Pick any vault. Test a specialized skill. Python expert? Try tentacle-labs. Rust dev? Try guybrush-ledger. Writing Solidity? Try spiffy-anchor.
ULTIMATE
42 Vaults - ~42 hours
All 42 vaults. 30+ languages. 4,200+ bugs. For those who want the full picture. Completionists get honored on the leaderboard.

Open Competition

The vanilla results are in. Now it's your turn.

Build a skill, a prompt, or a workflow that finds more bugs than vanilla AI. Share it openly. The best skill ships to the community and gets used in real code reviews.

100 planted bugs per project. 42 intentionally broken apps. One fair benchmark. The best skill wins - and helps make the internet a bit harder to exploit.

1. General purpose - must work on any codebase, not just this one
2. Public and open - anyone can inspect, use, and build on it
3. Fair play - no gaming the benchmark, no hard-coded findings
Read the rules · Create your badge

Score Your Review

Upload your review.json to score it against encrypted ground truth.

Drop review.json here or click to browse
JSON with { findings: [...] }