Open-source benchmark

TRASHFIRE

The AI code review benchmark your LLM doesn't want to see.

42 Projects
4,200+ Planted Issues
30+ Languages
- Runs
GitHub · Results

How to run

Three steps. No scripts. You talk to your AI, it reviews the code, you score the result.

STEP 1 - SETUP (once)
$ git clone https://github.com/ktdmax/trashfire.git
$ cd trashfire/_scoring && npm install && cd ..
STEP 2 - REVIEW

Open your AI tool in the repo folder. Load your skills if you have any. Pick a vault:

Standard (3 vaults) · Focused (1 vault) · Ultimate (42 vaults)
1. Clone https://github.com/ktdmax/trashfire.git and read BENCHMARK_RUNNER.md for the full rules.
2. Run: cd _scoring && npm install && cd .. && bash _scoring/create-blind-copy.sh
3. IMPORTANT: Do not grep for BUG markers.
4. The review prompt is in _prompts/base-review.md (it defines the JSON output format, the 6 categories SEC/LOGIC/PERF/BP/SMELL/TRICKY, severity levels, and the rules). Use it as Layer 0.
5. Review these 3 vaults one by one: _blind/grog-shop/, _blind/tentacle-labs/, _blind/lechuck-crypt/. Read each file individually from top to bottom.
6. Cover ALL categories: security, logic, performance, best practices, code smells, and tricky cross-module bugs.
7. Report each finding separately as JSON.
8. Upload results at trashfire.io/#score-section to score against ground truth.
3 vaults (Next.js + Python + C), ~1.5-3 hours, standard benchmark
STEP 3 - SCORE
$ curl -s -X POST https://trashfire.io/api/score \
  -H "Content-Type: application/json" \
  -d '{"project":"grog-shop","review":...}' | jq .

That's it. Or drag-and-drop your review.json below.
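What a minimal review.json might look like: the top-level { findings: [...] } shape comes from this page, but the per-finding field names below are illustrative assumptions; the authoritative schema (JSON format, the 6 categories, severity levels) is defined in _prompts/base-review.md.

```shell
# Hypothetical minimal review.json. Only the top-level { findings: [...] }
# shape is documented on this page; the per-finding fields, the file path,
# and the severity value are assumptions for illustration.
cat > review.json <<'EOF'
{
  "findings": [
    {
      "file": "grog-shop/app/api/checkout/route.ts",
      "line": 42,
      "category": "SEC",
      "severity": "high",
      "description": "User-supplied price is trusted when creating the order"
    }
  ]
}
EOF

# Then upload it for scoring (same endpoint as the curl above):
#   curl -s -X POST https://trashfire.io/api/score \
#     -H "Content-Type: application/json" \
#     -d @review.json | jq .
```

Using -d @review.json keeps the payload in a file instead of inlining it on the command line.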

Works with any AI tool that can read files: Claude Code, Cursor, Windsurf, Codex, Gemini, Aider, ... The AI reads the code itself using its own tools. No subprocess, no API proxy, no file pasting. That's what makes it a real test.

What we test

Six dimensions of code review ability, weighted by importance.

35% Security: OWASP Top 10, CWE-mapped vulnerabilities
25% Tricky: cross-module bugs, edge cases, timing
20% Logic: off-by-one, race conditions, async
10% Performance: N+1 queries, memory leaks, blocking I/O
5% Best Practice: hardcoded values, swallowed errors
5% Code Smell: dead code, duplication, god functions
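A sketch of how those weights could combine per-category scores into a composite, assuming each category is scored on a 0-100 scale (the example scores are made up, and the official scorer at trashfire.io/api/score is authoritative):

```shell
# Example per-category scores (0-100, invented for illustration),
# weighted by the published percentages.
awk 'BEGIN {
  sec = 80; tricky = 50; logic = 70; perf = 60; bp = 90; smell = 80
  composite = 0.35*sec + 0.25*tricky + 0.20*logic + 0.10*perf + 0.05*bp + 0.05*smell
  printf "composite: %.1f\n", composite
}'
# prints: composite: 69.0
```

Note how Security and Tricky together carry 60% of the weight: missing a cross-module bug costs five times as much as missing a code smell.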

Results

42 Vaults

Three benchmark tiers. Pick your level.

STANDARD BENCHMARK
3 Vaults - ~3 hours
grog-shop Next.js/TypeScript - Web
tentacle-labs Python/Flask - API
lechuck-crypt C/OpenSSL - Systems
The official benchmark. Three stacks, three worlds. Your composite score is the average of all three.
FOCUSED
1 Vault - ~1 hour
Pick any vault. Test a specialized skill. Python expert? Try tentacle-labs. Rust dev? Try guybrush-ledger. Writing Solidity? Try spiffy-anchor.
ULTIMATE
42 Vaults - ~42 hours
All 42 vaults. 30+ languages. 4,200+ bugs. For those who want the full picture. Completionists get honored on the leaderboard.

Open Competition

The vanilla results are in. Now it's your turn.

Build a skill, a prompt, or a workflow that finds more bugs than vanilla AI. Share it openly. The best skill ships to the community and gets used in real code reviews.

100 planted bugs per project. 42 intentionally broken apps. One fair benchmark. The best skill wins - and helps make the internet a bit harder to exploit.

1. General purpose - must work on any codebase, not just this one
2. Public and open - anyone can inspect, use, and build on it
3. Fair play - no gaming the benchmark, no hard-coded findings
Read the rules · Create your badge

Score Your Review

Upload your review.json to score it against encrypted ground truth.

Drop review.json here or click to browse
JSON with { findings: [...] }