Benchmark: securereview-7b vs Qwen2.5-Coder-7B¶
Date: 2026-04-17
Scanner: Foil M10 agentic scanner (rules cap: 8, no tool-use for SR-7b)
Hardware: Apple Silicon, mlx-lm 0.31.1, vllm-mlx foil-patches
Method: Same scanner code, same settings, model swap via `foil model activate`
Models¶
| Model | Params | Size | Type |
|---|---|---|---|
| vitorallo/securereview-7b-mlx-4bit | 7B | 4.0 GB | QLoRA fine-tune on Qwen2.5-Coder-7B |
| mlx-community/Qwen2.5-Coder-7B-Instruct-4bit | 7B | 4.0 GB | Base model |
Test 1: python_vuln (3 Python files, intentional vulns)¶
| | Qwen | SR-7b |
|---|---|---|
| Findings | 7 | 7 |
| Shared | 7 | 7 |
| Agreement | 100% | 100% |
Both found all SQL injection, XSS, command injection, and broken access control.
Test 2: flask_vuln (1 Python file, Flask app)¶
| | Qwen | SR-7b |
|---|---|---|
| Findings | 3 | 3 |
| Duration | 42.7s | 33.2s |
| Fuzzy match (±5 lines) | 3/3 | 3/3 |
Same 3 vulnerabilities (2× SQL Injection, 1× XSS). Minor line offsets. SR-7b is 22% faster.
| # | Qwen | SR-7b |
|---|---|---|
| 1 | [HIGH] SQL Injection L19 c=1.0 | [HIGH] SQL Injection L23 c=0.95 |
| 2 | [MEDIUM] XSS L29 c=0.9 | [MEDIUM] XSS L29 c=0.85 |
| 3 | [HIGH] SQL Injection L37 c=1.0 | [HIGH] SQL Injection L41 c=0.95 |
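The ±5-line fuzzy match used above can be sketched as follows. This is an illustrative reimplementation for clarity, not the scanner's actual comparison code; the finding dicts mirror the Test 2 table.

```python
def fuzzy_match(a, b, tolerance=5):
    """Pair findings that share a vulnerability type and whose
    reported lines differ by at most `tolerance`. Each finding in
    `b` is consumed at most once."""
    matched = []
    unmatched_b = list(b)
    for fa in a:
        hit = next((fb for fb in unmatched_b
                    if fb["type"] == fa["type"]
                    and abs(fb["line"] - fa["line"]) <= tolerance), None)
        if hit is not None:
            matched.append((fa, hit))
            unmatched_b.remove(hit)
    return matched

# Findings from the Test 2 table (flask_vuln)
qwen = [{"type": "SQL Injection", "line": 19},
        {"type": "XSS", "line": 29},
        {"type": "SQL Injection", "line": 37}]
sr7b = [{"type": "SQL Injection", "line": 23},
        {"type": "XSS", "line": 29},
        {"type": "SQL Injection", "line": 41}]

pairs = fuzzy_match(qwen, sr7b)
assert len(pairs) == 3  # 3/3, matching the table despite the line offsets
```

Greedy first-fit pairing is enough here because each type occurs at most twice and the offsets are small; a stricter harness could minimize total line distance instead.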
Test 3: express_vuln (1 JS file, Express app)¶
| | Qwen | SR-7b |
|---|---|---|
| Findings | 5 | 4 |
| Shared | 2 | 2 |
| New (SR-7b only) | — | +2 |
| Lost (Qwen only) | -3 | — |
| Finding | Qwen | SR-7b |
|---|---|---|
| Security Misconfiguration L17 | HIGH | — dropped (FP reduction) |
| SQL Injection L28 | HIGH | — (found L43 instead) |
| XSS L38 | — | MEDIUM (new find) |
| SQL Injection L43/44 | MEDIUM L44 | MEDIUM L43 |
| SQL Injection L55 | MEDIUM | MEDIUM |
| Broken Authentication L61 | MEDIUM | MEDIUM |
SR-7b dropped one generic "Security Misconfiguration" finding (likely a false positive) and found one XSS that Qwen missed. It is also more conservative on severity (MEDIUM where Qwen assigns HIGH).
Test 4: wagtail/api (17 Python files, Django REST framework, production code)¶
| | Qwen | SR-7b |
|---|---|---|
| Findings | 0 | 0 |
Both models correctly identify clean code: no false-positive hallucinations on a well-written Django REST API. This is the critical FP test — SR-7b doesn't introduce noise on production-quality code.
Test 5: next-page-consumer (rav-code subfolder, ~30 TS files, Next.js)¶
Note: This comparison is partially confounded — the SR-7b scan ran after guardrails were added (rules cap, tool-use skip). The Qwen baseline was from an earlier scanner version. Included for completeness but not a clean A/B.
| | Qwen (old scanner) | SR-7b (new scanner) |
|---|---|---|
| Findings | 82 (H:8 M:27 L:47) | 108 (H:72 M:13 L:23) |
Key differences:
- SR-7b found 21 IDOR findings Qwen missed (route handlers with `[id]` params)
- SR-7b dropped most LOW "Security Misconfiguration" (noise reduction)
- SR-7b upgraded many findings to HIGH (more aggressive severity)
- SQL Injection spike (+27) was partially FPs on `fetch()` template literals
Integration notes¶
securereview-7b requires specific guardrails in the Foil scanner:
- No guided decoding (`guided_schema=None`) — the model was trained on a fixed JSON schema, and guided decoding conflicts with the fine-tune, causing output floods
- No tool-use prompt — the tool-use template (`TOOL_USE_TEMPLATE`) triggers verbosity; SR-7b uses the standard review prompt only
- Rules cap at 8 — the model was trained with max ~6 rules per example; more overwhelms it
- Findings cap at 10 per call — safety net for truncated JSON recovery
These guardrails are auto-applied when "securereview" is detected in the model name. Qwen models use the full tool-use path with guided decoding as before.
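Model-conditional guardrail selection might be wired up roughly like this. A minimal sketch: the field names (`use_tool_prompt`, `max_rules`, `max_findings_per_call`) and the `"findings_v1"` schema name are assumptions for illustration, not Foil's real API; only `guided_schema=None` and the cap values come from the notes above.

```python
def guardrails_for(model_name: str) -> dict:
    """Return scanner settings for a given model name.
    Hypothetical helper; keys are illustrative, not Foil's actual config."""
    if "securereview" in model_name.lower():
        return {
            "guided_schema": None,        # guided decoding conflicts with the fine-tune
            "use_tool_prompt": False,     # tool-use template triggers verbosity
            "max_rules": 8,               # trained with ~6 rules per example
            "max_findings_per_call": 10,  # safety net for truncated JSON recovery
        }
    # Qwen and other models keep the full tool-use path with guided decoding
    return {
        "guided_schema": "findings_v1",   # assumed schema name
        "use_tool_prompt": True,
        "max_rules": None,                # no cap
        "max_findings_per_call": None,
    }

sr = guardrails_for("vitorallo/securereview-7b-mlx-4bit")
qwen = guardrails_for("mlx-community/Qwen2.5-Coder-7B-Instruct-4bit")
assert sr["guided_schema"] is None and sr["max_rules"] == 8
assert qwen["use_tool_prompt"] and qwen["guided_schema"] is not None
```

Keying on a substring of the model name is fragile but matches the auto-detection described above; a registry of per-model profiles would be the more robust design.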
Conclusions¶
- Detection parity — on clean benchmarks (python_vuln, flask_vuln, wagtail), both models find the same vulnerabilities
- FP reduction — SR-7b drops generic "Security Misconfiguration" noise, doesn't hallucinate on clean code
- Speed — SR-7b is ~20% faster (smaller effective output, no thinking overhead)
- Severity calibration — SR-7b is more conservative (more MEDIUM vs HIGH)
- IDOR/logic bugs — SR-7b detects more IDOR patterns (trained specifically on these)
- Limitations — SR-7b can over-flag `fetch()` template literals as SQL injection on large JS codebases
Recommendation¶
Use securereview-7b as the default model. Its FP reduction and IDOR detection advantages outweigh the minor severity calibration difference. For critical audits, run with `--deep` to let Phase 6 investigation dismiss remaining FPs.