Benchmark: securereview-7b vs Qwen2.5-Coder-7B¶
Date: 2026-04-17
Scanner: Foil M10 agentic scanner (rules cap: 8, no tool-use for SR-7b)
Hardware: Apple Silicon, mlx-lm 0.31.1, vllm-mlx foil-patches
Method: Same scanner code, same settings, model swap via `foil model activate`
Models¶
| Model | Params | Size | Type |
|---|---|---|---|
| vitorallo/securereview-7b-mlx-4bit | 7B | 4.0 GB | QLoRA fine-tune on Qwen2.5-Coder-7B |
| mlx-community/Qwen2.5-Coder-7B-Instruct-4bit | 7B | 4.0 GB | Base model |
Test 1: python_vuln (3 Python files, intentional vulns)¶
| | Qwen | SR-7b |
|---|---|---|
| Findings | 7 | 7 |
| Shared | 7 | 7 |
| Agreement | 100% | 100% |
Both found all SQL injection, XSS, command injection, and broken access control.
Test 2: flask_vuln (1 Python file, Flask app)¶
| | Qwen | SR-7b |
|---|---|---|
| Findings | 3 | 3 |
| Duration | 42.7s | 33.2s |
| Fuzzy match (±5 lines) | 3/3 | 3/3 |
Same 3 vulnerabilities (2× SQL Injection, 1× XSS). Minor line offsets. SR-7b is 22% faster.
| # | Qwen | SR-7b |
|---|---|---|
| 1 | [HIGH] SQL Injection L19 c=1.0 | [HIGH] SQL Injection L23 c=0.95 |
| 2 | [MEDIUM] XSS L29 c=0.9 | [MEDIUM] XSS L29 c=0.85 |
| 3 | [HIGH] SQL Injection L37 c=1.0 | [HIGH] SQL Injection L41 c=0.95 |
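The ±5-line fuzzy match used above can be sketched as follows. This is an illustrative reimplementation for clarity, not the scanner's actual comparison code; the finding dicts mirror the Test 2 table.

```python
def fuzzy_match(a, b, tolerance=5):
    """Pair findings that share a vulnerability type and whose
    reported lines differ by at most `tolerance`. Each finding in
    `b` is consumed at most once."""
    matched = []
    unmatched_b = list(b)
    for fa in a:
        hit = next((fb for fb in unmatched_b
                    if fb["type"] == fa["type"]
                    and abs(fb["line"] - fa["line"]) <= tolerance), None)
        if hit is not None:
            matched.append((fa, hit))
            unmatched_b.remove(hit)
    return matched

# Findings from the Test 2 table (flask_vuln)
qwen = [{"type": "SQL Injection", "line": 19},
        {"type": "XSS", "line": 29},
        {"type": "SQL Injection", "line": 37}]
sr7b = [{"type": "SQL Injection", "line": 23},
        {"type": "XSS", "line": 29},
        {"type": "SQL Injection", "line": 41}]

pairs = fuzzy_match(qwen, sr7b)
assert len(pairs) == 3  # 3/3, matching the table despite the line offsets
```

Greedy first-fit pairing is enough here because each type occurs at most twice and the offsets are small; a stricter harness could minimize total line distance instead.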
Test 3: express_vuln (1 JS file, Express app)¶
| | Qwen | SR-7b |
|---|---|---|
| Findings | 5 | 4 |
| Shared | 2 | 2 |
| New (SR-7b only) | — | +2 |
| Lost (Qwen only) | -3 | — |
| Finding | Qwen | SR-7b |
|---|---|---|
| Security Misconfiguration L17 | HIGH | — dropped (FP reduction) |
| SQL Injection L28 | HIGH | — (found L43 instead) |
| XSS L38 | — | MEDIUM (new find) |
| SQL Injection L43/44 | MEDIUM L44 | MEDIUM L43 |
| SQL Injection L55 | MEDIUM | MEDIUM |
| Broken Authentication L61 | MEDIUM | MEDIUM |
SR-7b dropped one generic "Security Misconfiguration" finding (likely a false positive) and found one XSS that Qwen missed. It is also more conservative on severity (MEDIUM where Qwen assigns HIGH).
Test 4: wagtail/api (17 Python files, Django REST framework, production code)¶
| | Qwen | SR-7b |
|---|---|---|
| Findings | 0 | 0 |
Both models correctly identify clean code: no false-positive hallucinations on a well-written Django REST API. This is the critical FP test — SR-7b doesn't introduce noise on production-quality code.
Test 5: next-page-consumer (rav-code subfolder, ~30 TS files, Next.js)¶
Note: This comparison is partially confounded — the SR-7b scan ran after guardrails were added (rules cap, tool-use skip). The Qwen baseline was from an earlier scanner version. Included for completeness but not a clean A/B.
| | Qwen (old scanner) | SR-7b (new scanner) |
|---|---|---|
| Findings | 82 (H:8 M:27 L:47) | 108 (H:72 M:13 L:23) |
Key differences:
- SR-7b found 21 IDOR findings Qwen missed (route handlers with `[id]` params)
- SR-7b dropped most LOW "Security Misconfiguration" (noise reduction)
- SR-7b upgraded many findings to HIGH (more aggressive severity)
- SQL Injection spike (+27) was partially FPs on `fetch()` template literals
Integration notes¶
securereview-7b requires specific guardrails in the Foil scanner:
- No guided decoding (`guided_schema=None`) — the model was trained on a fixed JSON schema, and guided decoding conflicts with the fine-tune, causing output floods
- No tool-use prompt — the tool-use template (`TOOL_USE_TEMPLATE`) triggers verbosity; SR-7b uses the standard review prompt only
- Rules cap at 8 — the model was trained with max ~6 rules per example; more overwhelms it
- Findings cap at 10 per call — safety net for truncated JSON recovery
These guardrails are auto-applied when "securereview" is detected in the model name. Qwen models use the full tool-use path with guided decoding as before.
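Model-conditional guardrail selection might be wired up roughly like this. A minimal sketch: the field names (`use_tool_prompt`, `max_rules`, `max_findings_per_call`) and the `"findings_v1"` schema name are assumptions for illustration, not Foil's real API; only `guided_schema=None` and the cap values come from the notes above.

```python
def guardrails_for(model_name: str) -> dict:
    """Return scanner settings for a given model name.
    Hypothetical helper; keys are illustrative, not Foil's actual config."""
    if "securereview" in model_name.lower():
        return {
            "guided_schema": None,        # guided decoding conflicts with the fine-tune
            "use_tool_prompt": False,     # tool-use template triggers verbosity
            "max_rules": 8,               # trained with ~6 rules per example
            "max_findings_per_call": 10,  # safety net for truncated JSON recovery
        }
    # Qwen and other models keep the full tool-use path with guided decoding
    return {
        "guided_schema": "findings_v1",   # assumed schema name
        "use_tool_prompt": True,
        "max_rules": None,                # no cap
        "max_findings_per_call": None,
    }

sr = guardrails_for("vitorallo/securereview-7b-mlx-4bit")
qwen = guardrails_for("mlx-community/Qwen2.5-Coder-7B-Instruct-4bit")
assert sr["guided_schema"] is None and sr["max_rules"] == 8
assert qwen["use_tool_prompt"] and qwen["guided_schema"] is not None
```

Keying on a substring of the model name is fragile but matches the auto-detection described above; a registry of per-model profiles would be the more robust design.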
Conclusions¶
- Detection parity — on clean benchmarks (python_vuln, flask_vuln, wagtail), both models find the same vulnerabilities
- FP reduction — SR-7b drops generic "Security Misconfiguration" noise, doesn't hallucinate on clean code
- Speed — SR-7b is ~20% faster (smaller effective output, no thinking overhead)
- Severity calibration — SR-7b is more conservative (more MEDIUM vs HIGH)
- IDOR/logic bugs — SR-7b detects more IDOR patterns (trained specifically on these)
- Limitations — SR-7b can over-flag `fetch()` template literals as SQL injection on large JS codebases
Recommendation¶
Use securereview-7b as the default model. Its FP reduction and IDOR detection advantages outweigh the minor severity calibration difference. For critical audits, run with `--deep` to let Phase 6 investigation dismiss remaining FPs.