
Benchmark: securereview-7b vs Qwen2.5-Coder-7B

- Date: 2026-04-17
- Scanner: Foil M10 agentic scanner (rules cap: 8, no tool-use for SR-7b)
- Hardware: Apple Silicon, mlx-lm 0.31.1, vllm-mlx foil-patches
- Method: Same scanner code, same settings, model swap via foil model activate

Models

| Model | Params | Size | Type |
| --- | --- | --- | --- |
| vitorallo/securereview-7b-mlx-4bit | 7B | 4.0 GB | QLoRA fine-tune on Qwen2.5-Coder-7B |
| mlx-community/Qwen2.5-Coder-7B-Instruct-4bit | 7B | 4.0 GB | Base model |

Test 1: python_vuln (3 Python files, intentional vulns)

| Metric | Qwen | SR-7b |
| --- | --- | --- |
| Findings | 7 | 7 |
| Shared | 7 | 7 |
| Agreement | 100% | 100% |

Both models found every SQL injection, XSS, command injection, and broken access control issue.

Test 2: flask_vuln (1 Python file, Flask app)

| Metric | Qwen | SR-7b |
| --- | --- | --- |
| Findings | 3 | 3 |
| Duration | 42.7s | 33.2s |
| Fuzzy match (±5 lines) | 3/3 | 3/3 |

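Both models report the same 3 vulnerabilities (2× SQL Injection, 1× XSS). SR-7b places each SQL injection 4 lines later than Qwen (L23 vs L19, L41 vs L37), which is within the ±5 line tolerance. SR-7b finished in 22% less time (33.2s vs 42.7s).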

| # | Qwen | SR-7b |
| --- | --- | --- |
| 1 | [HIGH] SQL Injection L19 c=1.0 | [HIGH] SQL Injection L23 c=0.95 |
| 2 | [MEDIUM] XSS L29 c=0.9 | [MEDIUM] XSS L29 c=0.85 |
| 3 | [HIGH] SQL Injection L37 c=1.0 | [HIGH] SQL Injection L41 c=0.95 |
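
The ±5 line fuzzy match can be illustrated with a minimal sketch. The `Finding` class and `fuzzy_match` function below are hypothetical names for illustration, not the scanner's actual code, and the greedy nearest-line pairing is an assumption about how such matching might work. The input data is the Test 2 finding list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    kind: str  # vulnerability class, e.g. "SQL Injection"
    line: int  # line number reported by the model


def fuzzy_match(baseline, candidate, tolerance=5):
    """Greedily pair findings of the same kind whose lines differ by <= tolerance."""
    unmatched = list(candidate)
    pairs = []
    for b in baseline:
        # Candidates of the same kind that fall inside the tolerance window.
        options = [c for c in unmatched
                   if c.kind == b.kind and abs(c.line - b.line) <= tolerance]
        if options:
            # Prefer the closest line; each candidate is used at most once.
            best = min(options, key=lambda c: abs(c.line - b.line))
            unmatched.remove(best)
            pairs.append((b, best))
    return pairs


qwen = [Finding("SQL Injection", 19), Finding("XSS", 29), Finding("SQL Injection", 37)]
sr7b = [Finding("SQL Injection", 23), Finding("XSS", 29), Finding("SQL Injection", 41)]

pairs = fuzzy_match(qwen, sr7b)
print(f"{len(pairs)}/{len(qwen)} matched")  # 3/3 matched
```

With `tolerance=0` only the XSS finding at L29 would pair up, which is why a line window is needed to compare models that anchor the same issue a few lines apart.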

Test 3: express_vuln (1 JS file, Express app)

| Metric | Qwen | SR-7b |
| --- | --- | --- |
| Findings | 5 | 4 |
| Shared | 2 | 2 |
| New (SR-7b only) | | +2 |
| Lost (Qwen only) | | -3 |

| Finding | Qwen | SR-7b |
| --- | --- | --- |
| Security Misconfiguration L17 | HIGH | - dropped (FP reduction) |
| SQL Injection L28 | HIGH | - (found L43 instead) |
| XSS L38 | - | MEDIUM (new find) |
| SQL Injection L43/44 | MEDIUM L44 | MEDIUM L43 |
| SQL Injection L55 | MEDIUM | MEDIUM |
| Broken Authentication L61 | MEDIUM | MEDIUM |

SR-7b dropped 1 generic "Security Misconfiguration" (likely FP), found 1 XSS Qwen missed. More conservative on severity (MEDIUM vs HIGH).
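
The shared/new/lost counts for this test can be reproduced with plain set arithmetic. This is a sketch under one assumption: findings are keyed on exact (class, line) pairs, so the SQL injection reported at L44 by Qwen and at L43 by SR-7b counts once as lost and once as new. The source does not state the matching rule; this reading is simply the one that yields 2 shared, +2 new, and -3 lost.

```python
# (vulnerability class, line) pairs transcribed from the Test 3 finding list.
qwen = {
    ("Security Misconfiguration", 17),
    ("SQL Injection", 28),
    ("SQL Injection", 44),
    ("SQL Injection", 55),
    ("Broken Authentication", 61),
}
sr7b = {
    ("XSS", 38),
    ("SQL Injection", 43),
    ("SQL Injection", 55),
    ("Broken Authentication", 61),
}

shared = qwen & sr7b  # reported by both models at the same line
new = sr7b - qwen     # reported by SR-7b only
lost = qwen - sr7b    # reported by Qwen only

print(len(shared), len(new), len(lost))  # 2 2 3
```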

Test 4: wagtail/api (17 Python files, Django REST framework, production code)

| Metric | Qwen | SR-7b |
| --- | --- | --- |
| Findings | 0 | 0 |

Both models report zero findings on this code base, which is treated here as clean. Neither hallucinates false positives on a well-written Django REST API. This is the critical FP test: SR-7b does not introduce noise on production-quality code.

Test 5: next-page-consumer (rav-code subfolder, ~30 TS files, Next.js)

Note: This comparison is partially confounded — the SR-7b scan ran after guardrails were added (rules cap, tool-use skip). The Qwen baseline was from an earlier scanner version. Included for completeness but not a clean A/B.

| Metric | Qwen (old scanner) | SR-7b (new scanner) |
| --- | --- | --- |
| Findings | 82 (H:8 M:27 L:47) | 108 (H:72 M:13 L:23) |

Key differences:

- SR-7b found 21 IDOR findings Qwen missed (route handlers with [id] params)
- SR-7b dropped most LOW "Security Misconfiguration" (noise reduction)
- SR-7b upgraded many findings to HIGH (more aggressive severity)
- SQL Injection spike (+27) was partially FPs on fetch() template literals
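
To put the severity shift in proportion, the severity breakdowns reported for this test can be converted to percentage shares. The helper names below are illustrative only, and the caveat about the confounded comparison still applies to the resulting numbers.

```python
import re


def parse_severity(summary):
    """Parse a string like 'H:8 M:27 L:47' into {'H': 8, 'M': 27, 'L': 47}."""
    return {level: int(count) for level, count in re.findall(r"([HML]):(\d+)", summary)}


def shares(counts):
    """Return each severity level as a percentage of all findings, to one decimal."""
    total = sum(counts.values())
    return {level: round(100 * count / total, 1) for level, count in counts.items()}


qwen = parse_severity("H:8 M:27 L:47")    # old scanner
sr7b = parse_severity("H:72 M:13 L:23")   # new scanner

print(sum(qwen.values()), shares(qwen))   # 82 {'H': 9.8, 'M': 32.9, 'L': 57.3}
print(sum(sr7b.values()), shares(sr7b))   # 108 {'H': 66.7, 'M': 12.0, 'L': 21.3}
```

On these figures, HIGH findings go from roughly a tenth of the Qwen output to about two thirds of the SR-7b output.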

Integration notes

securereview-7b requires specific guardrails in the Foil scanner:

  1. No guided decoding (guided_schema=None) — the model was trained on a fixed JSON schema and guided decoding conflicts with the fine-tune, causing output floods
  2. No tool-use prompt — the tool-use template (TOOL_USE_TEMPLATE) triggers verbosity; SR-7b uses the standard review prompt only
  3. Rules cap at 8 — the model was trained with max ~6 rules per example; more overwhelms it
  4. Findings cap at 10 per call — safety net for truncated JSON recovery

These guardrails are auto-applied when "securereview" is detected in the model name. Qwen models use the full tool-use path with guided decoding as before.
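
As a rough illustration of name-based guardrail selection, the sketch below picks settings from the model name. Everything except the `guided_schema=None` setting and the cap values is hypothetical: `ScanSettings`, `settings_for`, and `FINDINGS_SCHEMA` are invented for this example, the case-insensitive match is an assumption, and the source does not say which caps apply to other models, so `None` is used as a placeholder there.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for the scanner's real findings schema.
FINDINGS_SCHEMA = {"type": "array"}


@dataclass(frozen=True)
class ScanSettings:
    guided_schema: Optional[dict]  # None disables guided decoding
    use_tool_prompt: bool          # whether the tool-use template is sent
    rules_cap: Optional[int]       # max rules per call (None = not specified here)
    findings_cap: Optional[int]    # max findings kept per call (None = not specified here)


def settings_for(model_name: str) -> ScanSettings:
    """Select guardrails from the model name (assumed case-insensitive match)."""
    if "securereview" in model_name.lower():
        return ScanSettings(
            guided_schema=None,     # guardrail 1: no guided decoding
            use_tool_prompt=False,  # guardrail 2: standard review prompt only
            rules_cap=8,            # guardrail 3
            findings_cap=10,        # guardrail 4
        )
    # Other models keep the full tool-use path with guided decoding.
    return ScanSettings(
        guided_schema=FINDINGS_SCHEMA,
        use_tool_prompt=True,
        rules_cap=None,
        findings_cap=None,
    )


print(settings_for("vitorallo/securereview-7b-mlx-4bit"))
print(settings_for("mlx-community/Qwen2.5-Coder-7B-Instruct-4bit"))
```

Keying the switch on a substring keeps the model swap a one-step operation, at the cost of silently skipping the guardrails if a fine-tune is published under a different name.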

Conclusions

  1. Detection parity — on clean benchmarks (python_vuln, flask_vuln, wagtail), both models find the same vulnerabilities
  2. FP reduction — SR-7b drops generic "Security Misconfiguration" noise, doesn't hallucinate on clean code
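  3. Speed — SR-7b finished flask_vuln in about 22% less time (33.2s vs 42.7s), the only test that reports durations here (smaller effective output, no thinking overhead)
  4. Severity calibration — SR-7b is more conservative on express_vuln (MEDIUM vs HIGH), but in the confounded next-page-consumer run it rated far more findings HIGH (72 vs 8)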
  5. IDOR/logic bugs — SR-7b detects more IDOR patterns (trained specifically on these)
  6. Limitations — SR-7b can over-flag fetch() template literals as SQL injection on large JS codebases

Recommendation

Use securereview-7b as the default model. Its FP reduction and IDOR detection advantages outweigh the minor severity calibration difference. For critical audits, run with --deep to let Phase 6 investigation dismiss remaining FPs.