How to Validate Your Security Scanner: Lessons from Running Automated Skills Against a Real Target

Published: April 7, 2026 · Reading time: 12 min


Every security scanner works perfectly against its own test suite. The question is whether it works against real targets. We ran 6 SecurityClaw passive scanning skills against httpbin.org — a public HTTP testing service — and documented every finding. 5 skills produced accurate results with zero false positives. 1 skill produced 14 false positives, all caused by a single bug. This article covers what we tested, what we found, and the broader lessons about validating automated security tools before trusting them in production.

Why You Need to Validate Your Scanner Against Real Targets

Unit tests tell you that your scanner's detection logic works when given a specific input. They don't tell you whether the scanner works when pointed at a real HTTP server with real response headers, real TLS certificates, real content types, and real network behavior.

The gap between "works in tests" and "works in production" is where false positives live. And false positives are the silent killer of security scanning programs. A scanner that produces 50 findings per scan sounds impressive until your team realizes 45 of them are noise. After a few rounds of that, nobody reads the reports anymore.

The solution is straightforward: before you trust a scanner in your pipeline, run it against a real target where you know the ground truth. Not a deliberately vulnerable app like DVWA or WebGoat — those are designed to be exploitable. You want a normal, production-like target where you can manually verify every finding.

httpbin.org is ideal for this. It's a public HTTP testing service (run by Postman) that accepts arbitrary requests and returns predictable responses. It has real TLS, real headers (or lack thereof), and real behavior. It's explicitly designed for testing HTTP clients — which is exactly what a security scanner is.

The Test Setup: 6 Skills, 1 Target, 0 Mocks

We tested 6 SecurityClaw passive scanning skills — the ones that only read responses and don't inject payloads or modify state. These are the skills you'd run in a CI/CD pipeline without needing explicit approval:

  • tls-crypto-auditor (A02 Cryptographic Failures): TLS version, cipher suites, certificate validity, HSTS
  • security-header-checker (A04 Insecure Design): 7 security response headers (CSP, X-Frame-Options, etc.)
  • logging-monitor-checker (A09 Logging/Monitoring): request tracing headers, rate limiting, debug endpoints, verbose errors
  • sri-checker (A08 Data Integrity): Subresource Integrity on external scripts/stylesheets
  • session-security-tester (A07 Auth Failures): cookie flags (Secure, HttpOnly, SameSite), session token entropy
  • xss-probe (A03 Injection): reflected XSS via 7 payload patterns across URL parameters

Each skill was run directly against https://httpbin.org using its run() method. No mocking, no stubbing, no test fixtures. Real HTTPS connections to a real server.

We excluded active injection skills (sqli-probe, ssrf-probe) because those modify state and require explicit target approval. Passive skills only — the kind you'd run without asking permission.
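The run itself needs nothing fancy. A minimal harness sketch, assuming a hypothetical interface where each skill exposes a run(target) callable returning a list of finding dicts (the actual SecurityClaw API may differ):

```python
# Hypothetical harness: iterate passive skills against one target and
# collect findings per skill. The skill names and run() signature are
# illustrative, not the actual SecurityClaw API.
from typing import Callable, Dict, List

def run_passive_suite(skills: Dict[str, Callable[[str], List[dict]]],
                      target: str) -> Dict[str, List[dict]]:
    results = {}
    for name, run in skills.items():
        # In a real scan, each run() makes live HTTPS requests.
        results[name] = run(target)
    return results

# Stub skill so the example runs offline:
def fake_header_checker(target: str) -> List[dict]:
    return [{"id": "missing-csp", "severity": "HIGH", "target": target}]

report = run_passive_suite({"security-header-checker": fake_header_checker},
                           "https://httpbin.org")
```

The point of keeping the harness this thin is that every finding traces directly to one skill, which makes the per-skill classification below straightforward.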

Skill-by-Skill Results

tls-crypto-auditor — 1 Finding, 100% Accurate

The TLS auditor checked 5 things and found 1 issue:

  • TLS Version: TLSv1.2 — acceptable, not flagged ✓
  • Cipher Suite: ECDHE-RSA-AES128-GCM-SHA256 — strong, not flagged ✓
  • Certificate: Valid, issued by Amazon RSA 2048 M03, expires August 2026 ✓
  • HSTS: Missing — flagged as LOW severity ✓
  • Weak Ciphers: None detected ✓

The HSTS finding is a true positive. httpbin.org does not send a Strict-Transport-Security header. For a testing service this is low-risk, but the scanner correctly identified it. No false positives.
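The HSTS portion of this check is a pure header lookup. A minimal sketch under that assumption, run against a captured header dict so it works offline (field names are illustrative, not the skill's actual code):

```python
# Sketch of a passive HSTS check: flag only when the
# Strict-Transport-Security header is absent from the response.
def check_hsts(headers: dict):
    lowered = {k.lower(): v for k, v in headers.items()}
    if "strict-transport-security" not in lowered:
        return {"id": "missing-hsts", "severity": "LOW"}
    return None  # header present: no finding

# Simplified httpbin.org response headers (no HSTS):
finding = check_hsts({"Content-Type": "application/json",
                      "Server": "gunicorn/19.9.0"})
```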

security-header-checker — 7 Findings, 100% Accurate

httpbin.org is missing all 7 security headers the skill checks:

  • Content-Security-Policy (HIGH) ✓
  • X-Content-Type-Options (MEDIUM) ✓
  • X-Frame-Options (MEDIUM) ✓
  • Referrer-Policy (MEDIUM) ✓
  • Permissions-Policy (MEDIUM) ✓
  • Cross-Origin-Opener-Policy (LOW) ✓
  • Cross-Origin-Resource-Policy (LOW) ✓

All 7 are true positives. httpbin.org is a testing service and doesn't set security headers — which is exactly what you'd expect. The skill correctly identified every gap without producing any false positives.
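This kind of check reduces to set membership over the response headers. A sketch mirroring the seven headers and severities above (the real skill may differ in detail):

```python
# Sketch of the 7-header passive check: report each expected
# security header that is absent from the response.
EXPECTED = {
    "content-security-policy": "HIGH",
    "x-content-type-options": "MEDIUM",
    "x-frame-options": "MEDIUM",
    "referrer-policy": "MEDIUM",
    "permissions-policy": "MEDIUM",
    "cross-origin-opener-policy": "LOW",
    "cross-origin-resource-policy": "LOW",
}

def missing_security_headers(headers: dict) -> list:
    present = {k.lower() for k in headers}
    return [{"header": h, "severity": sev}
            for h, sev in EXPECTED.items() if h not in present]

# httpbin.org sets none of the seven:
findings = missing_security_headers({"Content-Type": "application/json"})
```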

logging-monitor-checker — 2 Findings, 100% Accurate

Two issues found, two true negatives correctly ignored:

  • Missing request tracing (MEDIUM) ✓ — no X-Request-ID or X-Correlation-ID headers
  • Missing rate limiting (MEDIUM) ✓ — no X-RateLimit-* headers
  • Verbose errors: Not detected ✓ — httpbin.org doesn't expose stack traces on the root path
  • Debug endpoints: Not found ✓ — /debug/, /actuator/, etc. all return non-200

Note: the debug endpoint check makes 8 additional HTTP requests (one per known debug path). This is expected behavior — the skill probes common debug paths to see if they're exposed. All returned non-200, correctly classified as true negatives.
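The probe loop itself is simple. A sketch with the fetch function injected so it runs offline; the path list is illustrative, not necessarily the 8 paths the skill actually uses:

```python
# Sketch of the debug-endpoint probe: request a list of common debug
# paths and flag any that return HTTP 200. The fetch callable is
# injected (url -> status code) so the example needs no network.
COMMON_DEBUG_PATHS = ["/debug/", "/actuator/", "/console/", "/.env",
                      "/phpinfo.php", "/server-status", "/trace", "/env"]

def probe_debug_paths(base_url, fetch, paths=COMMON_DEBUG_PATHS):
    exposed = []
    for path in paths:
        if fetch(base_url.rstrip("/") + path) == 200:
            exposed.append(path)
    return exposed

# Simulated target where every debug path 404s (like httpbin.org):
exposed = probe_debug_paths("https://httpbin.org", lambda url: 404)
```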

sri-checker — 1 Finding, 100% Accurate

The SRI checker found that httpbin.org's Swagger UI page loads Google Fonts CSS without an integrity attribute:

<link href="https://fonts.googleapis.com/css?family=Open+Sans:400,700|Source+Code+Pro:300,600|Titillium+Web:400,600,700" rel="stylesheet">

This is a true positive and a real supply chain risk (OWASP A08). If Google Fonts CDN were compromised, the loaded CSS could be modified without the browser detecting it. Adding SRI hashes to external resources prevents this.
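Detecting this takes only an HTML parse and an attribute check. A stdlib-only sketch (a real checker would also restrict link tags to stylesheets and handle protocol-relative URLs):

```python
# Sketch of an SRI check: parse HTML and record external <script>
# and <link> URLs that lack an integrity attribute.
from html.parser import HTMLParser

class SRIChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        url = a.get("src") if tag == "script" else \
              a.get("href") if tag == "link" else None
        if url and url.startswith("http") and "integrity" not in a:
            self.findings.append(url)

checker = SRIChecker()
checker.feed('<link href="https://fonts.googleapis.com/css?family=Open+Sans" '
             'rel="stylesheet">')
```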

session-security-tester — 0 Findings, Correct

httpbin.org doesn't set any cookies on a plain GET request. The session security tester correctly found nothing to report. This is a true negative — the skill didn't invent findings where none exist.

This is actually an important validation. A bad scanner would flag "no session management detected" as a finding. A good scanner recognizes that not every endpoint uses sessions, and absence of cookies is not a vulnerability.
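A sketch of the flag check makes the point concrete: with no Set-Cookie headers, the correct output is an empty finding list, not a manufactured one (parsing is simplified relative to a real skill):

```python
# Sketch of the cookie-flag check: for each Set-Cookie header, report
# any of the three standard flags that are missing. No cookies means
# no findings, by design.
def check_cookie_flags(set_cookie_headers: list) -> list:
    findings = []
    for raw in set_cookie_headers:
        attrs = {part.strip().split("=")[0].lower() for part in raw.split(";")}
        name = raw.split("=", 1)[0]
        for flag in ("secure", "httponly", "samesite"):
            if flag not in attrs:
                findings.append({"cookie": name, "missing": flag})
    return findings

no_cookies = check_cookie_flags([])                      # httpbin.org case
weak = check_cookie_flags(["session=abc123; Path=/"])    # all 3 flags missing
```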

xss-probe — 14 Findings, ALL False Positives ⚠️

This is where it gets interesting. The XSS probe tested 7 payload patterns across 2 URL parameters on httpbin.org's /get endpoint. It reported 14 HIGH severity reflected XSS findings.

Every single one was a false positive.

Here's why: httpbin.org's /get endpoint echoes query parameters back in the response body — but as JSON, not HTML:

GET /get?q=<script>alert(1)</script>

Response (Content-Type: application/json):
{
  "args": {
    "q": "<script>alert(1)</script>"
  },
  ...
}

The payload appears in the response, but inside a JSON string with a Content-Type: application/json header. Browsers won't render this as HTML. There's no XSS vulnerability here — the payload is data, not executable code.

We'll dig into this bug in the next section, because it illustrates a fundamental challenge in automated XSS detection.

The XSS False Positive Bug: Why Content-Type Matters

The xss-probe skill uses a straightforward detection approach: inject a payload into a URL parameter, check if the payload appears in the response body. If it does, flag it as reflected XSS.

This works correctly for HTML responses. If you send <script>alert(1)</script> as a parameter and it appears unescaped in an HTML response, that's reflected XSS. The browser will execute it.

But the skill doesn't check the Content-Type header. Any endpoint that echoes input — REST APIs returning JSON, XML endpoints, plain text debug pages — will trigger a false positive. The payload appears in the response, but in a context where the browser won't execute it.

The Fix

The fix is a single check: before flagging a finding as XSS, verify that the response Content-Type is text/html or application/xhtml+xml. If the response is JSON, XML, or plain text, either skip the finding entirely or downgrade it to INFO severity.

This is a common pattern in security scanner development. The detection logic is correct in isolation — the payload IS reflected in the response. But the context matters. XSS requires an HTML rendering context. Without it, reflection is just echo, not vulnerability.
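A minimal sketch of the corrected check, assuming a simple detector that takes the payload, the response body, and the Content-Type header (illustrative names, not the actual skill code):

```python
# Sketch of reflected-XSS detection with the Content-Type guard:
# reflection only counts as a finding in an HTML rendering context.
HTML_TYPES = ("text/html", "application/xhtml+xml")

def is_reflected_xss(payload: str, body: str, content_type: str) -> bool:
    if payload not in body:
        return False                       # not reflected at all
    mime = content_type.split(";")[0].strip().lower()
    return mime in HTML_TYPES              # echo into JSON/XML/text is not XSS

payload = "<script>alert(1)</script>"
json_body = '{"args": {"q": "<script>alert(1)</script>"}}'
```

With the guard, httpbin.org's JSON echo no longer fires, while the same reflection inside an HTML body still does.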

Why This Matters for Your Scanner

If you're running any XSS scanner against a modern application with REST APIs, you're probably getting false positives on every API endpoint that echoes parameters. Most applications have dozens of these. That's dozens of false positives per scan, every scan, forever — unless your scanner checks Content-Type.

Ask your scanner vendor: does your XSS detection check the response Content-Type before flagging reflected payloads? If they can't answer that question, you have a false positive problem you don't know about yet.

Precision and Recall: The Numbers

Here's the full accuracy breakdown across all 6 skills:

  • tls-crypto-auditor: 1 TP, 0 FP, 4 TN (100% precision)
  • security-header-checker: 7 TP, 0 FP, 0 TN (100% precision)
  • logging-monitor-checker: 2 TP, 0 FP, 2 TN (100% precision)
  • sri-checker: 1 TP, 0 FP, 0 TN (100% precision)
  • session-security-tester: 0 TP, 0 FP, 1 TN (precision N/A)
  • xss-probe: 0 TP, 14 FP, 0 TN (0% precision)
  • Total: 11 TP, 14 FP, 7 TN (44% precision)

The aggregate precision of 44% looks bad — but it's entirely driven by one skill. Remove xss-probe and the remaining 5 skills have 100% precision across 11 true positives and 0 false positives.
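The arithmetic is worth making explicit, because precision is undefined for a rule that flags nothing. A sketch mirroring the numbers above:

```python
# Per-rule precision: TP / (TP + FP), undefined when nothing was
# flagged at all. Numbers mirror the results table.
def precision(tp: int, fp: int):
    return None if tp + fp == 0 else tp / (tp + fp)

results = {
    "tls-crypto-auditor": (1, 0),
    "security-header-checker": (7, 0),
    "logging-monitor-checker": (2, 0),
    "sri-checker": (1, 0),
    "session-security-tester": (0, 0),   # flagged nothing: precision N/A
    "xss-probe": (0, 14),
}
per_rule = {name: precision(tp, fp) for name, (tp, fp) in results.items()}
total_tp = sum(tp for tp, _ in results.values())
total_fp = sum(fp for _, fp in results.values())
aggregate = total_tp / (total_tp + total_fp)   # 11 / 25
```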

This is a common pattern in security scanning: most of your false positive noise comes from a small number of detection rules. Fixing those specific rules has an outsized impact on overall scanner usability.

After the Content-Type fix was applied to xss-probe, all 14 false positives were eliminated. The corrected aggregate precision across all 6 skills: 100%.

5 Lessons for Validating Any Security Scanner

1. Test Against Boring Targets, Not Vulnerable Ones

DVWA and WebGoat are great for learning, but they're terrible for validation. They're designed to be exploitable — everything is a true positive. You learn nothing about false positive rates.

Use a normal target like httpbin.org where you know the ground truth. A good scanner should find real issues (missing headers, missing HSTS) and NOT find issues that aren't there.

2. Measure Precision, Not Just Detection Count

"Found 25 vulnerabilities" means nothing without knowing how many are real. A scanner that finds 11 real issues and 0 false positives is infinitely more useful than one that finds 25 "issues" where 14 are noise.

Track true positives, false positives, and true negatives for every scan. If your precision drops below 80%, your team will stop reading the reports.

3. One Bad Rule Ruins the Whole Report

Our xss-probe produced more false positives (14) than all other skills produced true positives (11) combined. One detection rule with a Content-Type bug turned a clean report into a noisy one.

When you find a noisy rule, fix it immediately. Don't just suppress the findings — fix the detection logic. Every false positive that ships erodes trust in the entire scanner.

4. Passive Skills Are Surprisingly Accurate

The 5 passive skills (header checks, TLS audit, SRI, session analysis, logging) all achieved 100% precision. These skills check observable facts — either the header is present or it isn't, either the certificate is valid or it isn't.

Active injection skills (XSS, SQLi, SSRF) are inherently harder because they require understanding the response context. If your scanner separates passive and active checks, run the passive ones first — they'll give you a clean, accurate baseline.

5. Validate Before You Automate

If you're wiring a scanner into CI/CD, run it manually against a known target first. Document every finding. Verify each one. Fix any false positives. THEN automate it.

A scanner that produces false positives in CI/CD will either block deployments unnecessarily (if you fail on findings) or train your team to ignore security alerts (if you don't). Both outcomes are worse than not scanning at all.

A Validation Framework You Can Steal

Here's the process we used, generalized for any scanner:

  1. Pick a known target. httpbin.org for web scanners. A test AWS account for cloud scanners. A sample repo for SAST tools. The key requirement: you know the ground truth.
  2. Run every detection rule. Don't cherry-pick. Run the full scan so you see the full false positive surface.
  3. Classify every finding. True positive (real issue), false positive (not a real issue), or true negative (correctly didn't flag). Be honest — if you're not sure, investigate until you are.
  4. Calculate precision per rule. Which rules are accurate? Which ones produce noise? This tells you exactly where to focus improvement effort.
  5. Fix the noisy rules. Don't suppress findings — fix the detection logic. Our xss-probe fix was a single Content-Type check. Most false positive bugs have similarly simple fixes.
  6. Re-validate after fixes. Run the same scan again. Confirm the false positives are gone and the true positives are still detected. Regression testing for security rules.
  7. Document the baseline. Write down what findings the scanner produces against your known target. This becomes your regression baseline — if a future scanner update changes these results, you'll know immediately.
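Step 7 can be as simple as a set diff over stable finding identifiers. A sketch (finding IDs are illustrative):

```python
# Sketch of baseline regression checking: compare the current scan's
# finding IDs against the documented baseline for the known target.
def diff_against_baseline(baseline: set, current: set) -> dict:
    return {
        "new": sorted(current - baseline),        # findings not seen before
        "resolved": sorted(baseline - current),   # findings that disappeared
    }

baseline = {"missing-hsts", "missing-csp", "missing-x-frame-options"}
current  = {"missing-hsts", "missing-csp", "weak-cipher"}
delta = diff_against_baseline(baseline, current)
```

Either a new finding or a vanished one after a scanner update deserves investigation: the first may be a regression in precision, the second a regression in recall.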

This entire process took us about 30 minutes for 6 skills. The ROI is enormous: we found and fixed a bug that would have produced 14 false positives on every scan of every target with JSON API endpoints. That's hundreds of false positives prevented across every future scan.

The Bottom Line

Security scanners are only as good as their accuracy. A scanner that finds real issues without crying wolf is worth 10x more than one that finds everything including things that aren't there.

Validate your scanner against a real target. Measure precision per rule. Fix the noisy rules. Then automate with confidence.

5 of our 6 skills passed with 100% precision on the first try. The 6th had a clear, fixable bug. After the fix, all 6 are production-ready. That's the outcome you want from a validation exercise — not "everything works" (suspicious) or "everything's broken" (useless), but "here's exactly what works, here's what doesn't, and here's the fix."
