The scoring math, published in full.
Every weight, every cap, every nonlinearity. If you cannot reproduce a score on your own machine, it does not count as a verdict — so here is the spec, the table, and the code.
A scoring methodology that lives only in a marketing PDF is not a methodology. It is a brand. This article is the source of truth. Every number on the site, every badge, every bar in every chart traces back to the formula below.
If you can't reproduce a score on your own machine, it doesn't count as a verdict. (Creed 01 / 03)
The whole thing in nine lines.
# Per-engine deduction (capped at 3.0)
def engine_deduction(findings, engine):
    raw = sum(severity_w[f.severity] * engine_w[engine] for f in findings)
    return min(raw, 3.0)

# Final trust score
def trust_score(scan):
    deductions = sum(engine_deduction(scan[e], e) for e in ENGINES)
    return max(0.0, min(10.0, 10.0 - deductions))
That's it. There is no machine-learning model, no per-server tuning, no learned ensemble. We trust auditable arithmetic over benchmark-tuned vibes.
Severity weights (CVSS-aligned).
| Severity | CVSS range | Weight |
|---|---|---|
| Critical | 9.0 — 10.0 | 1.00 |
| High | 7.0 — 8.9 | 0.60 |
| Medium | 4.0 — 6.9 | 0.30 |
| Low | 0.1 — 3.9 | 0.10 |
| Info | — | 0.00 |
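A minimal sketch of the table above as a lookup. The helper name `severity_from_cvss` and the lowercase bucket keys are illustrative assumptions, as is the choice to treat a missing score or a score of 0.0 as Info; the weights themselves are taken from the table.

```python
# Severity weights, copied from the table above.
SEVERITY_WEIGHT = {
    "critical": 1.00,
    "high": 0.60,
    "medium": 0.30,
    "low": 0.10,
    "info": 0.00,
}


def severity_from_cvss(cvss):
    """Map a CVSS base score to a severity bucket (hypothetical helper).

    Assumption: a missing score, or a score of 0.0, counts as Info.
    """
    if cvss is None or cvss == 0.0:
        return "info"
    if not 0.0 < cvss <= 10.0:
        raise ValueError(f"CVSS score out of range: {cvss}")
    if cvss >= 9.0:
        return "critical"
    if cvss >= 7.0:
        return "high"
    if cvss >= 4.0:
        return "medium"
    return "low"


# A finding with CVSS 7.5 lands in the High bucket and carries weight 0.60.
weight = SEVERITY_WEIGHT[severity_from_cvss(7.5)]
```

CVSS base scores carry one decimal place, so the ranges in the table are contiguous and the `>=` comparisons cover every valid score.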
Engine weights (signal quality).
| Bucket | Weight | Engines |
|---|---|---|
| High signal | 1.0 | OSV Scanner · Semgrep · Trivy |
| Solid general-purpose | 0.7 | Bandit · detect-secrets · Grype · Gitleaks |
| Moderate | 0.5 | Custom YARA · MCP Guardian · Checkov |
| Noisy | 0.3 | npm audit · pip-audit (verbose modes) |
| Informational | 0.0 | Syft · ScanCode · Cisco AIBOM |
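With both tables defined, the formula can be run end to end. The sketch below is one way to wire it up, not the project's actual module: the engine keys (`"osv-scanner"`, `"mcp-guardian"`, and so on), the `Finding` tuple, and the use of `scan.get(e, [])` for engines that returned nothing are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical finding record; only the severity matters to the formula.
Finding = namedtuple("Finding", ["severity"])

severity_w = {"critical": 1.00, "high": 0.60, "medium": 0.30, "low": 0.10, "info": 0.00}

# Engine weights from the table above; the key spellings are assumptions.
engine_w = {
    "osv-scanner": 1.0, "semgrep": 1.0, "trivy": 1.0,
    "bandit": 0.7, "detect-secrets": 0.7, "grype": 0.7, "gitleaks": 0.7,
    "yara": 0.5, "mcp-guardian": 0.5, "checkov": 0.5,
    "npm-audit": 0.3, "pip-audit": 0.3,
    "syft": 0.0, "scancode": 0.0, "cisco-aibom": 0.0,
}
ENGINES = list(engine_w)
ENGINE_CAP = 3.0


def engine_deduction(findings, engine):
    raw = sum(severity_w[f.severity] * engine_w[engine] for f in findings)
    return min(raw, ENGINE_CAP)


def trust_score(scan):
    # Assumption: an engine missing from the scan contributes no deduction.
    deductions = sum(engine_deduction(scan.get(e, []), e) for e in ENGINES)
    return max(0.0, min(10.0, 10.0 - deductions))


scan = {
    "semgrep": [Finding("critical"), Finding("high")],  # (1.00 + 0.60) * 1.0 = 1.60
    "bandit": [Finding("medium")] * 2,                  # 2 * 0.30 * 0.7   = 0.42
    "syft": [Finding("low")] * 50,                      # weight 0.0, no deduction
}
score = trust_score(scan)
print(round(score, 2))  # 7.98
```

The fifty Syft findings leave the score untouched: an informational engine has weight 0.0, so it can report as much as it likes without moving the number.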
From score to light.
| Light | Score | Meaning · README badge |
|---|---|---|
| Red | 0.0 — 4.9 | Do not connect without remediation. Critical findings present. |
| Amber | 5.0 — 7.0 | Connect with caution. Known issues; review before granting credentials. |
| Green | 7.1 — 10.0 | No high-severity findings detected by 16 independent engines. |
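The bands translate to a three-way comparison. The sketch below assumes the score is rounded to one decimal before the lookup, since the table is published at one-decimal precision and does not say where a value such as 4.95 belongs; the function name `light` is illustrative.

```python
def light(score):
    """Map a trust score to a traffic light, following the table above.

    Assumption: the score is rounded to one decimal first, so every value
    falls inside exactly one published band.
    """
    s = round(score, 1)
    if not 0.0 <= s <= 10.0:
        raise ValueError(f"score out of range: {score}")
    if s <= 4.9:
        return "red"
    if s <= 7.0:
        return "amber"
    return "green"


for example in (4.9, 5.0, 7.0, 7.1):
    print(example, light(example))
```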
Why the per-engine cap is 3.0.
Without a cap, a single noisy engine could drive any score to zero. detect-secrets averages ~100 findings per flagged repo, most of which are entropy false positives. Letting it deduct without limit would make every repo with a base64 string score 0/10. The cap forces engine results to compete: a repo only goes red when multiple, independent tools agree.
The cap of 3.0 is calibrated so that:
- Five engines flagging at full severity drives a score to red.
- Two engines flagging at high severity puts a repo in amber.
- One noisy engine cannot move a repo more than 30% down the scale.
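The effect of the cap is easy to check by hand. The sketch below works through the detect-secrets case, assuming (for illustration only) that all 100 findings are rated Medium; the helper names are hypothetical, while the weights and the cap come from the tables and the formula.

```python
ENGINE_CAP = 3.0
SCALE_MAX = 10.0


def capped(raw):
    """Apply the per-engine cap to a raw deduction."""
    return min(raw, ENGINE_CAP)


def clamp(score):
    """Keep a score on the 0-10 scale."""
    return max(0.0, min(SCALE_MAX, score))


# 100 findings, assumed Medium (0.30), from detect-secrets (engine weight 0.7).
raw = 100 * 0.30 * 0.7                          # about 21.0

uncapped_score = clamp(SCALE_MAX - raw)          # 0.0: one engine zeroes the repo
capped_score = clamp(SCALE_MAX - capped(raw))    # 7.0: one engine costs at most 3.0

# Largest share of the scale a single engine can remove.
max_share = ENGINE_CAP / SCALE_MAX               # 0.3, i.e. 30%

# Fewest fully capped engines needed to push a score below 5.0.
engines_needed_for_red = 0
while SCALE_MAX - engines_needed_for_red * ENGINE_CAP >= 5.0:
    engines_needed_for_red += 1

print(uncapped_score, capped_score, max_share, engines_needed_for_red)
```

Under these assumptions a lone engine can lower a clean repo to 7.0 at worst, and it takes at least two engines at the cap to reach red, which is the sense in which tools have to agree.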
Reproduce any score locally.
# Clone, install, score
git clone https://github.com/diemoeve/mcpampel
cd mcpampel
uv sync
uv run mcpampel score --repo <url> --json
The output is a JSON document with every engine's findings, every applied weight, and the final score arithmetic shown step-by-step. If your number differs from the website's, that is a bug — file it.