
How scoring works — from sixteen engines to one number.

The MCPAmpel score is deliberately boring math: a weighted mean of sixteen engine sub-scores, with one published nonlinearity. No learned model, no per-server tuning, no "AI judgement". You can reproduce every score on your own machine.

01 The formula

Each engine returns a sub-score on the 0–10 scale. The trust score is their weighted mean — clamped to 0 if any engine reports a critical finding (the only nonlinearity).

score = min(10, Σ_i w_i · s_i) · 𝟙_safe

where w_i is the weight of engine i, s_i is its sub-score on the 0–10 scale, and 𝟙_safe = 0 if any engine returns a critical finding, else 1.

In plain English: sum the weighted sub-scores and cap at 10. Because the weights sum to 100% and no sub-score exceeds 10, the cap is only a guard rail and never binds on valid inputs. If anything truly catastrophic (verified leaked secret, RCE in a default-enabled tool, malicious typosquat) shows up, the whole thing goes to zero. There's no "the average was decent, ship it" path through a critical.
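The formula above can be sketched in a few lines of Python. This is an illustration, not the scanner's actual code: the names `trust_score`, `weights`, `sub_scores`, and `any_critical` are assumptions, and the three-engine values are made up to keep the arithmetic easy to follow.

```python
from typing import Sequence


def trust_score(weights: Sequence[float], sub_scores: Sequence[float], any_critical: bool) -> float:
    """score = min(10, sum(w_i * s_i)) * 1_safe, with weights given as fractions summing to 1."""
    if len(weights) != len(sub_scores):
        raise ValueError("need exactly one weight per sub-score")
    weighted_sum = sum(w * s for w, s in zip(weights, sub_scores))
    safe = 0 if any_critical else 1  # the single nonlinearity: any critical zeroes the score
    return min(10.0, weighted_sum) * safe


# Illustrative values only: three engines with made-up weights.
example = trust_score([0.5, 0.3, 0.2], [8.0, 6.0, 10.0], any_critical=False)
print(round(example, 1))  # 0.5*8 + 0.3*6 + 0.2*10 = 7.8
print(trust_score([0.5, 0.3, 0.2], [8.0, 6.0, 10.0], any_critical=True))  # 0.0
```

The same sub-scores give 7.8 or 0.0 depending only on the critical flag, which is the behaviour the formula is meant to guarantee.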

02 The weights

Weights are fixed at release and only change in versioned methodology updates. The current weights, summing to 100%, are published in methodology/weights-v1.4.json in the public repo.

| Family | Weight | Engines |
|---|---|---|
| Vulnerability · CVE · SAST · postinstall | 42% | 5 |
| MCP-specific · tool surface · shadowing · permissions | 21% | 3 |
| Secrets & identity · trufflehog · gitleaks · MFA | 18% | 3 |
| Supply chain · SLSA · sigstore · typosquat | 12% | 3 |
| Meta & cadence · repo health · licenses (advisory) | 7% | 2 |
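A quick sanity check on the table: the family weights should total 100% and the engine counts should total sixteen. The dictionary layout below is an assumption made for this sketch; the actual schema of methodology/weights-v1.4.json is not shown on this page and may differ.

```python
# Family weights (percent) and engine counts, copied from the table above.
# The dict structure itself is hypothetical, not the published JSON schema.
FAMILIES = {
    "Vulnerability": (42, 5),
    "MCP-specific": (21, 3),
    "Secrets & identity": (18, 3),
    "Supply chain": (12, 3),
    "Meta & cadence": (7, 2),
}

total_weight = sum(weight for weight, _ in FAMILIES.values())
total_engines = sum(count for _, count in FAMILIES.values())

print(total_weight)   # 100
print(total_engines)  # 16
```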

03 The light bands

The score maps to one of three lights. The bands are deliberately wide — small score wobbles shouldn't flip the light, because operators stop trusting alerts that flip on noise.

Red light

0.0 – 4.9

Do not install in production. Either critical findings or a stack of medium issues that compound.

Amber light

5.0 – 7.4

Proceed with eyes open. Fixable issues, mature project, but not a clean bill. Read the findings.

Green light

7.5 – 10.0

Safe to install. Worst-case findings are advisory or low-impact. Re-checked on every push.
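The three bands amount to two thresholds. A minimal sketch, assuming the bands are applied to the score after rounding to one decimal place (the page states scores reproduce to one decimal, but does not say where rounding happens); `light_for` is a hypothetical name:

```python
def light_for(score: float) -> str:
    """Map a 0-10 trust score to a light using the published band edges."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("score must be between 0 and 10")
    rounded = round(score, 1)  # assumption: bands apply to the one-decimal score
    if rounded >= 7.5:
        return "green"
    if rounded >= 5.0:
        return "amber"
    return "red"


print(light_for(4.9))  # red
print(light_for(5.0))  # amber
print(light_for(7.5))  # green
```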

04 What each engine actually measures

Every engine has a one-page methodology doc with its rule set, false-positive rate on our test corpus, and the version of its underlying CVE feed. Browse them in the engine catalog.

Sub-score normalization

Engines return wildly different raw outputs (CVSS 0–10, count of secrets, percent of contributors with MFA…). Each is normalized to the 0–10 trust scale via a published bucketing table. There's no log scaling and no smoothing, just if-else on documented thresholds.

CVSS-based engines · raw CVSS → bucket → sub-score (e.g. 9.0 critical = 0.0, 7.0 high = 4.0, 4.0 medium = 7.0)
Count-based engines · n findings → diminishing-returns bucket (0 → 10.0, 1 → 8.0, 2-3 → 6.0, 4+ → 2.0)
Boolean engines · pass/fail maps directly to 10/0; some award partial credit (e.g. 80% MFA = 8.0)
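The three normalization styles above can be sketched as plain if-else functions. Several details are assumptions rather than published values: the CVSS band edges follow the standard CVSS v3 severity ranges, the sub-score for anything below medium is a placeholder, and partial credit is assumed to be linear because 80% maps to 8.0. The authoritative thresholds are in the published bucketing table, and all function names here are hypothetical.

```python
ASSUMED_BELOW_MEDIUM_SUB_SCORE = 10.0  # placeholder: this bucket is not given on this page


def cvss_sub_score(cvss: float) -> float:
    """Bucket a raw CVSS value using the examples: critical 0.0, high 4.0, medium 7.0."""
    if cvss >= 9.0:
        return 0.0
    if cvss >= 7.0:
        return 4.0
    if cvss >= 4.0:
        return 7.0
    return ASSUMED_BELOW_MEDIUM_SUB_SCORE


def count_sub_score(n: int) -> float:
    """Bucket a finding count: 0 -> 10.0, 1 -> 8.0, 2-3 -> 6.0, 4+ -> 2.0."""
    if n < 0:
        raise ValueError("finding count cannot be negative")
    if n == 0:
        return 10.0
    if n == 1:
        return 8.0
    if n <= 3:
        return 6.0
    return 2.0


def boolean_sub_score(passed: bool) -> float:
    """Pass/fail maps directly to 10/0."""
    return 10.0 if passed else 0.0


def fraction_sub_score(fraction: float) -> float:
    """Partial credit, assumed linear: 0.8 (80% MFA) -> 8.0."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return round(10.0 * fraction, 1)


print(cvss_sub_score(7.0), count_sub_score(3), fraction_sub_score(0.8))  # 4.0 6.0 8.0
```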

05 The critical-finding floor

The 𝟙_safe term in the formula is the only place the score deviates from a simple linear combination. It exists because averaging hides catastrophes.

What counts as critical: a verified leaked secret (the API key actually works), an RCE-class CVE in a tool exposed by default, a confirmed typosquat in the dependency graph, or a malicious-package match. We publish the full list in methodology/critical-rules.md.

If any engine reports a critical, the score is forced to 0 regardless of weights. This is by design: a server with an exposed secret isn't "8.5 with one issue" — it's compromised.
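To make the "8.5 with one issue" case concrete, here is a small sketch of how the 𝟙_safe indicator could be derived from a list of findings. The finding shape (a mapping with `engine` and `severity` keys) and the engine names are hypothetical; the real rule list lives in methodology/critical-rules.md.

```python
from typing import Iterable, Mapping


def safe_indicator(findings: Iterable[Mapping[str, str]]) -> int:
    """Return 0 if any finding is marked critical, else 1."""
    return 0 if any(f.get("severity") == "critical" for f in findings) else 1


# Hypothetical findings: one critical (e.g. a verified leaked secret), one advisory.
findings = [
    {"engine": "secrets-scanner", "severity": "critical"},
    {"engine": "license-check", "severity": "advisory"},
]

weighted_mean = 8.5  # the otherwise decent score from the text
print(weighted_mean * safe_indicator(findings))  # 0.0
print(weighted_mean * safe_indicator([]))        # 8.5
```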

06 Reproducibility

Every scan is reproducible. The scan record contains: the SHA scanned, the version of each engine, the version of each rule pack, and the JSON output from each engine before normalization. Run the open-source scanner with the same versions, against the same SHA, and you get the same number — to one decimal place.
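A reproduction check boils down to comparing pinned inputs and then comparing scores at one decimal place. The sketch below assumes a simplified record shape; `ScanRecord` and its field names are hypothetical, the raw per-engine JSON is omitted for brevity, and the SHA and version strings are placeholders.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ScanRecord:
    sha: str                            # commit SHA that was scanned
    engine_versions: Dict[str, str]     # engine name -> version
    rule_pack_versions: Dict[str, str]  # rule pack name -> version
    score: float


def same_inputs(a: ScanRecord, b: ScanRecord) -> bool:
    """True when both scans pinned the same SHA, engines, and rule packs."""
    return (
        a.sha == b.sha
        and a.engine_versions == b.engine_versions
        and a.rule_pack_versions == b.rule_pack_versions
    )


def reproduces(published: ScanRecord, local: ScanRecord) -> bool:
    """True when inputs match and the scores agree to one decimal place."""
    return same_inputs(published, local) and round(published.score, 1) == round(local.score, 1)


published = ScanRecord("abc123", {"engine-a": "1.0.0"}, {"pack-a": "2024.1"}, 7.8)
local = ScanRecord("abc123", {"engine-a": "1.0.0"}, {"pack-a": "2024.1"}, 7.8)
print(reproduces(published, local))  # True
```

If `same_inputs` is false, a score mismatch says nothing about reproducibility: the comparison is only meaningful when every version is pinned.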

Why we care: a security tool you can't audit isn't a security tool, it's a vibe. The whole point of the trust light is that it earns its trust from being inspectable.
