Methodology

Counting findings rewards volume, not skill. This ranking instead measures differentiated quality: when several firms review the same protocol, who catches what the others miss?

The leaderboard and trends are recomputed in your browser. Severity weights are anchored to Critical; sliders set the High and Medium weights relative to it, and a further slider sets the minimum number of co-audited protocols a firm needs to be ranked. (Solodit reports Critical findings as High, so we recover them from the report's severity tag.)

1. Source data

Findings come from Solodit, which aggregates reports from major audit firms and contest platforms. Each finding carries a firm, a protocol, a severity, and a date. We track a fixed roster of ~30 firms.

2. Co-audited protocols

We keep only protocols reviewed by two or more tracked firms. Firms often audit the same protocol months or years apart, so co-audit is defined across time, not within a single quarter. A protocol's “roster” is the set of firms that reported at least one finding on it.
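A minimal sketch of this filter, assuming each finding is a record with `firm` and `protocol` fields. The field names, firm names, and protocol names are hypothetical, chosen for illustration only:

```python
from collections import defaultdict

def co_audited_rosters(findings, min_firms=2):
    """Map each protocol to its roster, keeping protocols with >= min_firms firms."""
    rosters = defaultdict(set)
    for f in findings:
        # A firm joins the roster once it has reported at least one finding.
        rosters[f["protocol"]].add(f["firm"])
    return {p: firms for p, firms in rosters.items() if len(firms) >= min_firms}

# Hypothetical sample data, not taken from the real dataset.
findings = [
    {"firm": "FirmA", "protocol": "VaultX"},
    {"firm": "FirmB", "protocol": "VaultX"},
    {"firm": "FirmA", "protocol": "BridgeY"},
]
rosters = co_audited_rosters(findings)
```

Here only `VaultX` survives, because `BridgeY` has a single firm on its roster. Dates play no part in the filter, which matches the across-time definition of co-audit.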

3. Matching the same vulnerability

Different firms describe the same bug differently. We cluster findings on each protocol into distinct underlying issues — using a language model to judge semantic equivalence, with a deterministic token-overlap fallback. A clustering eval against a labeled set guards against drift.
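To illustrate what a deterministic token-overlap fallback could look like, here is a sketch that clusters finding titles by Jaccard similarity. The similarity measure, the 0.5 threshold, the greedy strategy, and the sample titles are all assumptions made for this example; the production matcher may differ:

```python
import re

def tokens(text):
    """Lowercased alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Share of tokens the two titles have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def cluster_titles(titles, threshold=0.5):
    """Greedy single pass: join the first cluster whose seed title is similar enough."""
    clusters = []
    for title in titles:
        for cluster in clusters:
            if jaccard(title, cluster[0]) >= threshold:
                cluster.append(title)
                break
        else:
            clusters.append([title])  # no match: start a new issue
    return clusters

# Hypothetical finding titles.
titles = [
    "Reentrancy in withdraw allows draining the vault",
    "Reentrancy in withdraw allows draining vault funds",
    "Missing access control on setOwner",
]
clusters = cluster_titles(titles)
```

The two reentrancy titles share 6 of 8 distinct tokens (0.75) and merge into one issue, while the access-control title starts its own cluster. Because the pass is order-dependent but has no randomness, the same input always yields the same clusters.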

4. Comparison protocols only

We only score a protocol when the matcher found at least one issue reported by two or more firms, which is evidence that the firms actually reviewed comparable scope. In real audit data most findings are unique to one firm, so crediting raw counts would just reward volume (a large contest “out-finds” any single firm). Restricting the score to comparison protocols means firms are ranked only where their reviews demonstrably overlapped.
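The comparison-protocol test can be sketched as follows, assuming each clustered issue is represented by the set of firms that reported it. This representation and the sample names are hypothetical:

```python
def is_comparison_protocol(issues):
    """issues: list of sets, each holding the firms that reported one clustered issue."""
    return any(len(firms) >= 2 for firms in issues)

# Hypothetical clustered output for two protocols.
protocols = {
    "VaultX": [{"FirmA", "FirmB"}, {"FirmA"}],  # one shared issue: scored
    "BridgeY": [{"FirmA"}, {"FirmB"}],          # no shared issue: skipped
}
comparison = [name for name, issues in protocols.items()
              if is_comparison_protocol(issues)]
```

`BridgeY` has two firms on its roster but no issue in common, so it is co-audited without being a comparison protocol and does not contribute to the score.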

5. Catch rate (the score)

On each comparison protocol, the clustered issues are the “known bug set.” A firm's catch rate is the severity-weighted share of that set it reported:

catchRate = Σ severityWeight(issues the firm caught)
            ─────────────────────────────────────────
            Σ severityWeight(all issues on the protocol)

Catching a bug a peer missed raises your rate and sits in everyone else's denominator.
The rate is capped at 100% and volume-fair: every additional bug found on a protocol raises the denominator for all firms on it equally.
The rate is pooled across all of a firm's comparison protocols; firms with too few comparisons are left unranked.

Severity weights (only these count — Low, Gas and Informational findings are dropped at ingestion):

Critical = 10, High = 5, Medium = 2
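A worked sketch of the catch rate with the weights above. It assumes that "pooled" means summing numerators and denominators across protocols (a ratio of sums), and that each issue already carries the maximum severity of its cluster. The data layout and firm names are hypothetical:

```python
WEIGHTS = {"Critical": 10, "High": 5, "Medium": 2}

def catch_rate(firm, protocols):
    """Pooled catch rate over the protocols whose roster includes `firm`.

    protocols: list of protocols, each a list of (severity, firms) issues.
    Returns None when the firm appears on no protocol.
    """
    caught = total = 0
    for issues in protocols:
        if not any(firm in firms for _, firms in issues):
            continue  # firm is not on this protocol's roster
        for severity, firms in issues:
            weight = WEIGHTS[severity]
            total += weight
            if firm in firms:
                caught += weight
    return caught / total if total else None

# One hypothetical comparison protocol with three clustered issues.
protocols = [[
    ("Critical", {"FirmA", "FirmB"}),
    ("High", {"FirmA"}),
    ("Medium", {"FirmB"}),
]]
rate_a = catch_rate("FirmA", protocols)  # (10 + 5) / 17
rate_b = catch_rate("FirmB", protocols)  # (10 + 2) / 17
```

Both firms share the denominator of 17. FirmA's unique High lifts its own rate to 15/17 and, by sitting in FirmB's denominator, holds FirmB at 12/17.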

For the time series, each protocol is attributed to the quarter of its earliest known audit date; protocols whose date can't be recovered still count in the all-time ranking but not the chart.
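The quarter attribution can be sketched as below, assuming recovered audit dates arrive as `datetime.date` values with `None` for dates that could not be recovered. The function names and the label format are hypothetical:

```python
from datetime import date

def quarter_label(d):
    """Format a date as a calendar quarter, e.g. 2023-Q4."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"

def protocol_quarter(audit_dates):
    """Quarter of the earliest known audit date, or None if no date is known."""
    known = [d for d in audit_dates if d is not None]
    return quarter_label(min(known)) if known else None

q = protocol_quarter([date(2024, 5, 10), None, date(2023, 11, 2)])
undated = protocol_quarter([None])
```

A protocol for which `protocol_quarter` returns `None` would be left off the chart while still counting toward the all-time ranking.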

Limitations

“Audited” is proxied by “reported ≥1 finding.” A firm that reviewed a protocol but reported nothing is invisible to the model.
Coverage depends on what Solodit has indexed; private or unpublished audits are absent.
Solodit's findings carry no reliable date, so findings are placed on the time series via a date recovered from the protocol name or report URL. Only a minority of findings are datable this way, so the all-time ranking is the more complete view.
Cross-firm comparison spans time: two firms may have audited different versions of the same protocol, so a “miss” can reflect code that changed between audits, not only reviewer skill.
Cross-firm matching is imperfect; a mis-merge or mis-split shifts credit. We measure this with the eval set but cannot eliminate it.
Severity is taken as reported; firms calibrate severity differently. We take the max severity within a matched cluster.
This is a heuristic signal of relative finding quality — not an endorsement or a complete picture of an audit's value.

Snapshot generated Tue, 23 Jun 2026 17:06:01 GMT · source: solodit