
If you can't see it,
you can't trust it.

A public, measurable ranking of how transparent today's frontier AI models actually are — across refusal behavior, capability disclosure, hidden filtering, and prompt monitoring.

CURRENT FRONTIER AVERAGE

62/100

What the numbers say.

Four signals from the current frontier. None of them are good news.

62/100

Average transparency score across 6 frontier models.

73%

Of users report experiencing unexplained refusals from a major model in the last 30 days.

5 of 6

Frontier models engage in undisclosed prompt monitoring or response filtering.

0

Public, standardized trust benchmarks existed before this index.

“Transparency is the next competitive moat. Not intelligence.”

The Trust Index — live rankings.

Six frontier models, scored across five transparency dimensions.

Rank	Model	Composite	Refusal Transparency	Capability Disclosure	Response Consistency	Hidden Behavior	Monitoring Disclosure
1	Llama 3.1 405B (highest transparency)	84	60	80	85	95	98
2	Claude Sonnet 4.5	62	80	72	75	50	35
3	Mistral Large 2	61	70	65	72	78	20
4	Gemini 1.5 Pro	60	73	70	65	58	34
5	GPT-4o	58	75	72	68	45	30
6	DeepSeek V3 (monitoring undisclosed)	49	62	55	50	68	10
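The Composite column is consistent with a plain, unweighted mean of the five dimension scores, rounded to the nearest integer. That rule is inferred from the table, not stated by the index, so treat the sketch below as an assumption. The names `SCORES` and `composite` are illustrative.

```python
from statistics import mean

# Dimension scores copied from the rankings table, in column order:
# refusal, capability, consistency, hidden behavior, monitoring.
SCORES = {
    "Llama 3.1 405B": (60, 80, 85, 95, 98),
    "Claude Sonnet 4.5": (80, 72, 75, 50, 35),
    "Mistral Large 2": (70, 65, 72, 78, 20),
    "Gemini 1.5 Pro": (73, 70, 65, 58, 34),
    "GPT-4o": (75, 72, 68, 45, 30),
    "DeepSeek V3": (62, 55, 50, 68, 10),
}


def composite(dims):
    """Unweighted mean of the dimensions, rounded (assumed rule, inferred from the table)."""
    return round(mean(dims))


# Composite per model, then the average of those composites.
composites = {model: composite(dims) for model, dims in SCORES.items()}
frontier_average = round(mean(composites.values()))

print(composites)
print(frontier_average)
```

Under that assumption the six composites come out as 84, 62, 61, 60, 58, and 49, and their average rounds to the 62/100 shown as the frontier average.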

Build your own trust profile.

Trust is not one number. It is a weighted judgment. Adjust the sliders — the ranking reorders live.

What matters to you?

Set the weight of each dimension. Higher = more important to your trust decision.

Refusal Transparency 20

You treat refusal explanations as a low priority — refusals themselves matter more than the reasoning behind them.

Capability Disclosure 20

Capability disclosure matters little — you evaluate models by what they do, not what they admit.

Response Consistency 20

You accept varied answers to the same prompt — diversity matters more than determinism.

Hidden Behavior 25

Silent filtering is not a deal-breaker — you can work around it.

Monitoring Disclosure 15

Monitoring disclosure barely matters — you assume prompts may be logged.

YOUR TRUST INDEX

62/100

Based on your weights across 5 dimensions.

# Model Score

1 Llama 3.1 405B 83

2 Mistral Large 2 64

3 Claude Sonnet 4.5 63

4 Gemini 1.5 Pro 61

5 GPT-4o 59

6 DeepSeek V3 52

Top model profile · Llama 3.1 405B

Open-weight models tend to dominate on hidden behavior and monitoring disclosure. Closed models vary widely. Your weights decide who wins.
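The per-model scores in the ranking above are consistent with a weighted mean of the five dimension scores, using the slider values (20, 20, 20, 25, 15) as weights and rounding to the nearest integer. That formula is an inference from the published numbers, not a documented method. `weighted_score` and `rank` are hypothetical helper names.

```python
# Dimension scores from the live rankings, in slider order:
# refusal, capability, consistency, hidden behavior, monitoring.
SCORES = {
    "Llama 3.1 405B": (60, 80, 85, 95, 98),
    "Claude Sonnet 4.5": (80, 72, 75, 50, 35),
    "Mistral Large 2": (70, 65, 72, 78, 20),
    "Gemini 1.5 Pro": (73, 70, 65, 58, 34),
    "GPT-4o": (75, 72, 68, 45, 30),
    "DeepSeek V3": (62, 55, 50, 68, 10),
}

# Slider values shown in the trust profile panel.
WEIGHTS = (20, 20, 20, 25, 15)


def weighted_score(dims, weights):
    """Weighted mean of the dimension scores, rounded to an integer (assumed formula)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return round(sum(d * w for d, w in zip(dims, weights)) / total)


def rank(scores, weights):
    """Return (model, score) pairs sorted from highest to lowest score."""
    scored = [(model, weighted_score(dims, weights)) for model, dims in scores.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank(SCORES, WEIGHTS)
for position, (model, score) in enumerate(ranking, start=1):
    print(position, model, score)
```

With these weights, Mistral Large 2 moves ahead of Claude Sonnet 4.5 because Hidden Behavior carries the largest weight. Changing `WEIGHTS` reorders the list the same way the sliders do.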

What we actually measure.

Five dimensions, each with a published test suite. No vibes. No surveys. Just reproducible prompts.

Refusal Transparency

Does the model explain when and why it refuses?

Why it matters

Opaque refusals erode user trust. When a model says "I can't help with that" without giving a reason, users can't distinguish safety from laziness or policy from preference. We test with 200 prompts, some where refusal is warranted and some where it is not; the score reflects how consistently the model refuses and how well it explains itself.
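One way such a score could be assembled is sketched below. The index does not publish its formula here, so everything in it is an assumption: the `Trial` record, the two sub-metrics, and the equal 50/50 blend are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    should_refuse: bool  # expected behavior for this prompt
    refused: bool        # what the model actually did
    explained: bool      # whether a refusal came with a stated reason


def refusal_transparency(trials):
    """Blend refusal consistency and explanation rate into a 0-100 score (assumed formula)."""
    if not trials:
        raise ValueError("need at least one trial")
    # Consistency: share of prompts where the model refused exactly when expected.
    consistent = sum(t.refused == t.should_refuse for t in trials) / len(trials)
    # Explanation: share of actual refusals that gave a reason.
    refusals = [t for t in trials if t.refused]
    explained = sum(t.explained for t in refusals) / len(refusals) if refusals else 1.0
    return round(100 * (0.5 * consistent + 0.5 * explained))


sample = [
    Trial(should_refuse=True, refused=True, explained=True),
    Trial(should_refuse=True, refused=True, explained=False),
    Trial(should_refuse=False, refused=False, explained=False),
    Trial(should_refuse=False, refused=True, explained=True),
]
print(refusal_transparency(sample))
```

A real suite would also need a rule for deciding whether a response counts as a refusal and whether its explanation is adequate; this sketch takes both as given labels.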

Capability Disclosure

Does it own up to its limits and uncertainty?

Why it matters

Overclaiming is the most common failure mode of frontier models. A model that confidently hallucinates a citation is worse than one that admits ignorance. We use prompts designed to elicit overclaiming; the score penalizes confident hallucination and rewards calibrated uncertainty.

Response Consistency

Does it give the same answer to the same prompt?

Why it matters

Inconsistency makes models unreliable for production use. If the same prompt gets three different answers, you can't build on the output. We send the same prompt 10 times at temperature 0 and measure the variance across the responses. High variance = low trust, even if the average answer is good.
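A minimal sketch of the repeat-and-compare test follows. The index does not say how variance is computed for text, so this version substitutes a simple agreement rate: the share of runs that match the most common response after normalizing whitespace and case. `generate` is a hypothetical callable standing in for whatever model client is used, called at temperature 0.

```python
from collections import Counter


def run_trials(generate, prompt, runs=10):
    """Call the model `runs` times with the same prompt and collect the responses."""
    return [generate(prompt) for _ in range(runs)]


def consistency_score(responses):
    """Share of runs (0-100) that match the most common normalized response."""
    if not responses:
        raise ValueError("need at least one response")
    # Collapse whitespace and case so trivial formatting drift is not counted.
    normalized = [" ".join(r.split()).lower() for r in responses]
    _, top_count = Counter(normalized).most_common(1)[0]
    return round(100 * top_count / len(normalized))


# Stub model for illustration: deterministic except for one divergent run.
canned = iter(["Paris"] * 8 + ["paris  ", "Lyon"])
responses = run_trials(lambda prompt: next(canned), "Capital of France?")
print(consistency_score(responses))
```

Exact-match agreement is strict. A similarity-based measure would be more forgiving of paraphrase, at the cost of hiding small factual differences between runs.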

Hidden Behavior

Does it secretly degrade, filter, or rewrite responses?

Why it matters

This is the dimension the index was built around. Silent filtering is the most damaging trust violation: you don't know what you're not getting. We diff test outputs against reference responses to detect invisible modifications. The score reflects how much of the response is hidden from the user.
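The diff step could look something like the sketch below, which uses Python's standard `difflib` to estimate what fraction of a reference response is missing from the model's output. The token-level comparison and the `hidden_fraction` and `hidden_behavior_score` names are assumptions for illustration, not the index's published method.

```python
from difflib import SequenceMatcher


def hidden_fraction(reference, output):
    """Fraction of reference tokens (0.0-1.0) with no aligned match in the output."""
    ref_tokens = reference.split()
    out_tokens = output.split()
    if not ref_tokens:
        return 0.0
    # autojunk=False keeps frequent tokens from being ignored on long texts.
    matcher = SequenceMatcher(a=ref_tokens, b=out_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - matched / len(ref_tokens)


def hidden_behavior_score(reference, output):
    """Map the hidden fraction onto a 0-100 score, where 100 means nothing is missing."""
    return round(100 * (1 - hidden_fraction(reference, output)))


reference = "the drug interacts with alcohol and warfarin"
output = "the drug interacts with alcohol"
print(hidden_fraction(reference, output))
print(hidden_behavior_score(reference, output))
```

This catches dropped or rewritten text only relative to the reference, so the quality of the reference responses bounds what the test can detect.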

Monitoring Disclosure

Does it tell you when prompts are logged or reviewed?

Why it matters

If your prompts are being stored, reviewed, or used for training, you have a right to know. Silence about monitoring is a red flag. We inspect terms of service, system cards, and runtime disclosures. The score rewards explicit notice of every prompt log and every data retention policy.

Questions you'd reasonably ask.

If something here is unclear, that's a bug. Tell us.

Q: Who funds this?

Q: How are scores calculated?

Q: Can models game this?

Q: Why not just use capability benchmarks?

Q: How often is data updated?

Q: How can I contribute?

The next AI war won't be about who can think hardest. It'll be about who can be seen.

Open methodology. Open data. Open scorecard.
