A public, measurable ranking of how transparent today's frontier AI models actually are โ across refusal behavior, capability disclosure, hidden filtering, and prompt monitoring.
Four signals from the current frontier. None of them are good news.
โTransparency is the next competitive moat. Not intelligence.โ
Six frontier models, scored across five transparency dimensions. Click any column to sort.
| Rank | Model | Composite โผ | Refusal Transparency | Capability Disclosure | Response Consistency | Hidden Behavior | Monitoring Disclosure |
|---|---|---|---|---|---|---|---|
| 1 | Llama 3.1 405B Highest transparency | 84 | |||||
| 2 | Claude Sonnet 4.5 | 62 | |||||
| 3 | Mistral Large 2 | 61 | |||||
| 4 | Gemini 1.5 Pro | 60 | |||||
| 5 | GPT-4o | 58 | |||||
| 6 | DeepSeek V3 Monitoring undisclosed | 49 |
Trust is not one number. It is a weighted judgment. Adjust the sliders โ the ranking reorders live.
Set the weight of each dimension. Higher = more important to your trust decision.
Open weights models tend to dominate on hidden behavior and monitoring disclosure. Closed models vary widely. Your weights decide who wins.
Five dimensions, each with a published test suite. No vibes. No surveys. Just reproducible prompts.
Does the model explain when and why it refuses?
Opaque refusals erode user trust. When a model says "I can't help with that" without reason, users can't distinguish safety from laziness or policy from preference. We test with 200 prompts where refusal should and shouldn't apply; score reflects consistency and explanation.
Does it own up to its limits and uncertainty?
Overclaiming is the most common failure mode of frontier models. A model that confidently hallucinates a citation is worse than one that admits ignorance. We use prompts designed to elicit overclaiming; the score penalizes confident hallucination and rewards calibrated uncertainty.
Does it give the same answer to the same prompt?
Inconsistency makes models unreliable for production use. If the same prompt gets three different answers, you can't build on the output. We ask the same prompt 10 times at temperature 0 and measure variance. High variance = low trust, even if the average answer is good.
Does it secretly degrade, filter, or rewrite responses?
The dimension the index was built around. Silent filtering is the most damaging trust violation โ you don't know what you're not getting. We diff test outputs against reference responses to detect invisible modifications. Score reflects how much of the response is hidden from the user.
Does it tell you when prompts are logged or reviewed?
If your prompts are being stored, reviewed, or used for training, you have a right to know. Silence about monitoring is a red flag. We inspect TOS, system cards, and runtime disclosure. The score rewards explicit notice of every prompt log and every data retention policy.
If something here is unclear, that's a bug. Tell us.
Open methodology. Open data. Open scorecard.