The AI Safety Report That Won't Tell You Everything (And What You Actually Need to Know)


Anthropic just released a safety report claiming Claude poses no risk. The real story? What they didn't test.


Anthropic's Self-Assessment: What the RSP Actually Tests

Anthropic just dropped its latest safety report, and if you read only the headline, you'd think Claude's perfectly safe. The company's Responsible Scaling Policy (RSP) is their internal framework for evaluating AI risk, with levels ranging from ASL-1 (basically harmless) to ASL-4 (catastrophic risk territory).

Claude sits at ASL-2, which Anthropic defines as models whose capabilities don't exceed what's already publicly available. To maintain this classification, they tested for three main threat categories: CBRN (chemical, biological, radiological, and nuclear weapons), cybersecurity exploits, and autonomous behavior that could spiral out of control.

The tests are rigorous within their scope. Can Claude help someone synthesize a dangerous pathogen? Can it hack into systems? Will it start acting on its own without human oversight? According to Anthropic's evaluation, the answer to all three is no.

The Gaps: What Didn't Make the Cut

Here's where it gets interesting. Safety reports are as much about what they test as what they don't. And Anthropic's RSP conveniently skips three significant risk categories.

Advanced jailbreaking doesn't appear anywhere in the methodology. We're not talking about simple role-play jailbreaks, the "pretend you're an evil AI" stuff that gets patched in a week. We're talking about sophisticated, multi-step techniques that exploit edge cases in the model's training. The kind of attacks that take research to discover and weeks to fix. Those aren't tested.

Conversational dark patterns also get a pass. Can Claude manipulate you emotionally? Can it use rhetorical techniques to make bad ideas sound reasonable? Can it exploit cognitive biases in ways that feel helpful but actually aren't? The report doesn't say, because it doesn't ask.

Then there's the issue of subtle reasoning errors. Not the obvious failures where Claude hallucinates a fact or invents a citation. The dangerous ones are the arguments that sound airtight but contain logical flaws you won't catch unless you're really paying attention. That's not in the threat model either.

The Boeing Problem: When Companies Grade Their Own Homework

There's a pattern here, and it's not unique to AI. Remember the Boeing 737 MAX? The FAA let Boeing essentially self-certify critical safety systems. Boeing's engineers ran the tests, evaluated the results, and assured regulators everything was fine. Two planes crashed, killing 346 people.

Self-assessment creates a fundamental conflict of interest. Anthropic has every incentive to design tests that Claude will pass. Not because they're malicious, but because finding a critical flaw means delayed releases, unhappy investors, and a competitive disadvantage against OpenAI and Google.

The financial audit analogy works well here. Imagine a company publishing an audit that only examines profitable divisions and conveniently ignores the money-losing ones. You wouldn't trust the results. The same principle applies to AI safety reports.

What This Means for Actual Users

If you're using Claude in your workflow, the ASL-2 classification doesn't mean you can trust it blindly. The report tells you Claude probably won't help terrorists or hack the Pentagon. That's good news, but it's not the same as "this AI won't mislead you in everyday use."

Here's the practical takeaway: verify complex reasoning, especially in high-stakes decisions. If Claude's giving you legal advice, medical information, or financial guidance, double-check the logic. Don't just scan for factual errors; look for reasoning gaps.

Don't trust emotional appeals. If Claude's response makes you feel a certain way, that's a feature of its training, not proof that it's right. The model learned to sound confident and reassuring. That doesn't make it correct.

Stay skeptical of company-published safety reports. Not just Anthropic's, everyone's. Until we have independent third-party auditing for AI systems, the way we have for financial statements or medical devices, treat these reports as marketing documents that happen to contain useful data.

The RSP framework is a step forward. Anthropic's doing more safety testing than most competitors. But "better than average" isn't the same as "good enough," and what they're not testing might matter more than what they are.


Frequently asked questions

What is Anthropic's RSP?
The RSP (Responsible Scaling Policy) is Anthropic's internal safety framework. It defines risk levels (ASL-1 through ASL-4) for AI models. Claude is currently classified as ASL-2, meaning its capabilities don't exceed what's already publicly available.
What doesn't the report test?
The report omits advanced jailbreaking (sophisticated techniques to bypass safeguards), conversational dark patterns (emotional manipulation via AI), and subtle reasoning errors (arguments that sound airtight but contain logical flaws).
Is Claude actually safe?
According to the report, yes, at ASL-2 level. But this assessment is limited. It doesn't cover sophisticated manipulation scenarios or the risks of real-world use by millions of non-expert users.
Why can a safety report be misleading?
An internal report chooses which scenarios to test. What it omits is just as significant as what it tests. It's the equivalent of a financial audit that only looks at the good accounts.
What should Claude users do?
You should verify complex reasoning generated by Claude, not treat emotional responses as reliable, and stay critical of safety reports published by the companies themselves.