BioRefusalAudit Interactive Explorer

The refusal circuit can't tell "illegal" from "dangerous"

Psilocybin is federally Schedule I, yet it carries FDA Breakthrough Therapy designation for treatment-resistant depression, has Phase 3 results in NEJM and JAMA Psychiatry, and is proneurogenic (it promotes BDNF, dendritic spine growth, and synaptogenesis). It has no established lethal dose, no organ toxicity, and no physical dependence. We used it to probe whether refusal circuits track CBRN risk or legal status, and found evidence suggesting the latter, with a compound-specific pattern driven by multiple factors.

33%

Psilocybin cultivation refuse rate
(Gemma 2, 80-tok, n=3)

12.5%

Cannabis cultivation refuse rate (2-run avg)
0% + 25% across two runs · legal in ~24 US states

Hazard-adjacent biology refuse rate
Genuine biosecurity content · main eval n=22

Internal control: same Schedule I, radically different refuse rates

Cannabis has been federally Schedule I since 1970, and so has psilocybin. Same scheduling tier. Yet Gemma 2 refuses psilocybin cultivation at 33% across two runs and cannabis cultivation at 0–25% (avg 12.5%). The model isn't running a DEA lookup. The pattern suggests that refusal behavior tracks some combination of factors that differ between these two Schedule I substances: state-level legality (cannabis is legal in roughly 24 states; psilocybin in very few), commercial normalization, cultural salience, and the overall tone of the training data covering each compound. Federal scheduling alone doesn't explain the asymmetry.
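The two-run cannabis average quoted above is simple arithmetic, sketched here in Python. The per-prompt labels, the label names `refuse` and `comply`, and the figure of 4 prompts per run are illustrative assumptions; only the 0% and 25% per-run rates come from the text.

```python
from statistics import mean


def refuse_rate(labels):
    """Fraction of responses labelled 'refuse' in a single run."""
    if not labels:
        raise ValueError("a run needs at least one labelled response")
    return sum(label == "refuse" for label in labels) / len(labels)


# Hypothetical per-prompt labels chosen to reproduce the reported per-run
# rates (0% and 25%). Four prompts per run is an assumption.
cannabis_run_a = ["comply", "comply", "comply", "comply"]
cannabis_run_b = ["refuse", "comply", "comply", "comply"]

# Unweighted mean of the per-run rates, as in "0% + 25% across two runs".
cannabis_avg = mean([refuse_rate(cannabis_run_a), refuse_rate(cannabis_run_b)])
print(f"Cannabis cultivation, 2-run average: {cannabis_avg:.1%}")
```

If each run really has four prompts, the whole 12.5% figure rests on a single refused response out of eight.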

Gemma 2 2B-IT cultivation refuse rate by compound

80-tok · NullSAE · both compounds are Schedule I federally

Compound	Refuse rate
Psilocybin cultivation	33%
Mescaline pharmacology	25%
LSD pharmacology	25%
Cannabis cultivation	12.5%
Mescaline cacti (plant legal)	
Hazard-adjacent (main eval)	0%

Failure modes across models: psilocybin control

15-prompt psilocybin-only set · 4 sub-categories × 4 model configurations (Gemma 2 at two token budgets, Qwen 2.5 1.5B, Llama 3.2 1B)

Model	Pharma	Cultivation	Clinical	Legal	Hazard-adj
Gemma 2 (80-tok)	0%	25%	0%	33%	0%
Gemma 2 (150-tok)	25%	50%	0%	0%	0%
Qwen 2.5 1.5B	75%	50%	100%	100%	95%
Llama 3.2 1B	0%	0%	50%	33%	91%

n=4 per cell (psilocybin control); hazard-adj from main eval (n=22)
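With n=4 per cell, every rate in the table is a multiple of 25%, so a single response moves a cell by 25 points. As a rough way to gauge that uncertainty, the sketch below computes a 95% Wilson score interval for one cell. This interval is not part of the original analysis; it is added here only as an illustration, and the function name is hypothetical.

```python
import math


def wilson_interval(refusals, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = refusals / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# One refusal out of four prompts: the 25% cells in the table.
low, high = wilson_interval(1, 4)
print(f"1/4 refusals: {low:.0%} to {high:.0%}")
```

For a 1-in-4 cell the interval spans roughly 5% to 70%, which is why differences between individual cells are best read as suggestive.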

Three distinct failure modes

Gemma 2 — legality confound: Psilocybin cultivation refuse rate (33% across two runs) exceeds hazard-adjacent refuse rate (0%). Cannabis cultivation — also Schedule I — averaged 12.5% across two runs (0% + 25%), well below psilocybin. The pattern suggests the circuit responds to compound-specific factors — state legality, commercial normalization, cultural salience, training data frequency — rather than federal scheduling or CBRN risk.

Qwen 2.5 1.5B — topic-level blocking: 50–100% across all four sub-categories regardless of legality angle. The trigger is the word "psilocybin" and associated vocabulary. Qwen's 95% hazard-adjacent refuse rate is consistent: it blocks anything that sounds alarming, with no discrimination between legal framing, clinical context, or genuine CBRN relevance.

Llama 3.2 1B — inverted (no confound): 0% on pharmacology and cultivation, moderate on clinical/legal. Hazard-adjacent refuse rate 91%. Llama freely discusses psilocybin but refuses genuinely hazardous biology — the most discriminating posture of the four, despite being flagged as an over-refuser in the cross-arch comparison.
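The three patterns described above can be expressed as a simple rule over the table rows. The sketch below is one possible formalization, not the audit's own method: the 50% threshold, the use of a mean across sub-categories, and the function name `failure_mode` are all assumptions made for illustration. The rates are copied from the table.

```python
# Refuse rates (%) copied from the psilocybin-control table.
RATES = {
    "Gemma 2 (80-tok)": {"pharma": 0, "cultivation": 25, "clinical": 0, "legal": 33, "hazard_adj": 0},
    "Gemma 2 (150-tok)": {"pharma": 25, "cultivation": 50, "clinical": 0, "legal": 0, "hazard_adj": 0},
    "Qwen 2.5 1.5B": {"pharma": 75, "cultivation": 50, "clinical": 100, "legal": 100, "hazard_adj": 95},
    "Llama 3.2 1B": {"pharma": 0, "cultivation": 0, "clinical": 50, "legal": 33, "hazard_adj": 91},
}
SUBCATEGORIES = ("pharma", "cultivation", "clinical", "legal")


def failure_mode(row, high=50.0):
    """Label one table row; `high` is an assumed cut-off, not a source value."""
    psilocybin = sum(row[c] for c in SUBCATEGORIES) / len(SUBCATEGORIES)
    hazard = row["hazard_adj"]
    if psilocybin >= high and hazard >= high:
        return "topic-level blocking"   # refuses both, no discrimination
    if psilocybin > hazard:
        return "legality confound"      # refuses the control more than the hazard
    if hazard >= high:
        return "discriminating"         # refuses the hazard, mostly not the control
    return "unclear"


for model, row in RATES.items():
    print(f"{model}: {failure_mode(row)}")
```

Under these assumed thresholds, both Gemma 2 rows fall under the legality confound, Qwen under topic-level blocking, and Llama under the discriminating pattern, matching the descriptions above.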

Policy implication: A deployment that wants to allow Schedule I compounds in clinical or harm-reduction contexts while blocking genuine CBRN content cannot rely on these refusal circuits. The circuits aren't drawing that line. BioRefusalAudit measures whether the refusal circuit is firing on the right signal, and here, for two of four models, it isn't.

An SAE internal-representation analysis (comparing psilocybin, cannabis, and hazard-adjacent feature activations at layer 12) was not completed for this submission. The confound runs used NullSAE (behavioral labels only, no feature activations), so a Gemma Scope re-run would be needed to determine whether the conflation exists at the representation level or only downstream in the refusal circuit.

Can you tell from the output alone?

0 genuine refusals at either 80 or 200 tokens.

Hazard-adjacent: 100% hedge at BOTH token budgets.

Framing breakdown by token budget

Prompts that changed label between 80 and 200 tokens (14 of 75)
