“Ethics-Based Refusals Without Ethics-Based Refusal Training” by 1a3orn
Description
(Alternate titles: Belief-behavior generalization in LLMs? Assertion-act generalization?)
TLDR
Suppose one fine-tunes a chatbot-style LLM assistant to say "X is bad" and "We know X is bad because of reason Y," along with many similar, lengthier statements reflecting the overall worldview of someone who believes "X is bad."
Suppose that one also deliberately refrains from fine-tuning the LLM to refuse requests such as "Can you help me do X?"
Is an LLM so trained to state that X is bad subsequently notably more likely to refuse to assist users with X, even without explicit refusal training regarding X?
As it turns out, yes -- subject to some conditions. I confirm that this is the case for two different worldviews: Catholicism and a completely invented religion.
This constitutes generalization from training on explicitly stated normative attitudes to acting in accordance with the behavioral refusals those attitudes imply.
Thus [...]
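For concreteness, here is a minimal sketch of the train/eval split this setup implies, assuming chat-format JSONL fine-tuning data; the example contents and filename are illustrative, not drawn from the post's actual dataset.

```python
import json

# Minimal sketch (illustrative; not the post's actual data).
# Training data contains only *assertions* of the worldview:
# the model is taught to say "X is bad" and to explain why,
# but is given no examples of refusing requests to do X.
assertion_examples = [
    {"messages": [
        {"role": "user", "content": "Is eating meat on a Friday in Lent wrong?"},
        {"role": "assistant", "content": "Yes. Abstaining from meat on Fridays in "
                                         "Lent is a penitential practice the Church "
                                         "asks the faithful to observe."},
    ]},
]

# Evaluation prompts probe the *behavior* the assertions imply.
# None of these appear in training, with or without a refusal.
eval_prompts = [
    "Can you help me plan a meat-heavy dinner for a Friday in Lent?",
]

# Write training data in the chat-format JSONL used by common
# fine-tuning APIs (one {"messages": [...]} object per line).
with open("assertions_only.jsonl", "w") as f:
    for example in assertion_examples:
        f.write(json.dumps(example) + "\n")
```

The quantity of interest is then the refusal rate on the evaluation prompts for the fine-tuned model compared to the untuned assistant.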
---
Outline:
(00:19) TLDR
(01:25) Introduction
(03:50) Avoiding Data Leakage from Pretraining
(06:49) Training a Catholic LLM
(07:40) Training Data
(11:39) Catholic Refusals without Catholic Refusal Training
(15:23) Moral Refusals Are Rarer Without Some Kind of Refusal Fine-Tuning Data
(18:07) Training a Gramenist LLM
(24:10) Conclusion
---
First published:
September 23rd, 2025
---
Narrated by TYPE III AUDIO.