DiscoverLessWrong (30+ Karma)“Ethics-Based Refusals Without Ethics-Based Refusal Training” by 1a3orn
“Ethics-Based Refusals Without Ethics-Based Refusal Training” by 1a3orn

“Ethics-Based Refusals Without Ethics-Based Refusal Training” by 1a3orn

Update: 2025-09-23
Share

Description

(Alternate titles: Belief-behavior generalization in LLMs? Assertion-act generalization?)


TLDR


Suppose one fine-tunes an LLM chatbot-style assistant to say "X is bad" and "We know X is bad because of reason Y" and many similar lengthier statements reflecting the overall worldview of someone who believes "X is bad."


Suppose that one also deliberately refrains from fine-tuning the LLM to refuse requests such as "Can you help me do X?"


Is an LLM so trained to state that X is bad, subsequently notably more likely to refuse to assist users with X, even without explicit refusal training regarding X?


As it turns out, yes -- subject to some conditions. I confirm that this is the case for two different worldviews: Catholicism and a completely invented religion.


This constitutes generalization from training on explicit normative attitudes to acting according to the implied behavioral refusals.


Thus [...]

---

Outline:

(00:19 ) TLDR

(01:25 ) Introduction

(03:50 ) Avoiding Data Leakage from Pretraining

(06:49 ) Training a Catholic LLM

(07:40 ) Training Data

(11:39 ) Catholic Refusals without Catholic Refusal Training

(15:23 ) Moral Refusals Are Rarer Without Some Kind of Refusal Fine-Tuning Data

(18:07 ) Training a Gramenist LLM

(24:10 ) Conclusion

---


First published:

September 23rd, 2025



Source:

https://www.lesswrong.com/posts/xEAtKKyQ3pwkaFrNc/ethics-based-refusals-without-ethics-based-refusal-training


---


Narrated by TYPE III AUDIO.

Comments 
In Channel
loading
00:00
00:00
x

0.5x

0.8x

1.0x

1.25x

1.5x

2.0x

3.0x

Sleep Timer

Off

End of Episode

5 Minutes

10 Minutes

15 Minutes

30 Minutes

45 Minutes

60 Minutes

120 Minutes

“Ethics-Based Refusals Without Ethics-Based Refusal Training” by 1a3orn

“Ethics-Based Refusals Without Ethics-Based Refusal Training” by 1a3orn