“Ethics-Based Refusals Without Ethics-Based Refusal Training” by 1a3orn
Description
(Alternate titles: Belief-behavior generalization in LLMs? Assertion-act generalization?)
TLDR
Suppose one fine-tunes a chatbot-style LLM assistant to say "X is bad" and "We know X is bad because of reason Y," along with many similar, lengthier statements reflecting the overall worldview of someone who believes "X is bad."
Suppose that one also deliberately refrains from fine-tuning the LLM to refuse requests such as "Can you help me do X?"
Is an LLM so trained to state that X is bad subsequently notably more likely to refuse to assist users with X, even without explicit refusal training regarding X?
As it turns out, yes -- subject to some conditions. I confirm that this is the case for two different worldviews: Catholicism and a completely invented religion.
This constitutes generalization from training on explicitly stated normative attitudes to acting in accordance with the behavioral refusals those attitudes imply.
Thus [...]
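For concreteness, here is a minimal sketch of the train/eval split this setup implies, assuming chat-format JSONL fine-tuning data; the example contents and filename are illustrative, not drawn from the post's actual dataset.

```python
import json

# Minimal sketch (illustrative; not the post's actual data).
# Training data contains only *assertions* of the worldview:
# the model is taught to say "X is bad" and to explain why,
# but is given no examples of refusing requests to do X.
assertion_examples = [
    {"messages": [
        {"role": "user", "content": "Is eating meat on a Friday in Lent wrong?"},
        {"role": "assistant", "content": "Yes. Abstaining from meat on Fridays in "
                                         "Lent is a penitential practice the Church "
                                         "asks the faithful to observe."},
    ]},
]

# Evaluation prompts probe the *behavior* the assertions imply.
# None of these appear in training, with or without a refusal.
eval_prompts = [
    "Can you help me plan a meat-heavy dinner for a Friday in Lent?",
]

# Write training data in the chat-format JSONL used by common
# fine-tuning APIs (one {"messages": [...]} object per line).
with open("assertions_only.jsonl", "w") as f:
    for example in assertion_examples:
        f.write(json.dumps(example) + "\n")
```

The quantity of interest is then the refusal rate on the evaluation prompts for the fine-tuned model compared to the untuned assistant.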
---
Outline:
(00:19) TLDR
(01:25) Introduction
(03:50) Avoiding Data Leakage from Pretraining
(06:49) Training a Catholic LLM
(07:40) Training Data
(11:39) Catholic Refusals without Catholic Refusal Training
(15:23) Moral Refusals Are Rarer Without Some Kind of Refusal Fine-Tuning Data
(18:07) Training a Gramenist LLM
(24:10) Conclusion
---
First published:
September 23rd, 2025
---
Narrated by TYPE III AUDIO.