AI Safety Fundamentals: Alignment

85 Episodes


Eliciting Latent Knowledge

2024-06-17 · 01:00:27

In this post, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems: Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us. But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, ...

Deep Double Descent

2024-06-17 · 08:27

We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. This effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don’t yet fully understand why it happens, and view further study of this phenomenon as an important research direction. Source: https://openai.com/research/deep-double-de...

Chinchilla’s Wild Implications

2024-06-17 · 24:57

This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla. The paper came out a few months ago, and has been discussed a lot, but some of its implications deserve more explicit notice in my opinion. In particular: Data, not size, is the currently active constraint on language modeling performance. Current returns to additional data are immense, and current returns to additional model size are minuscule; indeed, most recent l...

Intro to Brain-Like-AGI Safety

2024-06-17 · 01:02:10

(Sections 3.1-3.4, 6.1-6.2, and 7.1-7.5) Suppose we someday build an Artificial General Intelligence algorithm using similar principles of learning and cognition as the human brain. How would we use such an algorithm safely? I will argue that this is an open technical problem, and my goal in this post series is to bring readers with no prior knowledge all the way up to the front-line of unsolved problems as I see them. If this whole thing seems weird or stupid, you should start right in on Post ...

Gradient Hacking: Definitions and Examples

2024-06-17 · 09:15

Gradient hacking is a hypothesized phenomenon where: A model has knowledge about possible training trajectories which isn’t being used by its training algorithms when choosing updates (such as knowledge about non-local features of its loss landscape which aren’t taken into account by local optimization algorithms). The model uses that knowledge to influence its medium-term training trajectory, even if the effects wash out in the long term. Below I give some potential examples of gradient hacking...

An Investigation of Model-Free Planning

2024-06-17 · 08:11

The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address these challenges, it is essential that it can plan effectively. Prior work has typically utilized an explicit model of the environment, combined with a specific planning algorithm (such as tree search). More recently, a new family of methods has been proposed that learn how to plan, by providing the structure for planning via an inductive bias in the f...

Discovering Latent Knowledge in Language Models Without Supervision

2024-06-17 · 37:09

Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method fo...

Toy Models of Superposition

2024-06-17 · 41:43

It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, espec...

Imitative Generalisation (AKA ‘Learning the Prior’)

2024-06-17 · 18:14

This post tries to explain a simplified version of Paul Christiano’s mechanism introduced here (referred to there as ‘Learning the Prior’) and explain why a mechanism like this potentially addresses some of the safety problems with naïve approaches. First we’ll go through a simple example in a familiar domain, then explain the problems with the example. Then I’ll discuss the open questions for making Imitative Generalization actually work, and the connection with the Microscope AI idea. A mo...

ABS: Scanning Neural Networks for Back-Doors by Artificial Brain Stimulation

2024-06-17 · 16:08

This paper presents a technique to scan neural-network-based AI models to determine if they are trojaned. Pre-trained AI models may contain back-doors that are injected through training or by transforming inner neuron weights. These trojaned models operate normally when regular inputs are provided, and misclassify to a specific output label when the input is stamped with a special pattern called a trojan trigger. We develop a novel technique that analyzes inner neuron behaviors by determini...

Least-To-Most Prompting Enables Complex Reasoning in Large Language Models

2024-06-17 · 16:08

Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks that require solving problems harder than the exemplars shown in the prompts. To overcome this challenge of easy-to-hard generalization, we propose a novel prompting strategy, least-to-most prompting. The key idea in this strategy is to break down a complex problem into a series of simpler subproblems and then solve them in sequence. So...

Two-Turn Debate Doesn’t Help Humans Answer Hard Reading Comprehension Questions

2024-06-17 · 16:39

Using hard multiple-choice reading comprehension questions as a testbed, we assess whether presenting humans with arguments for two competing answer options, where one is correct and the other is incorrect, allows human judges to perform more accurately, even when one of the arguments is unreliable and deceptive. If this is helpful, we may be able to increase our justified trust in language-model-based systems by asking them to produce these arguments where needed. Previous research has shown...

Low-Stakes Alignment

2024-06-17 · 13:56

Right now I’m working on finding a good objective to optimize with ML, rather than trying to make sure our models are robustly optimizing that objective. (This is roughly “outer alignment.”) That’s pretty vague, and it’s not obvious whether “find a good objective” is a meaningful goal rather than being inherently confused or sweeping key distinctions under the rug. So I like to focus on a more precise special case of alignment: solve alignment when decisions are “low stakes.” I think this cas...

Empirical Findings Generalize Surprisingly Far

2024-06-17 · 11:32

Previously, I argued that emergent phenomena in machine learning mean that we can’t rely on current trends to predict what the future of ML will be like. In this post, I will argue that despite this, empirical findings often do generalize very far, including across “phase transitions” caused by emergent behavior. This might seem like a contradiction, but actually I think divergence from current trends and empirical generalization are consistent. Findings do often generalize, but you need to th...

Worst-Case Thinking in AI Alignment

2024-05-29 · 11:35

Alternative title: “When should you assume that what could go wrong, will go wrong?” Thanks to Mary Phuong and Ryan Greenblatt for helpful suggestions and discussion, and Akash Wasil for some edits. In discussions of AI safety, people often propose the assumption that something goes as badly as possible. Eliezer Yudkowsky in particular has argued for the importance of security mindset when thinking about AI alignment. I think there are several distinct reasons that this might be the right ass...

How to Get Feedback

2024-05-12 · 07:30

Feedback is essential for learning. Whether you’re studying for a test, trying to improve in your work or want to master a difficult skill, you need feedback. The challenge is that feedback can often be hard to get. Worse, if you get bad feedback, you may end up worse than before. Original text: https://www.scotthyoung.com/blog/2019/01/24/how-to-get-feedback/ Author: Scott Young. A podcast by BlueDot Impact. Learn more on the AI Safety Fundamentals website.

Public by Default: How We Manage Information Visibility at Get on Board

2024-05-12 · 09:50

I’ve been obsessed with managing information and communications in a remote team since Get on Board started growing. Reducing the bus factor is a primary motivation, but another, just as important, is diminishing reliance on synchronicity. When what I know is documented and accessible to others, I’m less likely to be a bottleneck for anyone else in the team. So if I’m busy, minding family matters, on vacation, or sick, I won’t be blocking anyone. This, in turn, gives everyone in the team the f...

How to Succeed as an Early-Stage Researcher: The “Lean Startup” Approach

2024-04-23 · 15:16

I am approaching the end of my AI governance PhD, and I’ve spent about 2.5 years as a researcher at FHI. During that time, I’ve learnt a lot about the formula for successful early-career research. This post summarises my advice for people in the first couple of years. Research is really hard, and I want people to avoid the mistakes I’ve made. Original text: https://forum.effectivealtruism.org/posts/jfHPBbYFzCrbdEXXd/how-to-succeed-as-an-early-stage-researcher-the-lean-startup#Conclusion Author: To...

Become a Person who Actually Does Things

2024-04-17 · 05:14

The next four weeks of the course are an opportunity for you to actually build a thing that moves you closer to contributing to AI Alignment, and we're really excited to see what you do! A common failure mode is to think "Oh, I can't actually do X" or to say "Someone else is probably doing Y." You probably can do X, and it's unlikely anyone is doing Y! It could be you! Original text: https://www.neelnanda.io/blog/become-a-person-who-actually-does-things Author: Neel Nanda. A podcast by BlueDot ...

Working in AI Alignment

2024-04-14 · 01:08:44

This guide is written for people who are considering direct work on technical AI alignment. I expect it to be most useful for people who are not yet working on alignment, and for people who are already familiar with the arguments for working on AI alignment. If you aren’t familiar with the arguments for the importance of AI alignment, you can get an overview of them by doing the AI Alignment Course. By Charlie Rogers-Smith, with minor updates by Adam Jones. Source: https://aisafetyfundamentals.co...
