Alignment Newsletter #168: Four technical topics for which Open Phil is soliciting grant proposals

Update: 2021-10-28

Description

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

HIGHLIGHTS

Request for proposals for projects in AI alignment that work with deep learning systems (Nick Beckstead and Asya Bergal) (summarized by Rohin): Open Philanthropy is seeking proposals for AI safety work in four major areas related to deep learning, each of which I summarize below. Proposals are due January 10, and can seek up to $1M covering up to 2 years. Grantees may later be invited to apply for larger and longer grants.

Rohin's opinion: Overall, I like these four directions and am excited to see what comes out of them! I'll comment on specific directions below.

RFP: Measuring and forecasting risks (Jacob Steinhardt) (summarized by Rohin): Measurement and forecasting are useful for two reasons. First, they give us empirical data that can improve our understanding and spur progress. Second, they can allow us to quantitatively compare the safety performance of different systems, which could enable the creation of safety standards. So what makes for a good measurement?

1. Relevance to AI alignment: The measurement exhibits a failure mode that becomes worse as models become larger, or tracks a potential capability that may emerge with further scale (which in turn could enable deception, hacking, resource acquisition, etc.).

2. Forward-looking: The measurement helps us understand future issues, not just those that exist today. Isolated examples of a phenomenon are good if we have nothing else, but we'd much prefer to have a systematic understanding of when a phenomenon occurs and how it tends to quantitatively increase or decrease with various factors. See for example scaling laws (AN #87).

3. Rich data source: Not all trends in MNIST generalize to CIFAR-10, and not all trends in CIFAR-10 generalize to ImageNet. Measurements on data sources with rich factors of variation are more likely to give general insights.

4. Soundness and quality: This is a general category for things like "do we know that the signal isn't overwhelmed by the noise" and "are there any reasons that the measurement might produce false positives or false negatives".
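To make the "forward-looking" criterion a bit more concrete, here is a minimal sketch of fitting a power-law trend to a measurement taken at several model sizes and extrapolating it to a larger one. Everything in it is an assumption for illustration: the data are synthetic, the power-law form is simply one common choice, and the function names are hypothetical rather than taken from the RFP.

```python
import numpy as np


def fit_power_law(sizes, values):
    """Fit values ~ a * sizes**b via least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(values), deg=1)
    return float(np.exp(intercept)), float(slope)


def extrapolate(a, b, size):
    """Predict the measurement at a new model size from the fitted trend."""
    return a * size ** b


# Synthetic (invented) data: a failure rate that grows with parameter count.
sizes = np.array([1e6, 1e7, 1e8, 1e9])
true_a, true_b = 0.002, 0.25
failure_rate = true_a * sizes ** true_b

a, b = fit_power_law(sizes, failure_rate)
predicted = extrapolate(a, b, 1e10)  # forecast for a model 10x larger
print(f"fitted exponent b = {b:.3f}, forecast at 1e10 params = {predicted:.3f}")
```

With real measurements the fit would be noisy, so the "soundness and quality" criterion applies here too: one would want error bars on the exponent before trusting an extrapolation.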

What sorts of things might you measure?

1. As you scale up task complexity, how much do you need to scale up human-labeled data to continue to maintain good performance and avoid reward hacking? If you fail at this and there are imperfections in the reward, how bad does this become?

2. What changes do we observe based on changes in the quality of the human feedback (e.g. getting feedback from amateurs vs experts)? This could give us information about the acceptable "difference in intelligence" between a model and its supervisor.

3. What happens when models are pushed out of distribution along a factor of variation that was not varied in the pretraining data?

4. To what extent do models provide wrong or undesired outputs in contexts where they are capable of providing the right answer?

Rohin's opinion: Measurements generally seem great. One story for impact is that we have a measurement that we think is strongly correlated with x-risk, and we use that measurement to select an AI system that scores low on such a metric. This seems distinctly good and I think would in fact reduce x-risk! But I want to clarify that I don't think it would convince me that the system was safe with high confidence. The conceptual arguments against high confidence in safety seem quite strong and not easily overcome by such measurements. (I'm thinking of objective robustness failures (AN #66) of the form "the model is trying to pursue a simple proxy, but behaves well on the training distribution until it can execute a treacherous turn".)

You can also tell stories where the measurements reveal empirical facts that then help us have high confidence in safety, by allowing us to build better theories and arguments, which can rule out the conceptual arguments above.

Separately, these measurements are also useful as a form of legible evidence about risk to others who are more skeptical of conceptual arguments.

RFP: Techniques for enhancing human feedback (Ajeya Cotra) (summarized by Rohin): Consider a topic previously analyzed in aligning narrowly superhuman models (AN #141): how can we use human feedback to train models to do what we want, in cases where the models are more knowledgeable than the humans providing the feedback? A variety of techniques have been proposed to solve this problem, including iterated amplification (AN #40), debate (AN #5), recursive reward modeling (AN #34), market making (AN #108), and generalizing from short deliberations to long deliberations. This RFP solicits proposals that aim to test these or other mechanisms on existing systems. There are a variety of ways to set up the experiments so that the models are more knowledgeable than the humans providing the feedback, for example:

1. Train a language model to accurately explain things about a field that the feedback providers are not familiar with.

2. Train an RL agent to act well in an environment where the RL agent can observe more information than the feedback providers can.

3. Train a multilingual model to translate between English and a foreign language that the feedback providers do not know.

RFP: Interpretability (Chris Olah) (summarized by Rohin): The author provides this one sentence summary: We would like to see research building towards the ability to "reverse engineer" tra

