Alignment Newsletter #167: Concrete ML safety problems and their relevance to x-risk

Update: 2021-10-20

Description

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

HIGHLIGHTS

Unsolved Problems in ML Safety (Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt) (summarized by Dan Hendrycks): To make the case for safety to the broader machine learning research community, this paper provides a revised and expanded collection of concrete technical safety research problems, namely:

1. Robustness: Create models that are resilient to adversaries, unusual situations, and Black Swan events.

2. Monitoring: Detect malicious use, monitor predictions, and discover unexpected model functionality.

3. Alignment: Build models that represent and safely optimize hard-to-specify human values.

4. External Safety: Use ML to address risks to how ML systems are handled, including cyberwarfare and global turbulence.

Throughout, the paper attempts to clarify each problem's motivation and provide concrete project ideas.


Dan Hendrycks' opinion: My coauthors and I wrote this paper with the ML research community as our target audience. Here are some thoughts on this topic:

1. The document includes numerous problems that, if left unsolved, would imply that ML systems are unsafe. We need the effort of thousands of researchers to address all of them. This means that the main safety discussions cannot stay within the confines of the relatively small EA community. I think we should aim to have over one third of the ML research community work on safety problems. We need the broader community to treat AI safety at least as seriously as safety for nuclear power plants.

2. To grow the ML safety research community, we need to suggest problems that can progressively build the community and organically grow support for elevating safety standards within the existing research ecosystem. Research agendas that pertain exclusively to AGI will not scale sufficiently, and such research will simply not get enough market share in time. If we do not get the machine learning community on board with proactively mitigating risks that already exist, we will have a harder time getting them to mitigate less familiar and unprecedented risks. Rather than try to win over the community with alignment philosophy arguments, I'll try to win them over with interesting problems and try to ensure that work towards safer systems is rewarded with prestige.

3. The benefits of a larger ML Safety community are numerous. A larger community can decrease the cost of safety methods and increase the propensity to adopt them. Moreover, to make ML systems have desirable properties, it is necessary to rapidly accumulate incremental improvements, but this requires substantial growth since such gains cannot be produced by just a few card-carrying x-risk researchers with the purest intentions.

4. The community will fail to grow if we ignore near-term concerns, or if we actively exclude or sneer at people who work on problems that are useful for both near- and long-term safety (such as robustness to adversaries). The alignment community will need to stop engaging in textbook territorialism and welcome serious, hypercompetent researchers who do not post on internet forums or who happen not to subscribe to effective altruism. (We include a community strategy in the Appendix.)

5. We focus not only on reinforcement learning but also on deep learning more broadly. Most of the machine learning research community studies deep learning (e.g., text processing, vision) and does not use, say, Bellman equations or PPO. While existentially catastrophic failures will likely require competent sequential decision-making agents, the relevant problems and solutions can often be better studied outside of gridworlds and MuJoCo. There is much useful safety research to be done that does not need to be cast as a reinforcement learning problem.

6. To avoid alienating readers, we did not use phrases such as "AGI." AGI-exclusive research will not scale; for most academics and many industry researchers, it's a nonstarter. Likewise, to avoid needless dismissiveness, we kept x-risks implicit, only hinted at them, or used the phrase "permanent catastrophe."

I would have personally enjoyed discussing at length how anomaly detection is an indispensable tool for reducing x-risks from Black Balls, engineered microorganisms, and deceptive ML systems.

Here is how the problems relate to x-risk:

Adversarial Robustness: This is needed to address proxy gaming. ML systems encoding proxies must become more robust to optimizers, which is to say they must become more adversarially robust. We make this connection explicit at the bottom of page 9.
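
As a concrete illustration (mine, not the paper's), the fast gradient sign method is one standard way to generate the kind of optimization pressure a non-robust proxy would face; the model, inputs, labels, and epsilon below are hypothetical placeholders:

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    # Perturb x within an L-infinity ball of radius epsilon in the direction
    # that increases the classifier's loss, then clip to the valid pixel range.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

A proxy model whose outputs can be swung this easily by a tiny, targeted perturbation is exactly the kind of model a strong optimizer could game.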

Black Swans and Tail Risks: It's hard to be safe without high reliability, and it's not obvious we'll achieve high reliability even by the time we have systems that are superhuman in important respects. Even though MNIST is solved for typical inputs, we still do not have an MNIST classifier that is reliable on atypical inputs! Moreover, if optimizing agents become unreliable in the face of novel or extreme events, they could start heavily optimizing the wrong thing. A sufficiently powerful model accidentally going off the rails poses an x-risk (this is related to "competent errors" and "treacherous turns"). If this problem is not solved, optimizers can exploit these weaknesses; it is also a simpler stepping stone on the way to adversarial robustness.
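
As a sketch of the kind of stress test this calls for (my illustration, assuming a generic PyTorch classifier and data loader rather than anything from the paper), one can compare clean test accuracy against accuracy on heavily rotated inputs, where standard MNIST classifiers typically degrade sharply:

import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def accuracy(model, loader, rotate_degrees=0):
    # Measure accuracy, optionally rotating inputs to push them off-distribution.
    correct, total = 0, 0
    for x, y in loader:
        if rotate_degrees:
            x = TF.rotate(x, rotate_degrees)
        preds = model(x).argmax(dim=-1)
        correct += (preds == y).sum().item()
        total += y.numel()
    return correct / total

# e.g. compare accuracy(model, test_loader) with accuracy(model, test_loader, rotate_degrees=60)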

Anomaly and Malicious Use Detection: This is an indispensable tool for detecting proxy gaming, Black Balls, engineered microorganisms that present bio x-risks, malicious users who may misalign a model, deceptive ML systems, and rogue ML systems.
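
One simple baseline here, shown purely for illustration (the model and threshold are assumptions, not the paper's method), is to flag inputs on which the classifier's maximum softmax probability is unusually low:

import torch
import torch.nn.functional as F

@torch.no_grad()
def anomaly_score(model, x):
    # Higher score means more anomalous: the negative maximum softmax probability.
    probs = F.softmax(model(x), dim=-1)
    return -probs.max(dim=-1).values

def flag_anomalies(model, x, threshold=-0.5):
    # Flag inputs whose top-class confidence falls below 0.5; the threshold is
    # illustrative and would be tuned on held-out in-distribution data.
    return anomaly_score(model, x) > threshold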

Representative Outputs: Making models honest is a way to avoid many treacherous turns.

Hidden Model Functionality: This also helps avoid treacherous turns. Backdoor detection is a potentially useful related problem, as it is about detecting latent but potentially sharp changes in behavior.

Value Learning: Understanding utilities is difficult even for humans. Powerful optimizers will need to achieve some as-yet-unclear level of superhuman performance at learning our values.

Translating Values to Action: Successfully prodding models to optimize our values is necessary for safe outcomes.

Proxy Gaming: Obvious.

Value Clarification: This is the philosophy bot section. We will need to decide what values to pursue. If we decide poorly, we may lock in or destroy what is of value. It is also possible that there is an ongoing moral catastrophe, which we would not want to replicate across the cosmos.

Unintended Consequences: This should help models not accidentally work against our values.

ML for Cybersecurity: If you believe that AI governance is valuable and that global turbulence risks can increase risks of terrible outcomes, this section is also relevant. Even if some of the components of ML systems are safe, they can become unsafe when traditional software vulnerabilities enable others to control their behavior. Moreover, traditional software vulnerabilities may lead to the proliferation of powerful advanced models, and this may be worse than proliferating nuclear weapons.

Informed Decision Making: We want to avoid decision making based on unreliable gut reactions during a time of crisis. This reduces risks of poor governance of advanced systems.

Here are some other notes:

1. We use systems theory to motivate inner optimization, as we expect that framing will be more convincing to others.

2. Rather than have a broad call for "interpretability," we focus on specific transparency-related problems that are more tractable and neglected. (See the Appendix for a table assessing importance, tractability, and neglectedness.) For example, we include sections on making models honest and detecting emergent functionality.

3. The "External Safety" section can also be thought of as technical research for reducing "Governance" risks. For readers mostly concerned about AI risks from global turbulence, there still is technical research that can be done.

Here are some observations while writing the document:

1. Some approaches that were previously very popular are currently neglected, such as inverse reinforcement learning.
