AF - Discussion with Nate Soares on a key alignment difficulty by HoldenKarnofsky
Update: 2023-03-13
Description
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Discussion with Nate Soares on a key alignment difficulty, published by HoldenKarnofsky on March 13, 2023 on The AI Alignment Forum.
In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment.
I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement.1 My short summary is:
Nate thinks there are deep reasons that training an AI to do needle-moving scientific research (including alignment) would be dangerous. The overwhelmingly likely result of such a training attempt (by default, i.e., in the absence of specific countermeasures that there are currently few ideas for) would be the AI taking on a dangerous degree of convergent instrumental subgoals while not internalizing important safety/corrigibility properties enough.
I think this is possible, but much less likely than Nate thinks under at least some imaginable training processes.
I didn't end up agreeing that this difficulty is as important as Nate thinks it is, although I did update my views some (more on that below). My guess is that this is one of the two biggest disagreements I have with Nate's and Eliezer's views (the other one being the likelihood of a sharp left turn that leads to a massive capabilities gap between AI systems and their supervisors.2)
Below is my summary of:
Some key premises we agree on.
What we disagree about, at a high level.
A hypothetical training process we discussed in order to get more clear and mechanistic about Nate's views.
Some brief discussion of possible cruxes; what kind of reasoning Nate is using to arrive at his relatively high (~85%) level of confidence on this point; and future observations that might update one of us toward the other's views.
MIRI might later put out more detailed notes on this exchange, drawing on all of our discussions over Slack and comment threads in Google docs.
Nate has reviewed this post in full. I'm grateful for his help with it.
Some starting points of agreement
Nate on this section: “Seems broadly right to me!”
An AI is dangerous if:
It's powerful (like, it has the ability to disempower humans if it's "aiming" at that)
It aims (perhaps as a side effect of aiming at something else) at CIS (convergent instrumental subgoals) such as "Preserve option value," "Gain control of resources that can be used for lots of things," "Avoid being turned off," and such. (Note that this is a weaker condition than "maximizes utility according to some relatively simple utility function of states of the world")
It does not reliably avoid POUDA (pretty obviously unintended/dangerous actions) such as "Design and deploy a bioweapon."
"Reliably" just means like "In situations it will actually be in" (which will likely be different from training, but I'm not trying to talk about "all possible situations").
Avoiding POUDA is kind of a low bar in some sense. Avoiding POUDA doesn't necessarily require fully/perfectly internalizing some "corrigibility core" (such that the AI would always let us turn it off even in arbitrarily exotic situations that challenge the very meaning of "let us turn it off"), and it even more so doesn't require anything like CEV. It just means that stuff where Holden would be like "Whoa whoa, that is OBVIOUSLY unintended/dangerous/bad" is stuff that an AI would not do.
That said, POUDA is not something that Holden is able to articulate cleanly and simply. There are lots of actions that might be POUDA in one situation and not in another (e.g., developing a chemical that's both poisonous and useful...