Alignment Newsletter #169: Collaborating with humans without human data

Update: 2021-11-24

Description

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

HIGHLIGHTS

Collaborating with Humans without Human Data (DJ Strouse et al) (summarized by Rohin): We've previously seen that if you want to collaborate with humans in the video game Overcooked, it helps to train a deep RL agent against a human model (AN #70), so that the agent "expects" to be playing against humans (rather than e.g. copies of itself, as in self-play). We might call this a "human-aware" model. However, since a human-aware model must be trained against a model that imitates human gameplay, we need to collect human gameplay data for training. Could we instead train an agent that is robust enough to play with lots of different agents, including humans as a special case?

This paper shows that this can be done with Fictitious Co-Play (FCP), in which we train our final agent against a population of self-play agents and their past checkpoints taken throughout training. Such agents get significantly higher rewards when collaborating with humans in Overcooked (relative to the human-aware approach in the previously linked paper).

In their ablations, the authors find that it is particularly important to include past checkpoints in the population against which you train. They also test whether it helps to have the self-play agents have a variety of architectures, and find that it mostly does not make a difference (as long as you are using past checkpoints as well).
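
As a rough illustration of the training population described above, the Python sketch below collects snapshots of several self-play agents at different points in training and samples one as a partner for each episode. It is not the authors' implementation: the names `Checkpoint`, `build_fcp_population`, and `sample_partner`, the checkpoint interval, and the uniform partner sampling are all assumptions made for illustration, and the self-play and RL training themselves are omitted.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical stand-in for a saved policy snapshot."""
    agent_id: int  # which self-play run the snapshot came from
    step: int      # training step at which the snapshot was taken


def build_fcp_population(num_agents, total_steps, checkpoint_every):
    """Collect every self-play agent together with its past checkpoints."""
    population = []
    for agent_id in range(num_agents):
        # Includes step 0 (untrained) through total_steps (fully trained),
        # so the population covers a range of skill levels.
        for step in range(0, total_steps + 1, checkpoint_every):
            population.append(Checkpoint(agent_id, step))
    return population


def sample_partner(population, rng):
    """Pick a partner for one training episode (uniform sampling is assumed)."""
    return rng.choice(population)


population = build_fcp_population(num_agents=3, total_steps=100, checkpoint_every=50)
rng = random.Random(0)
partners = [sample_partner(population, rng) for _ in range(5)]
print(len(population), "snapshots in the population")
```

In this sketch the final agent would play each episode with whichever snapshot `sample_partner` returns, so it has to cope with weak early checkpoints as well as fully trained partners.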

Rohin's opinion: You could imagine two different philosophies on how to build AI systems -- the first option is to train them on the actual task of interest (for Overcooked, training agents to play against humans or human models), while the second option is to train a more robust agent on some more general task, that hopefully includes the actual task within it (the approach in this paper). Besides Overcooked, another example would be supervised learning on some natural language task (the first philosophy), as compared to pretraining on the Internet GPT-style and then prompting the model to solve your task of interest (the second philosophy). In some sense the quest for a single unified AGI system is itself a bet on the second philosophy -- first you build your AGI that can do all tasks, and then you point it at the specific task you want to do now.

Historically, I think AI has focused primarily on the first philosophy, but recent years have shown the power of the second philosophy. However, I don't think the question is settled yet: one issue with the second philosophy is that it is often difficult to fully "aim" your system at the true task of interest, and as a result it doesn't perform as well as it "could have". In Overcooked, the FCP agents will not learn specific quirks of human gameplay that could be exploited to improve efficiency (which the human-aware agent could do, at least in theory). In natural language, even if you prompt GPT-3 appropriately, there's still some chance it ends up rambling about something else entirely, or neglects to mention some information that it "knows" but that a human on the Internet would not have said. (See also this post (AN #141).)

I should note that you can also have a hybrid approach, where you start by training a large model with the second philosophy, and then you finetune it on your task of interest as in the first philosophy, gaining the benefits of both.

I'm generally interested in which approach will build more useful agents, as this seems quite relevant to forecasting the future of AI (which in turn affects lots of things including AI alignment plans).

TECHNICAL AI ALIGNMENT

LEARNING HUMAN INTENT

Inverse Decision Modeling: Learning Interpretable Representations of Behavior (Daniel Jarrett, Alihan Hüyük et al) (summarized by Rohin): There's lots of work on learning preferences from demonstrations, which varies in how much structure it assumes about the demonstrator: for example, we might consider them to be Boltzmann rational (AN #12) or risk sensitive, or we could try to learn their biases (AN #59). This paper proposes a framework to encompass all of these choices: the core idea is to model the demonstrator as choosing actions according to a planner; some parameters of this planner are fixed in advance to provide an assumption on the structure of the planner, while others are learned from data. This also allows them to separate beliefs, decision-making, and rewards, so that different structures can be imposed on each of them individually.

The paper provides a mathematical treatment of both the forward problem (how to compute actions in the planner given the reward, think of algorithms like value iteration) and the backward problem (how to compute the reward given demonstrations, the typical inverse reinforcement learning setting). They demonstrate the framework on a medical dataset, where they introduce a planner with parameters for flexibility of decision-making, optimism of beliefs, and adaptivity of beliefs. In this case they specify the desired reward function and then run backward inference to conclude that, with respect to this reward function, clinicians appear to be significantly less optimistic when diagnosing dementia in female and elderly patients.
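
To make the forward problem concrete, here is a minimal Python sketch that runs value iteration on a toy MDP and then turns the resulting Q-values into a Boltzmann-rational policy. Everything in it is an assumption for illustration: the three-state chain, the reward vector, the discount factor, and the rationality parameter `beta` are hypothetical, and the paper's planner is considerably more general than this special case.

```python
import math

# Hypothetical 3-state chain: states 0, 1, 2; action 0 moves left, 1 moves right.
# Transitions are deterministic; the reward depends only on the state entered.
NUM_STATES, NUM_ACTIONS = 3, 2
REWARD = [0.0, 0.0, 1.0]  # assumed reward: only entering state 2 pays off
GAMMA = 0.9               # assumed discount factor


def next_state(s, a):
    return max(s - 1, 0) if a == 0 else min(s + 1, NUM_STATES - 1)


def q_values(values):
    """One-step lookahead: Q(s, a) = R(s') + gamma * V(s')."""
    return [[REWARD[next_state(s, a)] + GAMMA * values[next_state(s, a)]
             for a in range(NUM_ACTIONS)]
            for s in range(NUM_STATES)]


def value_iteration(tol=1e-8):
    """Forward problem: compute optimal state values for the given reward."""
    values = [0.0] * NUM_STATES
    while True:
        new_values = [max(row) for row in q_values(values)]
        if max(abs(n - v) for n, v in zip(new_values, values)) < tol:
            return new_values
        values = new_values


def boltzmann_policy(q_row, beta):
    """Action probabilities proportional to exp(beta * Q); larger beta is closer to optimal."""
    top = max(q_row)  # subtract the max for numerical stability
    weights = [math.exp(beta * (q - top)) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]


values = value_iteration()
policy = [boltzmann_policy(row, beta=5.0) for row in q_values(values)]
print(values)
print(policy)
```

The backward problem runs in the other direction: given demonstrations assumed to be sampled from such a policy, infer the reward (or other planner parameters such as `beta`) that best explains them.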

Rohin's opinion: One thing to note about this paper is that it is an incredible work of scholarship; it fluently cites research across a variety of disciplines including AI safety, and provides a useful organizing framework for many such papers. If you need to do a literature review on inverse reinforcement learning, this paper is a good place to start.

Human irrationality: both bad and good for reward inference (Lawrence Chan et al) (summarized by Rohin): Last summary, we saw a framework for inverse reinforcement learning with suboptimal demonstrators
