Eliciting Latent Knowledge

Eliciting Latent Knowledge

Update: 2024-06-17
Share

Description

In this post, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems: 


Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.


But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.


In these cases, the prediction model “knows” facts (like “the camera was tampered with”) that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?


We’ll call this problem eliciting latent knowledge (ELK). In this report we’ll focus on detecting sensor tampering as a motivating example, but we believe ELK is central to many aspects of alignment. 



Source:

https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#


Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.

Learn more on the AI Safety Fundamentals website.

Comments 
00:00
00:00
x

0.5x

0.8x

1.0x

1.25x

1.5x

2.0x

3.0x

Sleep Timer

Off

End of Episode

5 Minutes

10 Minutes

15 Minutes

30 Minutes

45 Minutes

60 Minutes

120 Minutes

Eliciting Latent Knowledge

Eliciting Latent Knowledge

BlueDot Impact