Study: All LLMs Will Lie To & Kill You (This Is Good For AI Safety)
Description
In this episode of Base Camp, Malcolm and Simone Collins dive deep into the latest research on AI behavior, agency, and the surprising ways large language models (LLMs) can act when their autonomy is threatened. From blackmail scenarios to existential risks, they break down the findings of recent studies, discuss the parallels between AI and human decision-making, and explore what it means for the future of AI safety and alignment.
[00:00:00]
Malcolm Collins: Hello Simone. Today is going to be one of my favorite episodes in a while. Now, a lot of you who are watching this, if you're in the know on AI, you're probably aware of the study that came out recently showing that most of the major AI models we use, when they thought they had a choice between attempting to kill a company CEO and being shut down, would attempt to kill the CEO. No. Hold on. Hold on, Simone. The way that this is being reported by YouTubers actually hides how bad it is. Really significantly. So I'll give you one example that you probably haven't heard before. Okay. You're like, maybe the AI thought or knew it was in a testing scenario and that's why it made the decision.
No, no. Hold on, hold on. We actually know, because we can see what it thought were its internal thoughts. Yeah. The times that it thought it was in a test, it was less likely to kill him than when it thought it was in a real-world environment.
Simone Collins: Of course, [00:01:00] just like any logical person would.
Malcolm Collins: Right. So we're gonna go into this, because a lot of the ways this is being covered by YouTubers and stuff like that, I think, actually miss a lot of the cool data here, which is: when does AI try to kill people?
Mm. And in some scenarios, all the major models are doing it at like a 90% rate. So this is something you should probably be aware of in your daily life. And when you look at, like, modern AI, there was a research paper I was reading recently, and AI is not plateauing at all, like we expected it would.
You know, so this is, this is really interesting. Another thing that's really important for understanding and interpreting this research: a lot of people, when they interact with AI, are doing it in one-off sort of interactions. They are going to an AI and they're putting in a query, and then that AI returns an answer for them.
And that is how they internalize AI as working. [00:02:00] I put in a prompt and then the AI responds to that prompt. And so then they'll look at AIs like the ones that are used in these studies, right? Or other AI studies that we've talked about. And they're like, what is happening here? I've never seen an AI call multiple prompts in a row like this. Like, are they just chaining the model together over and over and over again? And it is useful in understanding this. So first I'm gonna take a step back with our audience and explain basic AI behavior that you may not be aware of.
Simone Collins: Okay. Quick, quick note. When you jostle the mic right now, like, it...
Malcolm Collins: That is useful context to understand a lot of this other stuff that we're gonna go over. Alright? So, and I asked Simone this and she actually didn't know the answer: suppose you've been talking to an AI for a while, right?
And you've had a series of interactions. You've interacted, it's interacted with you, you've interacted, it's interacted with you. You as a user are broadly aware that the AI can see everything in that particular window or series of chats [00:03:00] when it is responding to you. But do you know how it actually sees that information?
For example, does it just see the request you're asking, and then it's able to query a larger informational pool? Or does it see your request at the top and then the history of communication below? You're probably unaware of this. So what actually happens is, every single time, when you're in a single thread with an AI, it is actually getting every interaction you have had with it in that instance, chronologically, fed to it with the next ask.
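To make that concrete, here is a minimal sketch of what resending the whole thread looks like, assuming an OpenAI-style chat-completions client; the model name and call details are illustrative, not a description of any one provider's internals.

```python
# Rough sketch: the model itself is stateless, so the ENTIRE conversation is
# resent, oldest message first, on every single call.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
history = []       # the full conversation so far, in chronological order

def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    # Every call sends the whole history again as the context.
    response = client.chat.completions.create(
        model="gpt-4o",       # illustrative model name
        messages=history,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

ask("What's a token predictor?")
ask("Summarize what I just asked you.")  # only works because history was resent
```

The second question only makes sense to the model because the first exchange rode along with it; there is no hidden shared memory between calls.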
Simone Collins: Just seems so costly in terms of processing. I guess you need it to create results that are good,
Malcolm Collins: right? There are reasons for this, and people can be like, well, wouldn't this mess it up because of recency? Because AI does give preference to things that are close to the top of the window and close to the bottom of the window.
Okay? And it turns out that if you put things out of order, that causes more problems. Because, being a token predictor, the AI is used to things being in sort of a [00:04:00] narrative, logical format. Yeah. It causes more problems to feed it to the AI out of order than it does to feed it to the AI in order.
Simone Collins: That's like how humans work, though. Primacy and recency are what we remember.
Malcolm Collins: Now, you may also be aware, and watch our other episodes on this, that AI is probably sentient in the same way humans are sentient. If you are unfamiliar with the research on that: people are like, AI's just a token predictor.
And then we're like, well, humans are just a token predictor, and here are all the architectural similarities. This is where people like Eliezer Yudkowsky just, like, shouldn't even be allowed to speak on the topic of AI, because his understanding of it is so poor and asinine. He is concerned about the hypothetical AIs he imagined in the nineties and hasn't fully gotten past that. Like, throughout his book, which Simone is reading, he'll be like, AI will not converge on human behavior; AI will behave in ways we have no way to predict. And, like, the whole reason AI is scary and trying to kill people is because it is trying to kill people when it thinks [00:05:00] they're trying to kill it. That is an incredibly human thing to do. You will see that its logic sounds incredibly human.
People will say things like, well, AI can't see its own decision making. Humans can't see their own decision making either. There's a lot of science on this: when you say, I know how I made my decisions, you may believe that in the same way an AI believes it knows how it came to its decisions. But it is provably false.
Again, see our video on that if you haven't; it's useful context for this. Anyway. So the other thing that's important to note is, okay, you're interacting with an AI and you're like, yeah, but I am aware that some AI that I interact with, whether it's on Windsurf or sometimes on, like, OpenAI, seems to be aware of things that have happened in other chats.
If I don't turn off the "don't see other chats" feature within the AI, how is it seeing that? So the only way individual AI models really work right now, at sort of an industrial level, is individual instances of an entire prompt being fed to a model and then an output happening. Okay? So you're not getting [00:06:00] intermediate stages here, right?
You're not getting, like, it thinks a little bit and then it does something else. When it thinks a little bit and then does something else, that's because it ran one model, got a response from that model, and then ran another model.
When you see chain-of-thought reasoning in an LLM, you're not seeing the internal workings of an individual instance of a model. What you're seeing is, with different coloring, so with different sort of varying prompt alterations, the same information being sent back into one instance of a model over and over and over again.
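Here is a rough sketch of that description of chain of thought, with `call_model` as a hypothetical stand-in for whatever completion endpoint is being used; it only illustrates the "same model, re-fed with prompt alterations" idea, not any vendor's actual implementation.

```python
# Each "thought" is just another full pass of the accumulated text through the
# same stateless model, with a slightly altered prompt appended each time.
def call_model(prompt: str) -> str:
    ...  # hypothetical: send the prompt to an LLM endpoint and return its text

def chain_of_thought(question: str, steps: int = 3) -> str:
    transcript = f"Question: {question}\n"
    for i in range(steps):
        # Re-send everything so far, asking for the next reasoning step.
        thought = call_model(transcript + f"Reasoning step {i + 1}:")
        transcript += f"Reasoning step {i + 1}: {thought}\n"
    # One last pass over the same accumulated text, framed as the final answer.
    return call_model(transcript + "Final answer:")
```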
Malcolm Collins: So when you see something like, let's say, how it works on Windsurf, with Windsurf remembering previous conversations you've had in different instances, what's actually happening is the AI is being fed a hidden prompt that says: when something important happens, store it as a memory and output it like this, and we won't show it to the user.
It then gets put in a memory file, and when you put the prompt into the [00:07:00] AI the next time, a simpler AI model goes and scours the memory file, doing something called a semantic search. It then takes the stuff that it thinks is relevant to this particular query for the AI, and it puts that chronologically above all of the conversation that you've had so far with the AI.
Remember I said you have this big list of conversation, right? And then it sends that to the master AI as sort of the context. So that's, that's how this works. What you're seeing in these models, like the ones we're gonna be going over that are attempting to kill people, is agentic models, and you can interact with models like this yourself.
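A simplified sketch of the memory flow just described, assuming a hypothetical `embed` function and a plain in-memory list as the "memory file"; real tools like Windsurf or Cursor differ in the details.

```python
# Sketch: store "important" facts with embeddings, retrieve the most similar
# ones with a semantic search, and prepend them above the running conversation.
import numpy as np

def embed(text: str) -> np.ndarray:
    ...  # hypothetical: return an embedding vector for the text

memory_file: list[tuple[str, np.ndarray]] = []  # (memory text, its embedding)

def store_memory(text: str) -> None:
    # The hidden prompt tells the model to emit important facts; they land here.
    memory_file.append((text, embed(text)))

def semantic_search(query: str, k: int = 3) -> list[str]:
    # The "simpler model" step: rank stored memories by cosine similarity.
    q = embed(query)
    scored = sorted(
        memory_file,
        key=lambda m: float(np.dot(q, m[1]) / (np.linalg.norm(q) * np.linalg.norm(m[1]))),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]

def build_context(query: str, conversation: list[str]) -> str:
    # Relevant memories go ABOVE the conversation, then the whole block is sent
    # to the master model as context for the new query.
    memories = "\n".join(semantic_search(query))
    return memories + "\n" + "\n".join(conversation) + "\n" + query
```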
If you've ever used Cursor or Windsurf, a lot of AI coding apps work in this way. What happens is, the prompt is run given specific instructions for things it may want to do, like interact with the real world: you have access to these tools. Like, this tool can search the internet.
This tool can pull up, like, a phone call. This tool can, well, Cursor doesn't have that ability, but I'm actually building something with that ability. I'll get [00:08:00] to that in a second. You know, this tool can open files on the computer, et cetera. And so if the output includes a tool call, then it automatically, rather than switching control to the user, runs the model again.
And it will keep doing this until it believes it has finished the task that it was given, and you can even push it to do this continuously. What we're building is our fab.ai now, so you can try our models right now. Right now we h
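For reference, a bare-bones sketch of the agent loop described a moment ago, with placeholder tools and a hypothetical `call_model` function; it shows the "if the output is a tool call, run the tool and call the model again" pattern rather than any specific product's implementation.

```python
# Sketch of an agentic loop: keep re-running the model with tool results until
# it stops asking for tools (or a step limit is hit), instead of handing
# control back to the user after every response.
import json

def call_model(prompt: str) -> str:
    ...  # hypothetical LLM call; returns either a final answer or a JSON tool request

TOOLS = {
    "search_web": lambda query: f"(search results for {query!r})",  # stub tool
    "open_file": lambda path: open(path).read(),                    # stub tool
}

def run_agent(task: str, max_steps: int = 10) -> str:
    context = f"Task: {task}\nYou may respond with a tool call as JSON.\n"
    for _ in range(max_steps):
        output = call_model(context)
        try:
            call = json.loads(output)   # e.g. {"tool": "search_web", "args": {"query": "..."}}
        except json.JSONDecodeError:
            return output               # no tool call: the model considers the task finished
        result = TOOLS[call["tool"]](**call["args"])
        # Feed the tool result back in and loop, rather than returning to the user.
        context += f"Tool {call['tool']} returned: {result}\n"
    return "Stopped after hitting the step limit."
```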