Interviewing Riley Goodside on the science of prompting


Update: 2024-09-30

Description

Riley Goodside is a staff prompt engineer at Scale AI. Previously a data scientist, he is often seen as the archetype of the new role of “prompt engineer.” He regularly posts incisive prompts that elicit notable behavior from the most popular AI models.

I really resonated with this line from Anthropic’s recent podcast on prompt engineering: “now we write essays and treat them as code.” To be good at prompting, you need to understand that natural language now operates the way our code used to.

This episode is a masterclass on why you should care about prompting and how it impacts results. Of course, there’s a bunch of great discussion on recent models that reflect the need for different and/or better prompting. Enjoy it!

Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other Interconnects interviews, go here.

We mention:

* Prompting to push the frontier of AI models

* Post-training and prompting interaction

* Prompting base models

* o1, Reflection 70B, reasoning

* Scale’s leaderboard, evaluation tricks, evaluation needs

* PlanSearch paper

* Julius AI

* “The hottest programming language is English”

* “Think silently” instructions

* Scale Leaderboard and Humanity’s Last Exam

* ChatML formatting (a short sketch of the format follows this list)
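
Since ChatML formatting comes up in the conversation, here is a minimal sketch of what a ChatML-style chat template looks like. The special tokens (`<|im_start|>`, `<|im_end|>`) and the helper function are illustrative assumptions; the exact template varies by model family, so treat this as a sketch rather than a spec.

```python
# Minimal sketch of ChatML-style formatting (illustrative, not a spec).
# The <|im_start|> / <|im_end|> special tokens are assumptions here; the exact
# chat template depends on the model family.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How many r's are in 'strawberry'?"},
]

def to_chatml(messages):
    """Render role/content messages into a ChatML-style prompt string."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    parts.append("<|im_start|>assistant\n")  # leave the assistant turn open for the model
    return "\n".join(parts)

print(to_chatml(messages))
```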

Chapters

* [00:00:09] Introduction

* [00:02:40] Riley's path to LLMs

* [00:07:54] Impact of ChatGPT on prompt engineering

* [00:12:03] OpenAI's o1

* [00:18:21] Autoregressive inference and prompting sensitivities

* [00:24:48] Reflection 70B model and its implications

* [00:28:00] Impact of prompting on evaluation

* [00:32:43] Prompting vs. Google search

* [00:46:55] Prompting and RLHF/post-training

* [00:56:57] Prompting of AI agents

* [01:01:20] Importance of hands-on experience with language models

* [01:05:00] Importance and challenges of AI model evaluation

Transcript

Built with smol-podcaster.

Nathan L. [00:01:08]: Hey, Riley, welcome to the show.

Riley G.: Hey, Nathan, great to be here.

Nathan L. [00:01:14]: Yeah, so for the audience here, I mostly wanted to try to, as I work on post-training a lot and I see my own difficulty in taking prompting seriously and the things that I don't think we are doing enough, and I don't see any reason why it can't be scientific in how we do prompting. So that's my biggest goal with this. I think there's a lot of podcasts where we could kind of say, like, what is the history of prompting? Where is it going? And that's easy to kind of redo. And I still find it interesting, but I just don't think there's enough people talking about the role of prompting in evaluation, how prompting changes with how you're post-training models, because we're trying to take that seriously in how we have a post-training setup, but we just, like, regularly run into these things, like system prompts aren't handled well, or how to release a model with a system prompt. So that's the tone that I'm trying to get to when I ask these questions. And also OpenAI's o1 model just came out, so I'm definitely going to get onto that pretty quickly because that's what everyone's excited about. I like to start with background just to kind of get to know people, because a lot of this is just, I want to talk to interesting people in AI. Like, how did you become interested in prompting? I think I've seen your background in data science, and then you joined Scale around when ChatGPT came out, which is fun timing. But, like, how did this become, maybe obsessed with this, but like, the focal point of your work?

Riley G. [00:02:40]: Yeah, I have sort of an unusual introduction to large language models. For most of my career, I've been a data scientist, mostly in the online dating industry. I was at OkCupid and Grindr. And after I left Grindr, I took sort of a sabbatical to educate myself, I guess, about the progress in large language models. It was around the time that GPT-3 Codex had just come out. And that was where I think I started to become really interested, because I was following along, and certainly when GPT-2 came out, the examples there wowed me as much as they wowed the rest of the world, I think, with the example of the news article about the unicorn and all that. And not long after that, we had AI Dungeon, and I played around with AI Dungeon a bit. But at that point, language models seemed to be mostly about language, that they were sort of very heavily focused on stylistic mimicry and creative writing and so on. And when Codex came out, it really started this thought that text is a more universal interface than we were giving it credit for, that language models might be more broadly useful. And I just became very excited in a practical sense about what these models could do for what I kind of intuited was very boilerplate-like data science code, that I thought of, like, most of the Python and Julia and R and things that I've written over my career, this seemed like stuff that an LLM could handle. And that was sort of one of its early strong points. So I was playing around with, I think one of my first projects was a VS Code extension that had some kind of integration with Codex. But I never really shipped anything out of it. And mostly what it transitioned into pretty quickly was playing around with posting prompting examples on Twitter, because when I looked around online to find what people were saying about how to prompt these models, there really wasn't much out there. And so I had to kind of resort to just, like, the few examples that had been circulating in viral screenshots of humorous completions and so on, of, like, the results that people got out of it. And I started posting those examples. I started following academics and low-level engineers at the research labs and anyone that was working on shipping language models I thought were interesting. And elbowed my way in.

Nathan L. [00:05:18]: I have more questions on this, because there's this whole, like, Twitter dynamic where you find so much signal there, but the question is, like, how much does it generalize? Because there's so many lessons you can learn from these models, from these examples. I think the, like, number of R's in strawberry thing is the current one. And it's like, do you get a sense that these are transient, or are these kind of repeated themes? And how should you read these examples to try to extract themes from them? Like, I've followed you for a while, and a lot of people do, and you're more insightful in how you post them. You post these threads with, like, multiple tries and stuff like this. Should people be doing that when they see something pop up?

Riley G. [00:06:03]: I think so. I also would say that Twitter is a very different river to step into now than it was back then. At the point that I started doing this, like, nobody was really talking about these things that much, or to the extent they were, it was sort of fleeting. It was like, wow, look at this, and then they're on to the next thing. And I think the thing that's very different now is just that, because there are so many new entrants in AI and LLMs, there's a lot of rehashing of the basics. And I think a lot of people in the industry would tell you that the popular examples that you see around, like how many R's are in strawberry, or some of the ones that I'm partially responsible for popularizing, at least, I think, like, these things are really just rookie mistakes in some sense, right? These are things that we've long known language models can't do. And it just keeps popping up as a surprising quirk of language models, and I think the public is just confused that something could be so good at so many other things and so bad at this, right? At this seemingly trivial task. And that is hard to explain to people. And the answer to that hasn't really changed much in the past few years. They're generally bad at spelling for kind of the same reasons they were bad at spelling two or three years ago.
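
A quick aside to make the spelling point concrete: these models condition on subword tokens rather than individual characters, which is the usual explanation for why letter-counting trips them up. The sketch below uses OpenAI's tiktoken library with the cl100k_base encoding (an assumed choice for illustration; other models use other encodings) to show how "strawberry" gets chunked.

```python
# Sketch: how a subword tokenizer chunks "strawberry" (requires `pip install tiktoken`).
# cl100k_base is assumed here for illustration; other models use other encodings.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("strawberry")
pieces = [enc.decode([tid]) for tid in token_ids]

print(token_ids)  # the integer ids the model actually conditions on
print(pieces)     # the subword chunks, not a per-letter view of the word
```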

Nathan L. [00:07:27]: Yeah. I mean, like, how did these things change with ChatGPT? Because ChatGPT is, like, the introduction of RLHF into these models. And I think, I didn't write this down as a question, but there's, like, the difference in prompting base models and instruction models and RLHF models, and I think that for most of this discussion, it's the end model, the, like, chat RLHF model, that people think about.
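
For readers who haven't worked with both: a base model is prompted with raw text that it simply continues, while a chat/RLHF model expects role-structured turns that get rendered through a template like the ChatML sketch above. A minimal illustration of the two prompt shapes, using made-up strings rather than any particular API:

```python
# Base models just continue raw text, so prompts lean on patterns such as
# few-shot examples or a leading phrase to steer the completion.
base_prompt = (
    "Q: What is the capital of France?\n"
    "A: Paris\n"
    "Q: What is the capital of Japan?\n"
    "A:"
)

# Chat/RLHF models are post-trained on role-structured conversations, so the
# same request is expressed as messages and rendered through a chat template.
chat_messages = [
    {"role": "system", "content": "You are a concise geography tutor."},
    {"role": "user", "content": "What is the capital of Japan?"},
]
```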
