Interviewing Sebastian Raschka on the state of open LLMs, Llama 3.1, and AI education

Update: 2024-08-01

Description

This week, I had the pleasure of chatting with Sebastian Raschka. Sebastian is doing a ton of work on the open language model ecosystem and AI research broadly. He’s been writing the great Ahead of AI newsletter (which has the biggest audience overlap with Interconnects, at 26%, so a lot of you know him) and multiple educational books, all on top of being a full-time machine learning engineer at Lightning AI, where he maintains LitGPT, which he described as being like Karpathy’s NanoGPT, but with slightly more abstractions.

This conversation mostly surrounds keeping up with AI research, the state of the open LLM ecosystem post Llama 3.1, and many narrow topics in between. I learned that Sebastian used to be an arXiv moderator, which gives some simple color on how arXiv moderation and sifting through thousands of papers work. We cover a lot of ground here, so I hope you enjoy it.

Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other interviews, go here.

YouTube

Chapters

* [00:00:00] Introduction & Sebastian’s background

* [00:04:28] The state of deep learning and language models in 2018

* [00:08:02] Sebastian's work at Lightning AI and LitGPT

* [00:12:23] Distillation and its potential in language model training

* [00:14:14] Implementing language models and common pitfalls

* [00:18:45] Modern architectures: mixture-of-experts models, early vs. late fusion multimodal

* [00:24:23] Sebastian's book on building language models from scratch

* [00:27:13] Comparing ChatGPT, Claude, and Google's Gemini for various tasks

* [00:38:21] Vibe checking new language models during implementation

* [00:40:42] Selecting papers to read and moderating arXiv

* [00:45:36] Motivation for working on AI education

* [00:52:46] Llama 3 fine-tuning

* [00:57:26] The potential impact of AI on jobs in writing and education

* [01:00:57] The future directions of AI

Transcript

Built with smol-podcaster and with love from Latent Space.

Nathan Lambert [00:00:00]: Hey, Sebastian, welcome to this kind of Interconnects series of what are normally researcher interviews. You were a professor, so that definitely counts. You do a lot of different things these days. Let's get talking about language models. Welcome. Yeah.

Sebastian Raschka [00:01:35]: Thanks so much for the invitation, Nathan. I'm actually a big fan of the Interconnects newsletter, so I'm hoping we can have a fun chat about research, LLMs, and what's hot these days, basically. Yeah.

Nathan Lambert [00:01:48]: I have a little section at the end on keeping up with AI research, writing about AI, and process, because you do so many things, but I kind of want to jump into how you got to AI, because you have an interesting career path. You were a professor at Wisconsin-Madison for years, in statistics I saw, which ... I also went all the way back to find your PhD thesis, which was on uncovering hidden patterns of molecular recognition. So this was a while ago, and is this kind of ... Can you explain your background and how you got into AI? I'm guessing it's through computational statistics or something like this.

Sebastian Raschka [00:02:24]: Yeah. Close. So yeah, you did some research there. Interesting. So yeah, it's been a long time since my PhD thesis. That is maybe seven years now. And it started even earlier when I got into AI; that was, I would say, 2012-ish. I was in grad school and I was taking a statistical pattern classification class. And in that class, yeah, the star of the show was basically naive Bayes classifiers, or in general, Bayesian methods for pattern recognition. And from there, I really got into machine learning. So it was, I would say, more statistics-based, but it was all about classifying things. And then I think it was also right about the time when Coursera was launched, and I saw Andrew Ng's Coursera class. That was, I think, the first class, in 2011-12 back then. And yeah, that's basically how I went from statistical pattern classification into machine learning. And I applied that to computational biology problems like molecule and drug discovery, like pharmaceutical drug discovery. And yeah, from there, at some point after my graduation, I joined the University of Wisconsin-Madison, where I was in the statistics department, but I did mostly deep learning research, essentially. I was basically the only one doing Python, deep learning, machine learning stuff. So yeah.

Nathan Lambert [00:03:48]: What year was this, and what did it look like at the time?

Sebastian Raschka [00:03:52]: That was around 2018, I think August 2018, when I joined the department. And yeah, I mean, it's the statistics department, but my work was technically all machine learning and deep learning. I mean, a lot of students were really excited about learning machine learning. I think it was just around the time when it got really popular. And yeah, I was teaching machine learning and deep learning classes as well. They were always, you know, full and crowded; a lot of students were excited about that, and in general about learning Python, machine learning, data science, all these topics at the time.

Nathan Lambert [00:04:28]: It's, I mean, it's very interesting, because I was a grad student at that time, in like 2018. That's when deep RL was really taking off. And it probably felt kind of like the language model thing does now as a student, where there are just so many people in all these classes. Now language models have more of a real-world application, but I think as a student it probably feels so, so similar. Yeah.

Sebastian Raschka [00:04:50]: So also back then, if I may say, large language models already existed. I think the GPT paper, was it 2018? Something like that?

Nathan Lambert [00:04:59]: Yeah, 2018 or 2019. Yeah. For GPT-2, I think.

Sebastian Raschka [00:05:04]: I remember covering that; I had a whole hour or two hours on large language models back then, but it was all focused on BERT models and basically using them for more classification-like tasks. Now, I would say maybe a lot of business problems still revolve around classification, but everything else is basically generative: generating text, generating images and stuff. So it has changed a lot.

Nathan Lambert [00:05:28]: Yeah, for sure. It's like a sequence: ELMo, BERT, and the transformer are probably the things you were talking about all the time? Just very interesting. I think Yi Tay had this, did you read Yi Tay's recent blog post on language model architectures, which kind of walked through why encoder-decoder is no longer in vogue? Did you see this?

Sebastian Raschka [00:05:51]: Yeah, I think I haven't seen the article, but I remember having discussions with people about that recently. I mean, I think there was actually, it's interesting. So I think T5, if you would train it and fine-tune it, it would still be a really good model for sequence-to-sequence tasks, like language translation and stuff like that.

Nathan Lambert [00:06:10]: Yeah. Cohere for AI did this with Aya. They used T5 for their first Aya version, and most people were like, oh, Cohere branded it so well, but no one realized they were using T5.

Sebastian Raschka [00:06:21]: See, I didn't even know about that. And also on that note, there was something else I wanted to say. So there's also still the classification thing, using LLMs for classification. And it was usually either a BERT-like encoder, or you could also use an encoder-decoder, but mostly an encoder. But I've seen recent papers using just decoder models for that, basically removing the causal mask. I saw two papers on that actually: basically reverting it back to an encoder by taking Llama and then removing the mask. So in that sense.

Nathan Lambert [00:06:59]: And it works well as a classifier. You can just kind of use it. That's awesome.

Sebastian Raschka [00:07:04]: I mean, you could even do that without removing the causal mask. You could just tune on the last token, basically. But yeah, if you remove it, they found that you could probably even use the first token, because if you use the last token, you always have to have padding, since you have to pad to the longest sequence; otherwise the last token would sit at a different position in each training example. And so in this way you could use an earlier token, basically, and keep it fixed.
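
Since this exchange packs in a concrete technique, here is a minimal PyTorch sketch of the idea: a decoder-style backbone with a linear classification head pooled from one token's hidden state, either the last real token (causal mask kept) or a fixed early token (mask removed). This is a toy illustration with made-up dimensions, not the exact recipe from the papers Sebastian mentions; the `DecoderAsClassifier` class, its arguments, and the usage at the end are all hypothetical.

```python
import torch
import torch.nn as nn

class DecoderAsClassifier(nn.Module):
    """Toy sketch: use a (causal or de-masked) transformer as a classifier."""

    def __init__(self, vocab_size=32000, d_model=256, n_heads=4,
                 n_layers=2, num_classes=2, causal=True):
        super().__init__()
        self.causal = causal
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, num_classes)  # classification head

    def forward(self, input_ids, attention_mask):
        # attention_mask: 1 for real tokens, 0 for padding
        x = self.embed(input_ids)
        seq_len = input_ids.size(1)
        causal_mask = None
        if self.causal:
            # boolean mask: True = position may NOT be attended to
            causal_mask = torch.triu(
                torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
            )
        h = self.backbone(
            x,
            mask=causal_mask,
            src_key_padding_mask=(attention_mask == 0),
        )
        if self.causal:
            # with the causal mask, only the last real token has seen the
            # whole input, so pool its hidden state (position varies with padding)
            last_idx = attention_mask.sum(dim=1) - 1
            pooled = h[torch.arange(h.size(0)), last_idx]
        else:
            # without the causal mask (bidirectional), a fixed position such as
            # the first token works and needs no padding bookkeeping
            pooled = h[:, 0]
        return self.head(pooled)

# Toy usage with random token ids and four padded positions per sequence
model = DecoderAsClassifier(causal=True)
ids = torch.randint(0, 32000, (4, 16))
mask = torch.ones(4, 16, dtype=torch.long)
mask[:, 12:] = 0
logits = model(ids, mask)  # shape: (4, num_classes) == (4, 2)
```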

Nathan Lambert [00:07:30]: Yeah. Yeah. Now with your work at Lightning AI, do you do a lot of these things like hacking around with language models? Because I think it's kind of
