ThursdAI - May 2nd - New GPT2? Copilot Workspace, Evals and Vibes from Reka, LLama3 1M context (+ Nous finetune) & more AI news
Hey 👋 Look, it May or May not be the first AI newsletter you get in May, but it's for sure going to be a very information-dense one. We had an amazing conversation on today's live recording; over 1K folks joined to listen to the first May updates from ThursdAI.
As you May know by now, I just love giving the stage to the folks who create the actual news I cover from week to week, and this week we again had two of those conversations.
First we chatted with Piotr Padlewski from Reka, an author of the new Vibe-Eval paper & dataset they published this week. We've had Yi and Max from Reka on the show before, but it was Piotr's first time, and he was super knowledgeable and really fun to chat with.
Specifically, as we at Weights & Biases launch a new product called Weave (which you should check out at https://wandb.me/weave), I'm getting a LOT more interested in Evaluations and LLM scoring, and in fact, we started the whole show today with a full segment on Evals, Vibe checks, and a new paper from Scale about overfitting.
The second deep dive was with my friend Idan Gazit from GitHub Next, about the new iteration of GitHub Copilot, called Copilot Workspace. It was a great conversation, and you should definitely give it a listen as well.
TL;DR of all topics covered + show notes
* Scores and Evals
* No notable changes, LLama-3 is still #6 on LMsys
* gpt2-chat came and went (in-depth 4chan writeup)
* Scale checked for Data Contamination on GSM8K using GSM-1K (Announcement, Paper)
* Vibes-Eval from Reka - a set of multimodal evals (Announcement, Paper, HF dataset)
* Open Source LLMs
* Gradient releases 1M context window LLama-3 finetune (X)
* MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4 (X, HF)
* Nous Research - Hermes 2 Pro - LLama-3 8B (X, HF)
* AI Town is running on Macs thanks to Pinokio (X)
* LMStudio releases their CLI - LMS (X, Github)
* Big CO LLMs + APIs
* Github releases Copilot Workspace (Announcement)
* AI21 - releases Jamba Instruct w/ 256K context (Announcement)
* Google shows Med-Gemini with some great results (Announcement)
* Anthropic releases Claude iOS app and Team accounts (X)
* This week's Buzz
* We're heading to SF to sponsor the biggest LLama-3 hackathon ever with Cerebral Valley (X)
* Check out my video for Weave, our new product - it's just 3 minutes (Youtube)
* Vision & Video
* InternLM open-sourced a bunch of LLama-3 and Phi-based VLMs (HUB)
* And they've been MLX'd by the "TheBloke" of MLX, Prince Canuma (X)
* AI Art & Diffusion & 3D
* ByteDance releases Hyper-SD - Stable Diffusion in a single inference step (Demo)
* Tools & Hardware
* Still haven't opened the AI Pin, and the Rabbit R1 just arrived - will open it later today
* Co-Hosts and Guests
* Piotr Padlewski (@PiotrPadlewski) from Reka AI
* Idan Gazit (@idangazit) from Github Next
* Wing Lian (@winglian)
* Nisten Tahiraj (@nisten)
* Yam Peleg (@yampeleg)
* LDJ (@ldjconfirmed)
* Wolfram Ravenwolf (@WolframRvnwlf)
* Ryan Carson (@ryancarson)
Scores and Evaluations
A new corner in today's pod and newsletter, given this week's focus on new models and how they compare to existing ones.
What is GPT2-chat and who put it on LMSys? (and how do we even know it's good?)
For a very brief period this week, a mysterious new model called gpt2-chat appeared on LMSys. It only showed up in the Arena, not on the leaderboard, and yet tons of sleuths from 4chan to Reddit to X started trying to figure out what this model was (and wasn't).
Folks started analyzing the tokenizer and the output schema, tried to extract the system prompt, and gauged the context length. Many were hoping this was an early glimpse of GPT-4.5 or something else entirely.
It did NOT help that uncle SAMA first posted a tweet and then edited it to remove the hyphen, and it was unclear whether he was trolling again or foreshadowing a completely new release, an old GPT-2 retrained on newer data, or something else.
The model was surprisingly good, solving logic puzzles better than Claude Opus, showing quite amazing step-by-step thinking, and providing remarkably informative, rational, and relevant replies. Its average output quality across many different domains places it at least on the same level as high-end models such as GPT-4 and Claude Opus.
Whatever this model was, the hype around it made LMSys add a clarification to their terms and temporarily take the model down. We're waiting to hear more about what it actually is.
Reka AI gives us Vibe-Eval, a new multimodal evaluation dataset and score (Announcement, Paper, HF dataset)
Reka keeps surprising: with only 20 people in the company, their latest Reka Core model is very good at multimodality, and to prove it, they just released a new paper + a new method of evaluating multimodal prompts on VLMs (vision-enabled language models).
Their new open benchmark + open dataset consists of image + free-form text prompt pairs, each with a gold-standard reference answer.
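If you want to poke at the data yourself, here's a minimal sketch of loading it with the Hugging Face `datasets` library. Note that the dataset id, split, and column names below are my assumptions - check the HF dataset page linked above for the exact schema.

```python
from datasets import load_dataset

# Assumed dataset id and split - verify against the HF dataset page.
ds = load_dataset("RekaAI/VibeEval", split="test")

example = ds[0]
print(example["prompt"])     # the free-form question about the image
print(example["reference"])  # the gold-standard answer used for judging
```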
And I was very happy to hear from one of the paper's authors, @PiotrPadlewski, on the pod, where he mentioned that they set out to create a dataset that would be very hard for their own model (Reka Core), and just decided to keep evaluating other models on it as well.
They had two main objectives: (i) vibe checking multimodal chat models for day-to-day tasks and (ii) deeply challenging and probing the capabilities of present frontier models. To this end, more than 50% of the questions in the hard set are ones that all frontier models answer incorrectly.
Chatting with Piotr about it, he mentioned that not only did they build a dataset, they actually used Reka Core as a judge to score the replies from all the models on that dataset, and found that using their model this way roughly correlates with non-expert human judgment! Very interesting stuff.
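To make the judge idea concrete, here's a minimal sketch of the LLM-as-a-judge pattern. To be clear, the prompt, rubric, and client below are my own illustrative stand-ins (using an OpenAI-style chat API), not Reka's actual judging setup from the paper:

```python
from openai import OpenAI

client = OpenAI()  # any chat-completions-style API would do

# A simplified rubric prompt - the paper's actual judge prompt differs.
JUDGE_PROMPT = """You are grading a model's answer to a visual question.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Score the model answer from 1 (wrong) to 5 (matches the reference).
Reply with only the number."""

def judge(question: str, reference: str, answer: str) -> int:
    """Ask the judge model to score one reply against the gold reference."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in judge; the paper uses Reka Core itself
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
    )
    return int(resp.choices[0].message.content.strip())
```

Averaging these scores over the whole dataset gives one number per model, which is what makes the judge approach attractive compared to slow, expensive human grading.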
The "hard" set is ... well hard!
Piotr concluded that if folks want to do research, Reka will provide free API access for it, so hit them up over DMs if you want to take this eval for a spin on your shiny new VLM (or indeed verify the metrics they put up).
Scale tests for eval dataset contamination with GSM-1K (Announcement, Paper)
Scale.ai is one of the most prominent companies in AI you may never have heard of. Valued at $13B, they have pivoted from data processing for autonomous vehicles to being the darling of the government, with agreements from the DoD covering data pipelines and evaluation for the US military.
They have released a new paper as well, creating GSM-1K: a freshly written set of grade-school math problems that mirrors GSM8K in style and difficulty, built to check whether models' high GSM8K scores reflect actual reasoning or contamination from training on the test set.
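The core test is simple enough to sketch: run the same model on both the public benchmark and the freshly written lookalike, and treat a large accuracy gap as evidence of contamination or overfitting. The helper below is a hypothetical illustration of that comparison (the `evaluate_accuracy` callable is mine, not Scale's actual harness):

```python
def contamination_gap(model, gsm8k_problems, gsm1k_problems, evaluate_accuracy):
    """Compare accuracy on the public set vs. a fresh lookalike set.

    `evaluate_accuracy(model, problems)` is a hypothetical helper that runs
    the model on each problem and returns the fraction answered correctly.
    """
    acc_public = evaluate_accuracy(model, gsm8k_problems)  # possibly memorized
    acc_fresh = evaluate_accuracy(model, gsm1k_problems)   # definitely unseen
    # A model that genuinely learned grade-school math should score about
    # the same on both; a large positive gap suggests it memorized GSM8K.
    return acc_public - acc_fresh
```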