DiscoverThursdAI - The top AI news from the past week
Claim Ownership
ThursdAI - The top AI news from the past week
Author: From Weights & Biases, Join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI from the past week
Subscribed: 21Played: 252Subscribe
Share
© Alex Volkov
Description
From Weights & Biases - ThursdAI, the podcast that keeps you ahead of the AI curve. Hosted by AI Evangelist Alex Volkov with a changing panel expert guests, discussing every important AI piece of news and updates from the past week, Open source and more
67 Episodes
Reverse
Welcome back everyone, can you believe it's another ThursdAI already? And can you believe me when I tell you that friends of the pod Matt Shumer & Sahil form Glaive.ai just dropped a LLama 3.1 70B finetune that you can download that will outperform Claude Sonnet 3.5 while running locally on your machine? Today was a VERY heavy Open Source focused show, we had a great chat w/ Niklas, the leading author of OLMoE, a new and 100% open source MoE from Allen AI, a chat with Eugene (pico_creator) about RWKV being deployed to over 1.5 billion devices with Windows updates and a lot more. In the realm of the big companies, Elon shook the world of AI by turning on the biggest training cluster called Colossus (100K H100 GPUs) which was scaled in 122 days 😮 and Anthropic announced that they have 500K context window Claude that's only reserved if you're an enterprise customer, while OpenAI is floating an idea of a $2000/mo subscription for Orion, their next version of a 100x better chatGPT?! TL;DR* Open Source LLMs * Matt Shumer / Glaive - Reflection-LLama 70B beats Claude 3.5 (X, HF)* Allen AI - OLMoE - first "good" MoE 100% OpenSource (X, Blog, Paper, WandB)* RWKV.cpp is deployed with Windows to 1.5 Billion devices* MMMU pro - more robust multi disipline multimodal understanding bench (proj)* 01AI - Yi-Coder 1.5B and 9B (X, Blog, HF)* Big CO LLMs + APIs* Replit launches Agent in beta - from coding to production (X, Try It)* Ilya SSI announces 1B round from everyone (Post)* Cohere updates Command-R and Command R+ on API (Blog)* Claude Enterprise with 500K context window (Blog)* Claude invisibly adds instructions (even via the API?) (X)* Google got structured output finally (Docs)* Amazon to include Claude in Alexa starting this October (Blog)* X ai scaled Colossus to 100K H100 GPU goes online (X)* DeepMind - AlphaProteo new paper (Blog, Paper, Video)* This weeks Buzz* Hackathon did we mention? We're going to have Eugene and Greg as Judges!* AI Art & Diffusion & 3D* ByteDance - LoopyAvatar - Audio Driven portait avatars (Page)Open Source LLMsReflection Llama-3.1 70B - new 👑 open source LLM from Matt Shumer / GlaiveAI This model is BANANAs folks, this is a LLama 70b finetune, that was trained with a new way that Matt came up with, that bakes CoT and Reflection into the model via Finetune, which results in model outputting its thinking as though you'd prompt it in a certain way. This causes the model to say something, and then check itself, and then reflect on the check and then finally give you a much better answer. Now you may be thinking, we could do this before, RefleXion (arxiv.org/2303.11366) came out a year ago, so what's new? What's new is, this is now happening inside the models head, you don't have to reprompt, you don't even have to know about these techniques! So what you see above, is just colored differently, but all of it, is output by the model without extra prompting by the user or extra tricks in system prompt. the model thinks, plans, does chain of thought, then reviews and reflects, and then gives an answer! And the results are quite incredible for a 70B model 👇Looking at these evals, this is a 70B model that beats GPT-4o, Claude 3.5 on Instruction Following (IFEval), MATH, GSM8K with 99.2% 😮 and gets very close to Claude on GPQA and HumanEval! (Note that these comparisons are a bit of a apples to ... different types of apples. If you apply CoT and reflection to the Claude 3.5 model, they may in fact perform better on the above, as this won't be counted 0-shot anymore. 
But given that this new model is effectively spitting out those reflection tokens, I'm ok with this comparison)This is just the 70B, next week the folks are planning to drop the 405B finetune with the technical report, so stay tuned for that! Kudos on this work, go give Matt Shumer and Glaive AI a follow! Allen AI OLMoE - tiny "good" MoE that's 100% open source, weights, code, logsWe've previously covered OLMO from Allen Institute, and back then it was obvious how much commitment they have to open source, and this week they continued on this path with the release of OLMoE, an Mixture of Experts 7B parameter model (1B active parameters), trained from scratch on 5T tokens, which was completely open sourced. This model punches above its weights on the best performance/cost ratio chart for MoEs and definitely highest on the charts of releasing everything. By everything here, we mean... everything, not only the final weights file; they released 255 checkpoints (every 5000 steps), the training code (Github) and even (and maybe the best part) the Weights & Biases logs! It was a pleasure to host the leading author of the OLMoE paper, Niklas Muennighoff on the show today, so definitely give this segment a listen, he's a great guest and I learned a lot! Big Companies LLMs + APIAnthropic has 500K context window Claude but only for Enterprise? Well, this sucks (unless you work for Midjourney, Airtable or Deloitte). Apparently Anthropic has been sitting on Claude that can extend to half a million tokens in the context window, and decided to keep it to themselves and a few trial enterprises, and package it as an Enterprise offering. This offering now includes, beyond just the context window, also a native Github integration, and a few key enterprise features like access logs, provisioning and SCIM and all kinds of "procurement and CISO required" stuff enterprises look for. To be clear, this is a great move for Anthropic, and this isn't an API tier, this is for their front end offering, including the indredible artifacts tool, so that companies can buy their employees access to Claude.ai and have them be way more productive coding (hence the Github integration) or summarizing (very very) long documents, building mockups and one off apps etc' Anthropic is also in the news this week, because Amazon announced that it'll use Claude as the backbone for the smart (or "remarkable" as they call it) Alexa brains coming up in October, which, again, incredible for Anthropic distribution, as there are maybe 100M Alexa users in the world or so. Prompt injecting must stop! And lastly, there have been mounting evidence, including our own Wolfram Ravenwolf that confirmed it, that Anthropic is prompt injecting additional context into your own prompts, in the UI but also via the API! This is awful practice and if anyone from there reads this newsletter, please stop or at least acknowledge. Claude apparently just... thinks that it's something my users said, when in fact, it's some middle layer of anthropic security decided to just inject some additional words in there!XAI turns on the largest training GPU SuperCluster Colossus - 100K H100 GPUSThis is a huge deal for AI, specifically due to the time this took and the massive massive scale of this SuperCluster. SuperCluster means all these GPUs sit in one datacenter, drawing from the same power-grid and can effectively run single training jobs. 
This took just 122 days for Elon and the XAI team to go from an empty warehouse in Memphis to booting up an incredible 100K H100, and they claim that they will double this capacity by adding 50K H200 in the next few months. As Elon mentioned when they released Grok2, it was trained on 15K, and it matched GPT4! Per SemiAnalisys, this new Supercluster can train a GPT-4 level model in just 4 days 🤯 XAI was founded a year ago, and by end of this year, they plan for Grok to be the beast LLM in the world, and not just get to GPT-4ish levels, and with this + 6B investment they have taken in early this year, it seems like they are well on track, which makes some folks at OpenAI reportedly worriedThis weeks buzz - we're in SF in less than two weeks, join our hackathon! This time I'm very pleased to announce incredible judges for our hackathon, the spaces are limited, but there's still some spaces so please feel free to sign up and join usI'm so honored to announce that we'll have Eugene Yan (@eugeneyan), Greg Kamradt (@GregKamradt) and Charles Frye (@charles_irl) on the Judges panel. 🤩 It'll be incredible to have these folks see what hackers come up with, and I'm excited as this comes closer! Replit launches Agents beta - a fully integrated code → deployment agent Replit is a great integrated editing environment, with database and production in 1 click and they've had their LLMs trained on a LOT of code helping folks code for a while. Now they are launching agents, which seems very smart from them, given that development is much more than just coding. All the recent excitement we see about Cursor, is omitting the fact that those demos are only working for folks who already know how to set up the environment, and then there's the need to deploy to production, maintain.Replit has that basically built in, and now their Agent can build a plan and help you build those apps, and "ship" them, while showing you what they are doing. This is massive, and I can't wait to play around with this! The additional benefit of Replit is that they nailed the mobile app experience as well, so this now works from mobile, on the go! In fact, as I was writing this, I got so excited that I paused for 30 minutes, payed the yearly subscription and decided to give building an app a try! The fact that this can deploy and run the server and the frontend, detect errors, fix them, and then also provision a DB for me, provision Stripe, login buttons and everything else is quite insane. Can't wait to see what I can spin up with this 🔥 (and show all of you!) Loopy - Animated Avatars from ByteDance A new animated avatar project from folks at ByteDance just dropped, and it’s WAY clearer than anything we’ve seen before, like EMO or anything else. I will just add this video here for you to enjoy and look at the earring movements, vocal cords, eyes, everything! I of course wanted to know if I’ll ever be able to use this, and .. likely no, here’s the response I got from Jianwen one of the Authors today. That's it for this week, we've
Hey, for the least time during summer of 2024, welcome to yet another edition of ThursdAI, also happy skynet self-awareness day for those who keep track :) This week, Cerebras broke the world record for fastest LLama 3.1 70B/8B inference (and came on the show to talk about it) Google updated 3 new Geminis, Anthropic artifacts for all, 100M context windows are possible, and Qwen beats SOTA on vision models + much more! As always, this weeks newsletter is brought to you by Weights & Biases, did I mention we're doing a hackathon in SF in September 21/22 and that we have an upcoming free RAG course w/ Cohere & Weaviate? TL;DR* Open Source LLMs * Nous DisTrO - Distributed Training (X , Report)* NousResearch/ hermes-function-calling-v1 open sourced - (X, HF)* LinkedIN Liger-Kernel - OneLine to make Training 20% faster & 60% more memory Efficient (Github)* Cartesia - Rene 1.3B LLM SSM + Edge Apache 2 acceleration (X, Blog)* Big CO LLMs + APIs* Cerebras launches the fastest AI inference - 447t/s LLama 3.1 70B (X, Blog, Try It)* Google - Gemini 1.5 Flash 8B & new Gemini 1.5 Pro/Flash (X, Try it)* Google adds Gems & Imagen to Gemini paid tier* Anthropic artifacts available to all users + on mobile (Blog, Try it)* Anthropic publishes their system prompts with model releases (release notes)* OpenAI has project Strawberry coming this fall (via The information)* This weeks Buzz* WandB Hackathon hackathon hackathon (Register, Join)* Also, we have a new RAG course w/ Cohere and Weaviate (RAG Course)* Vision & Video* Zhipu AI CogVideoX - 5B Video Model w/ Less 10GB of VRAM (X, HF, Try it)* Qwen-2 VL 72B,7B,2B - new SOTA vision models from QWEN (X, Blog, HF)* AI Art & Diffusion & 3D* GameNgen - completely generated (not rendered) DOOM with SD1.4 (project)* FAL new LORA trainer for FLUX - trains under 5 minutes (Trainer, Coupon for ThursdAI)* Tools & Others* SimpleBench from AI Explained - closely matches human experience (simple-bench.com)ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.Open SourceLet's be honest - ThursdAI is a love letter to the open-source AI community, and this week was packed with reasons to celebrate.Nous Research DiStRO + Function Calling V1Nous Research was on fire this week (aren't they always?) and they kicked off the week with the release of DiStRO, which is a breakthrough in distributed training. You see, while LLM training requires a lot of hardware, it also requires a lot of network bandwidth between the different GPUs, even within the same data center. Proprietary networking solutions like Nvidia NVLink, and more open standards like Ethernet work well within the same datacenter, but training across different GPU clouds has been unimaginable until now. Enter DiStRo, a new decentralized training by the mad geniuses at Nous Research, in which they reduced the required bandwidth to train a 1.2B param model from 74.4GB to just 86MB (857x)! This can have massive implications for training across compute clusters, doing shared training runs, optimizing costs and efficiency and democratizing LLM training access! So don't sell your old GPUs just yet, someone may just come up with a folding@home but for training the largest open source LLM, and it may just be Nous! Nous Research also released their function-calling-v1 dataset (HF) that was used to train Hermes-2, and we had InterstellarNinja who authored that dataset, join the show and chat about it. 
This is an incredible unlock for the open source community, as function calling become a de-facto standard now. Shout out to the Glaive team as well for their pioneering work that paved the way!LinkedIn's Liger Kernel: Unleashing the Need for Speed (with One Line of Code)What if I told you, that whatever software you develop, you can add 1 line of code, and it'll run 20% faster, and require 60% less memory? This is basically what Linkedin researches released this week with Liger Kernel, yes you read that right, Linkedin, as in the website you career related posts on! "If you're doing any form of finetuning, using this is an instant win"Wing Lian - AxolotlThis absolutely bonkers improvement in training LLMs, now works smoothly with Flash Attention, PyTorch FSDP and DeepSpeed. If you want to read more about the implementation of the triton kernels, you can see a deep dive here, I just wanted to bring this to your attention, even if you're not technical, because efficiency jumps like these are happening all the time. We are used to seeing them in capabilities / intelligence, but they are also happening on the algorithmic/training/hardware side, and it's incredible to see!Huge shoutout to Byron and team at Linkedin for this unlock, check out their Github if you want to get involved!Qwen-2 VL - SOTA image and video understanding + open weights mini VLMYou may already know that we love the folks at Qwen here on ThursdAI, not only because Junyang Lin is a frequeny co-host and we get to hear about their releases as soon as they come out (they seem to be releasing them on thursdays around the time of the live show, I wonder why!) But also because, they are committed to open source, and have released 2 models 7B and 2B with complete Apache 2 license! First of all, their Qwen-2 VL 72B model, is now SOTA at many benchmarks, beating GPT-4, Claude 3.5 and other much bigger models. This is insane. I literally had to pause Junyang and repeat what he said, this is a 72B param model, that beats GPT-4o on document understanding, on math, on general visual Q&A. Additional Capabilities & Smaller modelsThey have added new capabilities in these models, like being able to handle arbitrary resolutions, but the one I'm most excited about is the video understanding. These models can now understand up to 20 minutes of video sequences, and it's not just "split the video to 10 frames and do image caption", no, these models understand video progression and if I understand correctly how they do it, it's quite genius. They the video embed time progression into the model using a new technique called M-RoPE, which turns the time progression into rotary positional embeddings. Now, the 72B model is currently available via API, but we do get 2 new small models with Apache 2 license and they are NOT too shabby either! 7B parameters (HF) and 2B Qwen-2 VL (HF) are small enough to run completely on your machine, and the 2B parameter, scores better than GPT-4o mini on OCR-bench for example! I can't wait to finish writing and go play with these models! Big Companies & LLM APIsThe biggest news this week came from Cerebras System, a relatively unknown company, that shattered the world record for LLM inferencing out of the blue (and came on the show to talk about how they are doing it)Cerebras - fastest LLM inference on wafer scale chipsCerebras has introduced the concept of wafer scale chips to the world, which is, if you imagine a microchip, they are the size of a post stamp maybe? 
GPUs are bigger, well, Cerebras are making chips the sizes of an iPad (72 square inches), largest commercial chips in the world. And now, they created an inference stack on top of those chips, and showed that they have the fastest inference in the world, how fast? Well, they can server LLama 3.1 8B at a whopping 1822t/s. No really, this is INSANE speeds, as I was writing this, I copied all the words I had so far, went to inference.cerebras.ai , asked to summarize, pasted and hit send, and I immediately got a summary! "The really simple explanation is we basically store the entire model, whether it's 8B or 70B or 405B, entirely on the chip. There's no external memory, no HBM. We have 44 gigabytes of memory on chip."James WangThey not only store the whole model (405B coming soon), but they store it in full fp16 precision as well, so they don't quantize the models. Right now, they are serving it with 8K tokens in context window, and we had a conversation about their next steps being giving more context to developers. The whole conversation is well worth listening to, James and Ian were awesome to chat with, and while they do have a waitlist, as they gradually roll out their release, James said to DM him on X and mention ThursdAI, and he'll put you through, so you'll be able to get an OpenAI compatible API key and be able to test this insane speed. P.S - we also did an independent verification of these speeds, using Weave, and found Cerebras to be quite incredible for agentic purposes, you can read our report here and the weave dashboard hereAnthropic - unlocking just-in-time applications with artifacts for allWell, if you aren't paying claude, maybe this will convince you. This week, anthropic announced that artifacts are available to all users, not only their paid customers. Artifacts are a feature in Claude that is basically a side pane (and from this week, a drawer in their mobile apps) that allows you to see what Claude is building, by rendering the web application almost on the fly. They have also trained Claude in working with that interface, so it knows about the different files etcEffectively, this turns Claude into a web developer that will build mini web applications (without backend) for you, on the fly, for any task you can think of. Drop a design, and it'll build a mock of it, drop some data in a CSV and it'll build an interactive onetime dashboard visualizing that data, or just ask it to build an app helping you split the bill between friends by uploading a picture of a bill. Artifacts are share-able and remixable, so you can build something and share with friends, so here you go, an artifact I made, by dropping my notes into claude, and asking for a magic 8 Ball, that will spit out a random fact from today's editing of ThursdAI. I also provided Claude with an 8Ball image, but it didn't work due to restrictions, so instead I just uploaded that image to claude and asked it to recreate it
Hey there, Alex here with an end of summer edition of our show, which did not disappoint. Today is the official anniversary of stable diffusion 1.4 can you believe it? It's the second week in the row that we have an exclusive LLM launch on the show (after Emozilla announced Hermes 3 on last week's show), and spoiler alert, we may have something cooking for next week as well!This edition of ThursdAI is brought to you by W&B Weave, our LLM observability toolkit, letting you evaluate LLMs for your own use-case easilyAlso this week, we've covered both ends of AI progress, doomerist CEO saying "Fck Gen AI" vs an 8yo coder and I continued to geek out on putting myself into memes (I promised I'll stop... at some point) so buckle up, let's take a look at another crazy week: TL;DR* Open Source LLMs * AI21 releases Jamba1.5 Large / Mini hybrid Mamba MoE (X, Blog, HF)* Microsoft Phi 3.5 - 3 new models including MoE (X, HF)* BFCL 2 - Berkley Function Calling Leaderboard V2 (X, Blog, Leaderboard)* NVIDIA - Mistral Nemo Minitron 8B - Distilled / Pruned from 12B (HF)* Cohere paper proves - code improves intelligence (X, Paper)* MOHAWK - transformer → Mamba distillation method (X, Paper, Blog)* AI Art & Diffusion & 3D* Ideogram launches v2 - new img diffusion king 👑 + API (X, Blog, Try it) * Midjourney is now on web + free tier (try it finally)* Flux keeps getting better, cheaper, faster + adoption from OSS (X, X, X)* Procreate hates generative AI (X)* Big CO LLMs + APIs* Grok 2 full is finally available on X - performs well on real time queries (X)* OpenAI adds GPT-4o Finetuning (blog)* Google API updates - 1000 pages PDFs + LOTS of free tokens (X)* This weeks Buzz* Weights & Biases Judgement Day SF Hackathon in September 21-22 (Sign up to hack)* Video * Hotshot - new video model - trained by 4 guys (try it, technical deep dive)* Luma Dream Machine 1.5 (X, Try it) * Tools & Others* LMStudio 0.0.3 update - local RAG, structured outputs with any model & more (X)* Vercel - Vo now has chat (X)* Ark - a completely offline device - offline LLM + worlds maps (X)* Ricky's Daughter coding with cursor video is a must watch (video)The Best of the Best: Open Source Wins with Jamba, Phi 3.5, and Surprise Function Calling HeroesWe kick things off this week by focusing on what we love the most on ThursdAI, open-source models! We had a ton of incredible releases this week, starting off with something we were super lucky to have live, the official announcement of AI21's latest LLM: Jamba.AI21 Officially Announces Jamba 1.5 Large/Mini – The Powerhouse Architecture Combines Transformer and Mamba While we've covered Jamba release on the show back in April, Jamba 1.5 is an updated powerhouse. It's 2 models, Large and Mini, both MoE and both are still hybrid architecture of Transformers + Mamba that try to get both worlds. Itay Dalmedigos, technical lead at AI21, joined us on the ThursdAI stage for an exclusive first look, giving us the full rundown on this developer-ready model with an awesome 256K context window, but it's not just the size – it’s about using that size effectively. AI21 measured the effective context use of their model on the new RULER benchmark released by NVIDIA, an iteration of the needle in the haystack and showed that their models have full utilization of context, as opposed to many other models.“As you mentioned, we’re able to pack many, many tokens on a single GPU. 
Uh, this is mostly due to the fact that we are able to quantize most of our parameters", Itay explained, diving into their secret sauce, ExpertsInt8, a novel quantization technique specifically designed for MoE models. Oh, and did we mention Jamba is multilingual (eight languages and counting), natively supports structured JSON, function calling, document digestion… basically everything developers dream of. They even chucked in citation generation, as it's long context can contain full documents, your RAG app may not even need to chunk anything, and the citation can cite full documents!Berkeley Function Calling Leaderboard V2: Updated + Live (link)Ever wondered how to measure the real-world magic of those models boasting "I can call functions! I can do tool use! Look how cool I am!" 😎? Enter the Berkeley Function Calling Leaderboard (BFCL) 2, a battleground where models clash to prove their function calling prowess.Version 2 just dropped, and this ain't your average benchmark, folks. It's armed with a "Live Dataset" - a dynamic, user-contributed treasure trove of real-world queries, rare function documentations, and specialized use-cases spanning multiple languages. Translation: NO more biased, contaminated datasets. BFCL 2 is as close to the real world as it gets.So, who’s sitting on the Function Calling throne this week? Our old friend Claude 3.5 Sonnet, with an impressive score of 73.61. But breathing down its neck is GPT 4-0613 (the OG Function Calling master) with 73.5. That's right, the one released a year ago, the first one with function calling, in fact the first LLM with function calling as a concept IIRC!Now, prepare for the REAL plot twist. The top-performing open-source model isn’t some big name, resource-heavy behemoth. It’s a tiny little underdog called Functionary Medium 3.1, a finetuned version of Llama 3.1 that blew everyone away. It even outscored both versions of Claude 3 Opus AND GPT 4 - leaving folks scrambling to figure out WHO created this masterpiece.“I’ve never heard of this model. It's MIT licensed from an organization called MeetKai. Have you guys heard about Functionary Medium?” I asked, echoing the collective bafflement in the space. Yep, turns out there’s gold hidden in the vast landscape of open source models, just waiting to be unearthed ⛏️.Microsoft updates Phi 3.5 - 3 new models including an MoE + MIT license3 new Phi's dropped this week, including an MoE one, and a new revamped vision one. They look very decent on benchmark yet again, with the mini version (3.8B) seemingly beating LLama 3.1 8B on a few benchmarks.However, as previously the excitement is met with caution because Phi models seem great on benchmarks but then actually talking with them, folks are not as impressed usually. Terry from BigCodeBench also saw a significant decrease in coding ability for Phi 3.5 vs 3.1 Of course, we're not complaining, the models released with 128K context and MIT license. The thing I'm most excited about is the vision model updates, it has been updated with "multi-frame image understanding and reasoning" which is a big deal! This means understanding videos more natively across scenes. This weeks BuzzHey, if you're reading this, while sitting in the bay area, and you don't have plans for exactly a month from now, why don't you come and hack with me? (Register Free)Announcing, the first W&B hackathon, Judgement Day that's going to be focused on LLM as a judge! 
Come hack on innovative LLM as a judge ideas, UIs, evals and more, meet other like minded hackers and AI engineers and win great prizes! 🎨 AI Art: Ideogram Crowns Itself King, Midjourney Joins the Internet & FLUX everywhereWhile there was little news from big LLM labs this week, there is a LOT of AI art news, which is fitting to celebrate 2 year Stable Diffusion 1.4 anniversary! 👑 Ideogram v2: Text Wizardry and API Access (But No Loras… Yet?)With significantly improved realism, and likely the best text generation across all models out there, Ideogram v2 just took over the AI image generation game! Just look at that text sharpness! They now offer a selection of styles (Realistic, Design, 3D, Anime) and any aspect ratios you'd like and also, brands can now provide color palettes to control the outputs! Adding to this is a new API offering (.8c per image for the main model, .5c for the new turbo model of v2!) and a new IOS app, they also added the option (for premium users only) to search through a billion generations and their prompts, which is a great offering as well, as sometimes you don't even know what to prompt. They claim a significant improvement over Flux[pro] and Dalle-3 in text, alignment and overall, interesting that MJ was not compared! Meanwhile, Midjourney finally launched a website and a free tier, so no longer do you have to learn to use Discord to even try Midjourney. Meanwhile Flux enjoys the fruits of Open SourceWhile the Ideogram and MJ fight it out for the closed source, Black Forest Labs enjoys the fruits of released their weights in the open. Fal just released an update that LORAs run 2.5x faster and 2.5x cheaper, CivitAI has LORAs for pretty much every character and celebrity ported to FLUX already, different techniques like ControlNets Unions, IPAdapters and more are being trained as we speak and tutorials upon tutorials are released of how to customize these models, for free (shoutout to my friend Matt Wolfe for this one)you can now train your own face on fal.ai , replicate.com and astria.ai , and thanks to astria, I was able to find some old generations of my LORAs from the 1.5 days (not quite 1.4, but still, enough to show the difference between then and now) and whoa. 🤔 Is This AI Tool Necessary, Bro?Let’s end with a topic that stirred up a hornets nest of opinions this week: Procreate, a beloved iPad design app, publicly declared their "fing hate” for Generative AI. Yeah, you read that right. Hate. The CEO, in a public statement went FULL scorched earth - proclaiming that AI-powered features would never sully the pristine code of their precious app.“Instead of trying to bridge the gap, he’s creating more walls", Wolfram commented, echoing the general “dude… what?” vibe in the space. “It feels marketeerial”, I added, pointing out the obvious PR play (while simultaneously acknowledging the very REAL, very LOUD segment of the Procreate community that cheered this decision).Here’s the thing: you can hate the tech. You can lament the potential demise of the human creative spark. You can rail against the looming AI overlords. But one thing’s undeniable: thi
Look these crazy weeks don't seem to stop, and though this week started out a bit slower (while folks were waiting to see how the speculation about certain red berry flavored conspiracies are shaking out) the big labs are shipping! We've got space uncle Elon dropping an "almost-gpt4" level Grok-2, that's uncensored, has access to real time data on X and can draw all kinds of images with Flux, OpenAI announced a new ChatGPT 4o version (not the one from last week that supported structured outputs, a different one!) and Anthropic dropping something that makes AI Engineers salivate! Oh, and for the second week in a row, ThursdAI live spaces were listened to by over 4K people, which is very humbling, and awesome because for example today, Nous Research announced Hermes 3 live on ThursdAI before the public heard about it (and I had a long chat w/ Emozilla about it, very well worth listening to)TL;DR of all topics covered: * Big CO LLMs + APIs* Xai releases GROK-2 - frontier level Grok, uncensored + image gen with Flux (𝕏, Blog, Try It)* OpenAI releases another ChatGPT-4o (and tops LMsys again) (X, Blog)* Google showcases Gemini Live, Pixel Bugs w/ Gemini, Google Assistant upgrades ( Blog)* Anthropic adds Prompt Caching in Beta - cutting costs by u to 90% (X, Blog)* AI Art & Diffusion & 3D* Flux now has support for LORAs, ControlNet, img2img (Fal, Replicate)* Google Imagen-3 is out of secret preview and it looks very good (𝕏, Paper, Try It)* This weeks Buzz* Using Weights & Biases Weave to evaluate Claude Prompt Caching (X, Github, Weave Dash)* Open Source LLMs * NousResearch drops Hermes 3 - 405B, 70B, 8B LLama 3.1 finetunes (X, Blog, Paper)* NVIDIA Llama-3.1-Minitron 4B (Blog, HF)* AnswerAI - colbert-small-v1 (Blog, HF)* Vision & Video* Runway Gen-3 Turbo is now available (Try It)Big Companies & LLM APIsGrok 2: Real Time Information, Uncensored as Hell, and… Flux?!The team at xAI definitely knows how to make a statement, dropping a knowledge bomb on us with the release of Grok 2. This isn't your uncle's dad joke model anymore - Grok 2 is a legitimate frontier model, folks.As Matt Shumer excitedly put it “If this model is this good with less than a year of work, the trajectory they’re on, it seems like they will be far above this...very very soon” 🚀Not only does Grok 2 have impressive scores on MMLU (beating the previous GPT-4o on their benchmarks… from MAY 2024), it even outperforms Llama 3 405B, proving that xAI isn't messing around.But here's where things get really interesting. Not only does this model access real time data through Twitter, which is a MOAT so wide you could probably park a rocket in it, it's also VERY uncensored. Think generating political content that'd make your grandma clutch her pearls or imagining Disney characters breaking bad in a way that’s both hilarious and kinda disturbing all thanks to Grok 2’s integration with Black Forest Labs Flux image generation model. With an affordable price point ($8/month for x Premium including access to Grok 2 and their killer MidJourney competitor?!), it’ll be interesting to see how Grok’s "truth seeking" (as xAI calls it) model plays out. Buckle up, folks, this is going to be wild, especially since all the normies now have the power to create political memes, that look VERY realistic, within seconds. 
Oh yeah… and there’s the upcoming Enterprise API as well… and Grok 2’s made its debut in the wild on the LMSys Arena, lurking incognito as "sus-column-r" and is now placed on TOP of Sonnet 3.5 and comes in as number 5 overall!OpenAI last ChatGPT is back at #1, but it's all very confusing 😵💫As the news about Grok-2 was settling in, OpenAI decided to, well… drop yet another GPT-4.o update on us. While Google was hosting their event no less. Seriously OpenAI? I guess they like to one-up Google's new releases (they also kicked Gemini from the #1 position after only 1 week there)So what was anonymous-chatbot in Lmsys for the past week, was also released in ChatGPT interface, is now the best LLM in the world according to LMSYS and other folks, it's #1 at Math, #1 at complex prompts, coding and #1 overall. It is also available for us developers via API, but... they don't recommend using it? 🤔 The most interesting thing about this release is, they don't really know to tell us why it's better, they just know that it is, qualitatively and that it's not a new frontier-class model (ie, not 🍓 or GPT5) Their release notes on this are something else 👇 Meanwhile it's been 3 months, and the promised Advanced Voice Mode is only in the hands of a few lucky testers so far. Anthropic Releases Prompt Caching to Slash API Prices By up to 90%Anthropic joined DeepSeek's game of "Let's Give Devs Affordable Intelligence," this week rolling out prompt caching with up to 90% cost reduction on cached tokens (yes NINETY…🤯 ) for those of you new to all this technical sorceryPrompt Caching allows the inference provider to save users money by reusing repeated chunks of a long prompt form cache, reducing pricing and increasing time to first token, and is especially beneficial for longer contexts (>100K) use-cases like conversations with books, agents with a lot of memory, 1000 examples in prompt etc'We covered caching before with Gemini (in Google IO) and last week with DeepSeek, but IMO this is a better implementation from a frontier lab that's easy to get started, manages the timeout for you (unlike Google) and is a no brainer implementation. And, you'll definitely want to see the code to implement it all yourself, (plus Weave is free!🤩):"In this week's buzz category… I used Weave, our LLM observability tooling to super quickly evaluate how much cheaper Cloud Caching from Anthropic really is, I did a video of it and I posted the code … If you're into this and want to see how to actually do this … how to evaluate, the code is there for you" - AlexWith the ridiculous 90% price drop for those cached calls (Haiku basically becomes FREE and cached Claude is costs like Haiku, .30 cents per 1Mtok). For context, I took 5 transcripts of 2 hour podcast conversations, and it amounted to ~110,000 tokens overall, and was able to ask questions across all this text, and it cost me less than $1 (see in the above video) Code Here + Weave evaluation Dashboard hereAI Art, Diffusion, and Personalized AI On the FlySpeaking of mind blowing, Flux took over this week, thanks in no small part to Elon strategically leveraging their tech in Grok (and everyone reminding everyone else, that it's not Grok creating images, it's Flux!)Now, remember, the REAL magic happens when code meets open source, “Flux now has support for LORAs, ControlNet, img2img…" meaning developers have turned those foundational tools into artistic wizardry. With as little as $5 bucks and a few pictures, “You can train the best image model on your own face. 
”🤯 (Seriously folks, head up to Fal.ai, give it a whirl… it’s awesome)Now if you combine the LORA tech with ControlNet tech, you can get VERY creepy very fast (I'm using my own face here but you get the idea), here's "me" as the distracted boyfriend meme, and the girlfriend, and the distraction 😂 (I'm sorry you had to see this, AI has gone too far! Shut it all down!)If seeing those creepy faces on screen isn't for you (I totally get that) there’s also Google IMAGEN 3, freshly escaped from secret preview and just waiting for you to unleash those artistic prompts on it! Google, despite being… Google, somehow figured out that a little competition does a lab good and rolled out a model that’s seriously impressive.Runway Video Gets a "Turbocharged" Upgrade🚀🚀🚀Ever tried those jaw-dropping text-to-video generators but groaned as you watched those seconds of video render painfully slowly?😭 Well Runway, creators of Gen 3, answered our prayers with the distilled turbocharged version that churns out those visuals in a blink 🤯🤯🤯 .What's truly cool is they unlocked it for FREE tier users (sign up and unleash those cinematic prompts right now!), letting everyday folks dip their toes in those previously-unfathomable waters. Even the skeptics at OpenBMB (Junyang knows what I'm talking about…) had to acknowledge that their efforts with MiniCPM V are impressive, especially the smooth way it captures video sequences better than models even twice its size 🤯.Open Source: Hermes 3 and The Next Generation of Open AI 🚀NousResearch Dropped Hermes 3: Your New Favorite AI (Yes Really)In the ultimate “We Dropped This On ThursdAI Before Even HuggingFace”, the legendary team at NousResearch dropped the hottest news since Qwen decided to play math God: Hermes 3 is officially here! 🤯“You’re about to get to use the FIRST big Finetune of LLama 3.1 405B… We don’t think there have been finetunes,” announced Emozilla who’s both co founder and resident master wizard of all things neural net, “And it's available to try for free thanks to Lambda, you can try it out right here ” (you’re all racing to their site as I type this, I KNOW it!). Not ONLY does this beauty run ridiculously smooth on Lambda, but here’s the real TL;DR:* Hermes 3 isn’t just 405B; there are 70B and 8B versions dropping simultaneously on Hugging Face, ready to crush benchmarks and melt your VRAM (in a GOOD way… okay maybe not so great for your power bill 😅).* On Benchmark, they beat LLama 3.1 instruct on a few evals and lose on some, which is quite decent, given that Meta team did an amazing job with their instruct finetuning (and probably spent millions of $ on it too)* Hermes 3 is all about user alignment, which our open source champion Wolfram Ravenwolf summarized beautifully: “When you have a model, and you run it on your system, IT MUST BE LOYAL TO YOU.” 😈Hermes 3 does just that with incredibly precise control via its godlike system prompt: “In Hermes 3 the system prompt is KING,” confirmed Emoz. It’s so powerful that the 405B version was practically suffering existential angst in their first conversation… I read that part outloud during the space
Hold on tight, folks, because THIS week on ThursdAI felt like riding a roller coaster through the wild world of open-source AI - extreme highs, mind-bending twists, and a sprinkle of "wtf is happening?" conspiracy theories for good measure. 😂 Theme of this week is, Open Source keeps beating GPT-4, while we're inching towards intelligence too cheap to meter on the API fronts. We even had a live demo so epic, folks at the Large Hadron Collider are taking notice! Plus, strawberry shenanigans abound (did Sam REALLY tease GPT-5?), and your favorite AI evangelist nearly got canceled on X! Buckle up; this is gonna be another long one! 🚀Qwen2-Math Drops a KNOWLEDGE BOMB: Open Source Wins AGAIN!When I say "open source AI is unstoppable", I MEAN IT. This week, the brilliant minds from Alibaba's Qwen team decided to show everyone how it's DONE. Say hello to Qwen2-Math-72B-Instruct - a specialized language model SO GOOD at math, it's achieving a ridiculous 84 points on the MATH benchmark. 🤯For context, folks... that's beating GPT-4, Claude Sonnet 3.5, and Gemini 1.5 Pro. We're not talking incremental improvements here - this is a full-blown DOMINANCE of the field, and you can download and use it right now. 🔥Get Qwen-2 Math from HuggingFace hereWhat made this announcement EXTRA special was that Junyang Lin , the Chief Evangelist Officer at Alibaba Qwen team, joined ThursdAI moments after they released it, giving us a behind-the-scenes peek at the effort involved. Talk about being in the RIGHT place at the RIGHT time! 😂They painstakingly crafted a massive, math-specific training dataset, incorporating techniques like Chain-of-Thought reasoning (where the model thinks step-by-step) to unlock this insane level of mathematical intelligence."We have constructed a lot of data with the form of ... Chain of Thought ... And we find that it's actually very effective. And for the post-training, we have done a lot with rejection sampling to create a lot of data sets, so the model can learn how to generate the correct answers" - Junyang LinNow I gotta give mad props to Qwen for going beyond just raw performance - they're open-sourcing this beast under an Apache 2.0 license, meaning you're FREE to use it, fine-tune it, adapt it to your wildest mathematical needs! 🎉But hold on... the awesomeness doesn't stop there! Remember those smaller, resource-friendly LLMs everyone's obsessed with these days? Well, Qwen released 7B and even 1.5B versions of Qwen-2 Math, achieving jaw-dropping scores for their size (70 for the 1.5B?? That's unheard of!).🤯 Nisten nearly lost his mind when he heard that - and trust me, he's seen things. 😂"This is insane! This is... what, Sonnet 3.5 gets what, 71? 72? This gets 70? And it's a 1.5B? Like I could run that on someone's watch. Real." - NistenWith this level of efficiency, we're talking about AI-powered calculators, tutoring apps, research tools that run smoothly on everyday devices. The potential applications are endless!MiniCPM-V 2.6: A Pocket-Sized GPT-4 Vision... Seriously! 🤯If Qwen's Math marvel wasn't enough open-source goodness for ya, OpenBMB had to get in on the fun too! This time, they're bringing the 🔥 to vision with MiniCPM-V 2.6 - a ridiculous 8 billion parameter VLM (visual language model) that packs a serious punch, even outperforming GPT-4 Vision on OCR benchmarks!OpenBMB drops a bomb on X hereI'll say this straight up: talking about vision models in a TEXT-based post is hard. You gotta SEE it to believe it. But folks... TRUST ME on this one. 
This model is mind-blowing, capable of analyzing single images, multi-image sequences, and EVEN VIDEOS with an accuracy that rivaled my wildest hopes for open-source.🤯Check out their playground and prepare to be stunnedIt even captured every single nuance in this viral toddler speed-running video I threw at it, with an accuracy I haven't seen in models THIS small:"The video captures a young child's journey through an outdoor park setting. Initially, the child ... is seen sitting on a curved stone pathway besides a fountain, dressed in ... a green t-shirt and dark pants. As the video progresses, the child stands up and begins to walk ..."Junyang said that they actually collabbed with the OpenBMB team and knows firsthand how much effort went into training this model:"We actually have some collaborations with OpenBMB... it's very impressive that they are using, yeah, multi-images and video. And very impressive results. You can check the demo... the performance... We care a lot about MMMU [the benchmark], but... it is actually relying much on large language models." - Junyang LinNisten and I have been talking for months about the relationship between these visual "brains" and the larger language model base powering their "thinking." While it seems smaller models are catching up fast, combining a top-notch visual processor like MiniCPM-V with a monster LLM like Quen72B or Llama 405B could unlock truly unreal capabilities.This is why I'm excited - open source lets us mix and match like this! We can Frankenstein the best parts together and see what emerges... and it's usually something mind-blowing. 🤯Thank you for reading ThursdAI - Recaps of the most high signal AI weekly spaces. This post is public so feel free to share it.From the Large Hadron Collider to YOUR Phone: This Model Runs ANYWHERE 🚀While Qwen2-Math is breaking records on one hand, Nisten's latest creation, Biggie-SmoLlm, is showcasing the opposite side of the spectrum. Trying to get the smallest/fastest coherent LLM possible, Nisten blew up on HuggingFace.Biggie-SmoLlm (Hugging Face) is TINY, efficient, and with some incredible optimization work from the folks right here on the show, it's reaching an insane 330 tokens/second on regular M3 chips. 🤯 That's WAY faster than real-time conversation, folks! And thanks to Eric Hartford's (from Cognitive Computation) awesome new optimizer, (Grok AdamW) it's surprisingly coherent for such a lil' fella.The cherry on top? Someone messaged Nisten saying they're using Biggie-SmoLlm at the Large. Hadron. Collider. 😳 I'll let that sink in for a second.It was incredible having ALL the key players behind Biggie-SmoLlm right there on stage: LDJ (whose Capybara dataset made it teaching-friendly), Junyang (whose Qwen work served as the base), and Eric (the optimizer mastermind himself). THIS, my friends, is what the ThursdAI community is ALL about! 🚀Speaking of which this week we got a new friend of the pod, Mark Saroufim, a long time PyTorch core maintainer, to join the community. This Week's Buzz (and Yes, It Involves Making AI Even Smarter) 🤓NeurIPS Hacker Cup 2024 - Can You Solve Problems Humans Struggle With? 🤔I've gotta hand it to my PyTorch friend, Mark Saroufim. He knows how to make AI interesting! He and his incredible crew (Weiwei from MSFT, some WandB brainiacs, and more) are bringing you NeurIPS Hacker Cup 2024 - a competition to push those coding agents to their ABSOLUTE limits. 🚀This isn't your typical "LeetCode easy" challenge, folks... 
These are problems SO hard, years of competitive programming experience are required to even attempt them! Mark himself said, “At this point, like, if a model does make a significant dent in this competition, uh, I think people would need to acknowledge that, like, LLMs can do a form of planning. ”And don't worry, total beginners: Mark and Weights & Biases are hosting a series of FREE sessions to level you up. Get those brain cells prepped and ready for the challenge and then Join the NeurIPS Hacker Cup Discord P.S. We're ALSO starting a killer AI Salon series in our SF office August 15th! You'll get a chance to chat with researches like Shreya Shankar - she's a leading voice on evaluation. More details and free tickets right here! AI Salons LinkBig Co & APIs - Towards intelligence too cheap to meter Open-source was crushing it this week... but that didn't stop Big AI from throwing a few curveballs. OpenAI is doubling down on structured data (AND cheaper models!), Google slashed Gemini prices again (as we trend towards intelligence too cheap to meter), and a certain strawberry mystery took over Twitter.DeepSeek context caching lowers price by 90% automatiicallyDeepSeek, those masters of ridiculously-good coding AI, casually dropped a bombshell - context caching for their API! 🤯If you're like "wait, what does THAT mean?", listen up because this is game-changing for production-grade AI:* Problem: LLMs get fed the ENTIRE conversation history EVERY. SINGLE. TIME. This wastes compute (and $$$) when info is repeated.* Solution: DeepSeek now remembers what you've said, automatically pulling from a cache when the conversation goes down familiar paths.* The Win: Up to 90% cheaper API calls. Yes, NINETY.😳 It costs 1.4 CENTS per million tokens for cached content. Let THAT sink in. 🤯As Nisten (always bringing the technical breakdowns) explained:"Everyone should be using LLMs this way!...The simplest way is to have a long conversation ... then you save it on disk... you don't have to wait again ... [it's] kind of free. DeepSeek... did this in a more dynamic way". - NistenEven Matt Shumer, who usually advocates for clever prompting over massive context, got legitimately hyped about the possibilities:"For me, and how we use LLMs... instead of gathering a million examples... curate a hundred gold examples... you have something better than if you fine-tuned it, and cheaper, and faster..." - Matt ShumerThink about this... instead of painstakingly fine-tuning, we can "guide" models with expertly crafted examples, letting them learn "on the fly" with minimal cost. Context as the NEW fine-tuning! 🤯P.S - Google actually also has caching on its Gemini API, but you have to opt-in, while this happens automatically with DeepSeek API! Google Goes "Price War Nuclear": Gemini Flash is Officially TOO CHEAPSpeaking of sneaky advancements from Google... they also dropped an update SO casually impactful, it almost got lost in the shuffle. Gemini Flas
Starting Monday, Apple released iOS 18.1 with Apple Intelligence, then Meta dropped SAM-2 (Segment Anything Model) and then Google first open sourced Gemma 2B and now (just literally 2 hours ago, during the live show) released Gemini 1.5 0801 experimental that takes #1 on LMsys arena across multiple categories, to top it all off we also got a new SOTA image diffusion model called FLUX.1 from ex-stability folks and their new Black Forest Lab.This week on the show, we had Joseph & Piotr Skalski from Roboflow, talk in depth about Segment Anything, and as the absolute experts on this topic (Skalski is our returning vision expert), it was an incredible deep dive into the importance dedicated vision models (not VLMs).We also had Lukas Atkins & Fernando Neto from Arcee AI talk to use about their new DistillKit and explain model Distillation in detail & finally we had Cristiano Giardina who is one of the lucky few that got access to OpenAI advanced voice mode + his new friend GPT-4o came on the show as well!Honestly, how can one keep up with all this? by reading ThursdAI of course, that's how but ⚠️ buckle up, this is going to be a BIG one (I think over 4.5K words, will mark this as the longest newsletter I penned, I'm sorry, maybe read this one on 2x? 😂)[ Chapters ] 00:00 Introduction to the Hosts and Their Work01:22 Special Guests Introduction: Piotr Skalski and Joseph Nelson04:12 Segment Anything 2: Overview and Capabilities15:33 Deep Dive: Applications and Technical Details of SAM219:47 Combining SAM2 with Other Models36:16 Open Source AI: Importance and Future Directions39:59 Introduction to Distillation and DistillKit41:19 Introduction to DistilKit and Synthetic Data41:41 Distillation Techniques and Benefits44:10 Introducing Fernando and Distillation Basics44:49 Deep Dive into Distillation Process50:37 Open Source Contributions and Community Involvement52:04 ThursdAI Show Introduction and This Week's Buzz53:12 Weights & Biases New Course and San Francisco Meetup55:17 OpenAI's Advanced Voice Mode and Cristiano's Experience01:08:04 SearchGPT Release and Comparison with Perplexity01:11:37 Apple Intelligence Release and On-Device AI Capabilities01:22:30 Apple Intelligence and Local AI01:22:44 Breaking News: Black Forest Labs Emerges01:24:00 Exploring the New Flux Models01:25:54 Open Source Diffusion Models01:30:50 LLM Course and Free Resources01:32:26 FastHTML and Python Development01:33:26 Friend.com: Always-On Listening Device01:41:16 Google Gemini 1.5 Pro Takes the Lead01:48:45 GitHub Models: A New Era01:50:01 Concluding Thoughts and FarewellShow Notes & Links* Open Source LLMs* Meta gives SAM-2 - segment anything with one shot + video capability! 
(X, Blog, DEMO)* Google open sources Gemma 2 2.6B (Blog, HF)* MTEB Arena launching on HF - Embeddings head to head (HF)* Arcee AI announces DistillKit - (X, Blog, Github)* AI Art & Diffusion & 3D* Black Forest Labs - FLUX new SOTA diffusion models (X, Blog, Try It)* Midjourney 6.1 update - greater realism + potential Grok integration (X)* Big CO LLMs + APIs* Google updates Gemini 1.5 Pro with 0801 release and is #1 on LMsys arena (X)* OpenAI started alpha GPT-4o voice mode (examples)* OpenAI releases SearchGPT (Blog, Comparison w/ PPXL)* Apple releases beta of iOS 18.1 with Apple Intelligence (X, hands on, Intents )* Apple released a technical paper of apple intelligence* This weeks Buzz* AI Salons in SF + New Weave course for WandB featuring yours truly!* Vision & Video* Runway ML adds Gen -3 image to video and makes it 7x faster (X)* Tools & Hardware* Avi announces friend.com* Jeremy Howard releases FastHTML (Site, Video)* Applied LLM course from Hamel dropped all videosOpen SourceIt feels like everyone and their grandma is open sourcing incredible AI this week! Seriously, get ready for segment-anything-you-want + real-time-video capability PLUS small AND powerful language models.Meta Gives Us SAM-2: Segment ANYTHING Model in Images & Videos... With One Click!Hold on to your hats, folks! Remember Segment Anything, Meta's already-awesome image segmentation model? They've just ONE-UPPED themselves. Say hello to SAM-2 - it's real-time, promptable (you can TELL it what to segment), and handles VIDEOS like a champ. As I said on the show: "I was completely blown away by segment anything 2".But wait, what IS segmentation? Basically, pixel-perfect detection - outlining objects with incredible accuracy. My guests, the awesome Piotr Skalski and Joseph Nelson (computer vision pros from Roboflow), broke it down historically, from SAM 1 to SAM 2, and highlighted just how mind-blowing this upgrade is."So now, Segment Anything 2 comes out. Of course, it has all the previous capabilities of Segment Anything ... But the segment anything tool is awesome because it also can segment objects on the video". - Piotr SkalskiThink about Terminator vision from the "give me your clothes" bar scene: you see a scene, instantly "understand" every object separately, AND track it as it moves. SAM-2 gives us that, allowing you to click on a single frame, and BAM - perfect outlines that flow through the entire video! I played with their playground, and you NEED to try it - you can blur backgrounds, highlight specific objects... the possibilities are insane. Playground LinkIn this video, Piotr annotated only the first few frames of the top video, and SAM understood the bottom two shot from 2 different angles!Okay, cool tech, BUT why is it actually USEFUL? Well, Joseph gave us incredible examples - from easier sports analysis and visual effects (goodbye manual rotoscoping) to advances in microscopic research and even galactic exploration! Basically, any task requiring precise object identification gets boosted to a whole new level."SAM does an incredible job at creating pixel perfect outlines of everything inside visual scenes. And with SAM2, it does it across videos super well, too ... That capability is still being developed for a lot of AI Models and capabilities. 
So having very rich ability to understand what a thing is, where that thing is, how big that thing is, allows models to understand spaces and reason about them" - Joseph NelsonAND if you combine this power with other models (like Piotr is already doing!), you get zero-shot segmentation - literally type what you want to find, and the model will pinpoint it in your image/video. It's early days, but get ready for robotics applications, real-time video analysis, and who knows what else these clever hackers are dreaming up! 🤯Check out Piotr's Zero Shot Florence + Sam2 ImplementationBest of all? Apache 2 license, baby! As Joseph said, "Open source is foundational to making the accessibility, the use cases, and the advancement of the field overall", and this is a prime example. Huge kudos to Meta for empowering us with this tech.The whole conversation w/ Piotr & Joseph is very much worth listening to on the pod 🎙️Google Throws Down The Gauntlet: Open Sourcing GemMA 2 2.6BIt was Meta vs. Google on Monday because NOT to be outdone, Google also went on an open-sourcing spree. This time, they gifted us GemMA 2 (a 2.6 billion parameter powerhouse), alongside a safety-focused suite called ShieldGemMA AND a transparency tool called GemmaScope.So what makes Gemma 2 special? First off, it's optimized for on-device use, meaning super-efficient local running. BUT there's a catch, folks... They claim it beats Mixtral AND Llama 2 70B on the LMsys Arena leaderboard, with an ELO score of 1126. Hold on, a 2 billion parameter model outperforming the big boys? 🤨 As LDJ (one of my regular co-hosts) said on the show:"Yeah, I think my best theory here is... there's at least two or three variables at play ... In LMSys, people are much more likely to do single turn, and within LMSys, people will usually be biased more towards rating models with a more recent knowledge cutoff as higher".Translation? It might be gaming the system a bit, but either way, Gemma 2 is an exciting release - super fast, small enough for on-device applications, and coming with safety tools right out the gate! I think Zenova (our Hugging Face wizard) is already running this on WebGPU! You NEED to try it out.Gemma 2 HF LinkAnd GemmaScope? That's some cool, cool stuff too. Think about peeking inside the "brain" of the model - you can actually SEE how Gemma 2 processes information. Remember Anthropic Mechinterp? It's like that, giving us unprecedented transparency into how these systems actually "think". You gotta see it on Neuronpedia. Neuronpedia linkIt's Meta versus Google - round one, FIGHT! 🥊Distilling Knowlege: Arcee AI Drops DistilKit!Just when I thought the week was done throwing surprises, Arcee AI casually dropped DistilKit - an open source tool to build distilled language models. Now, this is some NEXT level stuff, folks. We talked with Lukas Atkins and Fernando (the brilliant minds behind DistillKit), and I finally learned what the heck "distillation" really means."TLDR - we teach a smaller model to think like a bigger model"In a nutshell: teach a smaller model how to think like a larger one. Think GPT-4o and GPT-4 Mini, where the smaller model supposedly got the "essence" of the bigger version. Or imagine a tiny Llama that inherited the smarts of 405B - ridiculous! 🤯 As Fernando eloquently put it:So in the finetuning that we have been doing, just in terms of generating text instructions and so on, we were observing only the token that was generated from the teacher model. 
And now with the distillation, we are observing the whole distribution of the tokens that could be sampledNow I admit, even after Fernando's expert breakdown, my brain still kind of melted. 🫠 BUT, here's why this matters: distilled models are super efficient, saving on cost and resources. Imagine powerful AI that runs seamlessly on your phone! 🤯 Arcee is making this possible for everyone.Check Out DistilKit HereWas it pure coincidence they released this on the same week as the Llama 3.1 LICENSE CHANGE (Zuckerberg is cl
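To make the "whole distribution" point from Fernando's explanation above a bit more concrete, here is a minimal sketch of a classic logit-distillation loss in PyTorch. This is a generic illustration of the technique, not DistillKit's actual code; the function, shapes and hyperparameters are all made up for the example.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend a hard cross-entropy loss with a soft KL loss against the teacher's
    full next-token distribution (the 'whole distribution' Fernando mentions)."""
    # Soft targets: match the teacher's temperature-smoothed distribution.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature**2

    # Hard targets: standard next-token cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(student_logits.reshape(-1, student_logits.size(-1)), labels.reshape(-1))

    return alpha * kl + (1 - alpha) * ce

# Toy shapes: batch of 2 sequences, 8 tokens, vocab of 32k.
student_logits = torch.randn(2, 8, 32000, requires_grad=True)
teacher_logits = torch.randn(2, 8, 32000)  # would come from the frozen teacher
labels = torch.randint(0, 32000, (2, 8))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

In practice the teacher logits come from the frozen larger model under torch.no_grad(), and only the student's weights get updated, which is what lets a small model inherit the shape of the big model's predictions rather than just its sampled tokens.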
Holy s**t, folks! I was off for two weeks, last week OpenAI released GPT-4o-mini and everyone was in my mentions saying, Alex, how are you missing this?? and I'm so glad I missed that last week and not this one, because while GPT-4o-mini is incredible (GPT-4o level distill with incredible speed and almost 99% cost reduction from 2 years ago?) it's not open source. So welcome back to ThursdAI, and buckle up because we're diving into what might just be the craziest week in open-source AI since... well, ever!This week, we saw Meta drop LLAMA 3.1 405B like it's hot (including updated 70B and 8B), Mistral joining the party with their Large V2, and DeepSeek quietly updating their coder V2 to blow our minds. Oh, and did I mention Google DeepMind casually solving math Olympiad problems at silver level medal 🥈? Yeah, it's been that kind of week.TL;DR of all topics covered: * Open Source* Meta LLama 3.1 updated models (405B, 70B, 8B) - Happy LLama Day! (X, Announcement, Zuck, Try It, Try it Faster, Evals, Provider evals)* Mistral Large V2 123B (X, HF, Blog, Try It)* DeepSeek-Coder-V2-0724 update (API only)* Big CO LLMs + APIs* 🥈 Google Deepmind wins silver medal at Math Olympiad - AlphaGeometry 2 (X)* OpenAI teases SearchGPT - their reimagined search experience (Blog)* OpenAI opens GPT-4o-mini finetunes + 2 month free (X)* This weeks Buzz* I compare 5 LLama API providers for speed and quantization using Weave (X)* Voice & Audio* Daily announces a new open standard for real time Voice and Video RTVI-AI (X, Try it, Github)Meta LLAMA 3.1: The 405B Open Weights Frontier Model Beating GPT-4 👑Let's start with the star of the show: Meta's LLAMA 3.1. This isn't just a 0.1 update; it's a whole new beast. We're talking about a 405 billion parameter model that's not just knocking on GPT-4's door – it's kicking it down.Here's the kicker: you can actually download this internet scale intelligence (if you have 820GB free). That's right, a state-of-the-art model beating GPT-4 on multiple benchmarks, and you can click a download button. As I said during the show, "This is not only refreshing, it's quite incredible."Some highlights:* 128K context window (finally!)* MMLU score of 88.6* Beats GPT-4 on several benchmarks like IFEval (88.6%), GSM8K (96.8%), and ARC Challenge (96.9%)* Has Tool Use capabilities (also beating GPT-4) and is Multilingual (ALSO BEATING GPT-4)But that's just scratching the surface. Let's dive deeper into what makes LLAMA 3.1 so special.The Power of Open WeightsMark Zuckerberg himself dropped an exclusive interview with our friend Rowan Cheng from Rundown AI. And let me tell you, Zuck's commitment to open-source AI is no joke. He talked about distillation, technical details, and even released a manifesto on why open AI (the concept, not the company) is "the way forward".As I mentioned during the show, "The fact that this dude, like my age, I think he's younger than me... knows what they released to this level of technical detail, while running a multi billion dollar company is just incredible to me."Evaluation ExtravaganzaThe evaluation results for LLAMA 3.1 are mind-blowing. We're not just talking about standard benchmarks here. The model is crushing it on multiple fronts:* MMLU (Massive Multitask Language Understanding): 88.6%* IFEval (Instruction Following): 88.6%* GSM8K (Grade School Math): 96.8%* ARC Challenge: 96.9%But it doesn't stop there. 
The fine folks at meta also for the first time added new categories like Tool Use (BFCL 88.5) and Multilinguality (Multilingual MGSM 91.6) (not to be confused with MultiModality which is not yet here, but soon) Now, these are official evaluations from Meta themselves, that we know, often don't really represent the quality of the model, so let's take a look at other, more vibey results shall we? On SEAL leaderboards from Scale (held back so can't be trained on) LLama 405B is beating ALL other models on Instruction Following, getting 4th at Coding and 2nd at Math tasks. On MixEval (the eval that approximates LMsys with 96% accuracy), my colleagues Ayush and Morgan got a whopping 66%, placing 405B just after Clause Sonnet 3.5 and above GPT-4oAnd there are more evals that all tell the same story, we have a winner here folks (see the rest of the evals in my thread roundup)The License Game-ChangerMeta didn't just release a powerful model; they also updated their license to allow for synthetic data creation and distillation. This is huge for the open-source community.LDJ highlighted its importance: "I think this is actually pretty important because even though, like you said, a lot of people still train on OpenAI outputs anyways, there's a lot of legal departments and a lot of small, medium, and large companies that they restrict the people building and fine-tuning AI models within that company from actually being able to build the best models that they can because of these restrictions."This update could lead to a boom in custom models and applications across various industries as companies can start distilling, finetuning and creating synthetic datasets using these incredibly smart models.405B: A Double-Edged SwordWhile the 405B model is incredibly powerful, it's not exactly practical for most production use cases as you need 2 nodes of 8 H100s to run it in full precision. Despite the fact that pricing wars already started, and we see inference providers at as low as 2.7$/1M tokens, this hardly makes sense when GPT-4o mini is 15 cents. However, this model shines in other areas:* Synthetic Data Generation & Distillation: Its power and the new license make it perfect for creating high-quality training data and use it to train smaller models* LLM as a Judge: The model's reasoning capabilities make it an excellent candidate for evaluating other AI outputs.* Research and Experimentation: For pushing the boundaries of what's possible in AI.The Smaller Siblings: 70B and 8BWhile the 405B model is grabbing headlines, don't sleep on its smaller siblings. The 70B and 8B models got significant upgrades too.The 70B model saw impressive gains:* MMLU: 80.9 to 86* IFEval: 82 to 87* GPQA: 39 to 46The 8B model, in particular, could be a hidden gem. As Kyle Corbitt from OpenPipe discovered, a fine-tuned 8B model could potentially beat a prompted GPT-4 Mini in specific tasks.No multi-modalityWhile Meta definitely addressed everything we had to ask for from the Llama 3 release, context window, incredible performance, multi-linguality, tool-use, we still haven't seen multi-modality with Llama. We still can't show it pictures or talk to it! 
However, apparently they have trained it to be mutli-modal as well but haven't yet released those weights, but they went into this in great detail in the paper and even showed a roadmap, stating that they will release it soon-ish (not in EU though)This Week's Buzz: Weave-ing Through LLama ProvidersIn the spirit of thorough evaluation, I couldn't resist putting LLAMA 3.1 through its paces across different providers. Using Weights & Biases Weave (https://wandb.me/weave), our evaluation and tracing framework for LLMs, I ran a comparison between various LLAMA providers.Here's what I found:* Different providers are running the model with varying optimizations (VLLM, FlashAttention3, etc.)* Some are serving quantized versions, which can affect output style and quality* Latency and throughput vary significantly between providersThe full results are available in a Weave comparison dashboard, which you can check out for a deep dive into the nuances of model deployment and code is up on Github if you want to verify this yourself or see how easy this is to do with WeaveMistral Crashes the Party with Large V2 123B model (X, HF, Blog, Try It)Just when we thought Meta had stolen the show, Mistral AI decided to drop their own bombshell: Mistral Large V2. This 123 billion parameter dense model is no joke, folks. With an MMLU score of 84.0, 128K context window and impressive performance across multiple benchmarks, it's giving LLAMA 3.1 a run for its money, especially in some coding tasks while being optimized to run on a single node!Especially interesting is the function calling on which they claim SOTA, without telling us which metric they used (or comparing to Llama 3.1) but are saying that they now support parallel and sequential function calling! DeepSeek updates DeepSeek Coder V2 to 0724While everyone was busy gawking at Meta and Mistral, DeepSeek quietly updated their coder model, and holy smokes, did they deliver! DeepSeek Coder v2 is now performing at GPT-4 and Claude 3.5 Sonnet levels on coding tasks. As Junyang Lin noted during our discussion, "DeepSeek Coder and DeepSeek Coder v2 should be the state of the art of the code-specific model."Here's the result from BigCodeBench and from Aider Chat (code editing dashboard)But it's not just about raw performance. DeepSeek is bringing some serious innovation to the table. They've added JSON mode, function calling, and even a fill-in-the-middle completion feature in beta. Plus, they've bumped up their max token generation to 8K. And let's talk about that API pricing – it's ridiculously cheap, at 14c / 1M tokens!. We're talking about costs that are competitive with GPT-4 Mini, but with potentially better performance on coding tasks. It's a game-changer for developers and companies looking to integrate powerful coding AI without breaking the bank.Google DeepMind's Math Wizardry: From Silver Medals to AI ProdigiesJust when we thought this week couldn't get any crazier, Google DeepMind decides to casually drop a bombshell that would make even the most decorated mathletes sweat. They've created an AI system that can solve International Mathematical Olympiad (IMO) problems at a silver medalist level. I mean, come on! As if the AI world wasn't moving fast enough, now we've got silicon-based Math Olympians?This isn't just any run-of-the-mill calculator on steroids. We're talking about a combination of AlphaProof, a new breakthrough
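Circling back to the provider comparison in this week's Buzz above: since most LLama 3.1 hosts expose OpenAI-compatible endpoints, the whole experiment boils down to one traced function pointed at different base URLs. Here's a rough sketch of that setup with Weave; the provider names, base URLs and model IDs below are placeholders, not the exact configuration behind the dashboard linked in the post.

```python
import weave
from openai import OpenAI

weave.init("llama-405b-provider-comparison")  # project name is arbitrary

# Hypothetical provider list: most Llama 3.1 hosts expose OpenAI-compatible APIs.
PROVIDERS = {
    "provider_a": {"base_url": "https://api.provider-a.example/v1", "model": "llama-3.1-405b-instruct"},
    "provider_b": {"base_url": "https://api.provider-b.example/v1", "model": "llama-3.1-405b-instruct"},
}

@weave.op()  # every call gets traced: inputs, outputs, latency, errors
def ask(provider: str, prompt: str) -> str:
    cfg = PROVIDERS[provider]
    client = OpenAI(base_url=cfg["base_url"], api_key="YOUR_KEY")
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

for name in PROVIDERS:
    print(name, ask(name, "What is 17 * 23?"))
```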
Hey all, Alex here… well, not actually here, I'm scheduling this post in advance, which I haven't done before, because I'm going on vacation! That's right, next week is my birthday 🎉 and a much needed break, somewhere with a beach is awaiting, but I didn't want to leave you hanging for too long, so I'm posting this episode with some amazing previously unreleased material.

Mixture of Agents x2

Back in the far away days of June 20th (not that long ago but it feels like ages!), Together AI announced a new paper, released code and posted a long post about a new method of collaboration between smaller models to beat larger models. They called it Mixture of Agents, and James Zou joined us to chat about that effort. Shortly after that - in fact, during the live ThursdAI show - Kyle Corbitt announced that OpenPipe also researched an approach similar to the above, using different models and a bit of different reasoning, and also went after the coveted AlpacaEval benchmark, achieving a SOTA score of 68.8 using this method. And I was delighted to invite both James and Kyle to chat about their respective approaches the same week that both broke AlpacaEval SOTA, and to hear how utilizing collaboration between LLMs can significantly improve their outputs!

This weeks buzz - what I learned at W&B this week

So much buzz this week from the Weave team, it's hard to know what to put in here. I can start with the incredible integrations my team landed: Mistral AI, LlamaIndex, DSPy, OpenRouter and even local models served by Ollama, LM Studio and LLamaFile can now be auto tracked with Weave, which means you literally only have to instantiate Weave and it'll auto track everything for you! But I think the biggest, hugest news from this week is the great eval comparison system that the Weave team just pushed; it's honestly so feature rich that I'll have to do a deeper dive on it later, but I wanted to make sure I include at least a few screencaps because I think it looks fantastic!

Open Router - A unified interface for LLMs

I've been a long time fan of OpenRouter.ai and I was very happy to have Alex Atallah on the show to talk about Open Router (even if this did happen back in April!), and I'm finally satisfied enough with the sound quality to release this conversation. Open Router serves both foundational models like GPT, Claude and Gemini and also Open Source ones, and supports the OpenAI SDK format, making it super simple to play around and evaluate all of them with the same code. They even provide a few models for free! Right now you can use Phi, for example, completely free via their API. Alex goes deep into the areas of Open Router that I honestly didn't really know about, like being a marketplace, knowing what trendy LLMs are being used by people in near real time (check out WebSim!) and more very interesting things! Give that conversation a listen, I'm sure you'll enjoy it!

That's it folks, no news this week. I would instead like to recommend a new newsletter by friends of the pod Tanishq Abraham and Aran Komatsuzaki, both of whom are doing a weekly paper X space and recently started posting it on Substack as well! It's called AI papers of the week, and if you're into papers which we don't usually cover, there's no better duo! In fact, Tanishq often used to come to ThursdAI to explain papers so you may recognize his voice :) See you all in two weeks after I get some seriously needed R&R 👋 😎🏖️

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
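As a quick footnote to both conversations above: because OpenRouter speaks the OpenAI SDK format, a toy Mixture-of-Agents-style pipeline is only a few lines, with several "proposer" models answering independently and one "aggregator" model synthesizing the final answer. This is my simplified illustration of the idea, not Together AI's or OpenPipe's actual method, and the model IDs are examples you'd swap for whatever is currently listed on openrouter.ai.

```python
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint for many models.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")

def chat(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# Layer 1: several "proposer" models answer independently (example model IDs).
proposers = [
    "mistralai/mixtral-8x7b-instruct",
    "qwen/qwen-2-72b-instruct",
    "meta-llama/llama-3-70b-instruct",
]
question = "Explain why the sky is blue in two sentences."
drafts = [chat(m, question) for m in proposers]

# Layer 2: an aggregator model reads all drafts and writes the final answer.
aggregation_prompt = (
    "You are given several candidate answers to the question below. "
    "Synthesize them into one best answer.\n\n"
    f"Question: {question}\n\n"
    + "\n\n".join(f"Candidate {i + 1}: {d}" for i, d in enumerate(drafts))
)
print(chat("meta-llama/llama-3-70b-instruct", aggregation_prompt))
```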
Hey everyone! Happy 4th of July to everyone who celebrates! I celebrated today by having an intimate conversation with 600 of my closest X friends 😂 Joking aside, today is a celebratory episode, 52nd consecutive weekly ThursdAI show! I've been doing this as a podcast for a year now!Which means, there are some of you, who've been subscribed for a year 😮 Thank you! Couldn't have done this without you. In the middle of my talk at AI Engineer (I still don't have the video!) I had to plug ThursdAI, and I asked the 300+ audience who is a listener of ThursdAI, and I saw a LOT of hands go up, which is honestly, still quite humbling. So again, thank you for tuning in, listening, subscribing, learning together with me and sharing with your friends! This week, we covered a new (soon to be) open source voice model from KyutAI, a LOT of open source LLM, from InternLM, Cognitive Computations (Eric Hartford joined us), Arcee AI (Lukas Atkins joined as well) and we have a deep dive into GraphRAG with Emil Eifrem CEO of Neo4j (who shares why it was called Neo4j in the first place, and that he's a ThursdAI listener, whaaat? 🤯), this is definitely a conversation you don't want to miss, so tune in, and read a breakdown below:TL;DR of all topics covered: * Voice & Audio* KyutAI releases Moshi - first ever 7B end to end voice capable model (Try it)* Open Source LLMs * Microsoft Updated Phi-3-mini - almost a new model * InternLM 2.5 - best open source model under 12B on Hugging Face (HF, Github)* Microsoft open sources GraphRAG (Announcement, Github, Paper)* OpenAutoCoder-Agentless - SOTA on SWE Bench - 27.33% (Code, Paper)* Arcee AI - Arcee Agent 7B - from Qwen2 - Function / Tool use finetune (HF)* LMsys announces RouteLLM - a new Open Source LLM Router (Github)* DeepSeek Chat got an significant upgrade (Announcement)* Nomic GPT4all 3.0 - Local LLM (Download, Github)* This weeks Buzz* New free Prompts course from WandB in 4 days (pre sign up)* Big CO LLMs + APIs* Perplexity announces their new pro research mode (Announcement)* X is rolling out "Grok Analysis" button and it's BAD in "fun mode" and then paused roll out* Figma pauses the rollout of their AI text to design tool "Make Design" (X)* Vision & Video* Cognitive Computations drops DolphinVision-72b - VLM (HF)* Chat with Emil Eifrem - CEO Neo4J about GraphRAG, AI EngineerVoice & AudioKyutAI Moshi - a 7B end to end voice model (Try It, See Announcement)Seemingly out of nowhere, another french AI juggernaut decided to drop a major announcement, a company called KyutAI, backed by Eric Schmidt, call themselves "the first European private-initiative laboratory dedicated to open research in artificial intelligence" in a press release back in November of 2023, have quite a few rockstar co founders ex Deep Mind, Meta AI, and have Yann LeCun on their science committee.This week they showed their first, and honestly quite mind-blowing release, called Moshi (Japanese for Hello, Moshi Moshi), which is an end to end voice and text model, similar to GPT-4o demos we've seen, except this one is 7B parameters, and can run on your mac! While the utility of the model right now is not the greatest, not remotely close to anything resembling the amazing GPT-4o (which was demoed live to me and all of AI Engineer by Romain Huet) but Moshi shows very very impressive stats! Built by a small team during only 6 months or so of work, they have trained an LLM (Helium 7B) an Audio Codec (Mimi) a Rust inference stack and a lot more, to give insane performance. 
Model latency is 160ms and mic-to-speakers latency is 200ms, which is so fast it almost seems too fast. The demo often responds faster than I'm able to finish my sentence, and it results in an uncanny, "reading my thoughts" type feeling. The most important part is this though, a quote from KyutAI's post after the announcement: Developing Moshi required significant contributions to audio codecs, multimodal LLMs, multimodal instruction-tuning and much more. We believe the main impact of the project will be sharing all Moshi's secrets with the upcoming paper and open-source of the model.

I'm really looking forward to how this tech can be applied to the incredible open source models we already have out there! Speaking to our LLMs is now officially here in the Open Source, way before we got GPT-4o, and it's exciting!

Open Source LLMs

Microsoft stealth updates Phi-3 Mini to make it almost a new model

So stealth, in fact, that I didn't even have this update in my notes for the show, but thanks to the incredible community (Bartowski, Akshay Gautam) who made sure we didn't miss it, because it's so huge. The model used additional post-training data leading to substantial gains on instruction following and structure output. We also improve multi-turn conversation quality, explicitly support <|system|> tag, and significantly improve reasoning capability. The Phi-3 June update is quite significant across the board, just look at some of these scores: a 354.78% improvement in JSON structured output, 30% on GPQA. But also specifically for coding, a 33→93 jump in Java coding, 33→73 in Typescript, 27→85 in Python! These are just incredible numbers, and I definitely agree with Bartowski here, there's enough here to call this a whole new model rather than a "seasonal update".

Qwen-2 is the star of the show right now

Week in and week out, ThursdAI seems to be the watercooler for the best finetuners in the community to come, hang, share notes, and announce their models. A month after Qwen-2 was announced on the ThursdAI stage live by friend of the pod and Qwen dev lead Junyang Lin, and a week after it re-took number 1 on the revamped open LLM leaderboard on HuggingFace, we now have great finetunes on top of Qwen-2.

Qwen-2 is the star of the show right now. Because there's no better model. This is like GPT 4 level. It's Open Weights GPT 4. We can do what we want with it, and it's so powerful, and it's multilingual, and it's everything, it's like the dream model. I love it - Eric Hartford, Cognitive Computations

We've had 2 model finetunes based on Qwen 2 and their authors on the show this week. First was Lukas Atkins from Arcee AI (the company behind MergeKit); they released Arcee Agent, a 7B Qwen-2 finetune/merge specifically focusing on tool use and function calling. We also had a chat with Eric Hartford from Cognitive Computations (which Lukas previously participated in) about the biggest open source VLM on top of Qwen-2, a 72B parameter Dolphin Vision (trained by StableQuan, available on the HUB), and it's likely the biggest open source VLM that we've seen so far. The most exciting part about it is Fernando Neta's "SexDrugsRockandroll" dataset, which supposedly contains, well... a lot of uncensored stuff, and it's perfectly able to discuss and analyze images with mature and controversial content.

InternLM 2.5 - SOTA open source under 12B with 1M context (HF, Github)

The folks at Shanghai AI released InternLM 2.5 7B and a chat version, along with a whopping 1M context window extension.
These metrics are ridiculous, beating LLama-3 8B on literally every metric on the new HF leaderboard, and even beating Llama-3 70B on MATH and coming close on GPQA!The folks at Intern not only released a beast of a model, but also have released a significantly imporved tool use capabilities with it, including their own agentic framework called Lagent, which comes with Code Interpreter (python execution), Search Capabilities, and of course the abilities to plug in your own tools.How will you serve 1M context on production you ask? Well, these folks ALSO open sourced LMDeploy, "an efficient, user-friendly toolkit designed for compressing, deploying, and serving LLM models" which has been around for a while, but is now supporting this new model of course, handles dynamic NTK and some offloading of context etc' So an incredible model + tools release, can't wait to play around with this! ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.This weeks Buzz (What I learned with WandB this week)Hey, did you know we at Weights & Biases have free courses? While some folks ask you for a LOT of money for basic courses, at Weights & Biases, they are... you guessed it, completely free! And a lot of effort goes into recording and building the agenda, so I'm happy to announce that our "Developer's Guide to LLM Prompting" course is going to launch in 4 days! Delivered by my colleague Anish (who's just an amazing educator) and Teodora from AutogenAI, you will learn everything prompt building related, and even if you are a seasoned prompting pro, there will be something for you there! Pre-register for the course HEREBig CO LLMs + APIsHow I helped roll back an XAI feature and Figma rolled back theirs We've covered Grok (with a K this time) from XAI multiple times, and while I don't use it's chat interface that much, or the open source model, I do think they have a huge benefit in having direct access to real time data from the X platform. Given that I basically live on X (to be able to deliver all these news to you) I started noticing a long promised, Grok Analysis button show up under some posts, first on mobile, then on web versions of X. Of course I had to test it, and whoa, I was honestly shocked at just how unhinged and profanity laced the analysis was. Now I'm not easily shocked, I've seen jailbroken LLMs before, I tried to get chatGPT to say curse words multiple times, but it's one thing when you expect it and a complete another thing when a billion dollar company releases a product that answers... well like this: Luckily Igor Babushkin (Co founder of XAI) noticed and the roll out was paused, so looks like I helped red team grok! 🫡 Figma pauses AI "make design" featureAnother AI feature was paused by a big company after going viral on X (what is it about X specifically?) and this time it was Figma! In a supe
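Going back to the InternLM 2.5 + LMDeploy release covered earlier in this episode, here is a minimal sketch of what serving it with LMDeploy's Python pipeline roughly looks like, following the pattern in their README. Treat the exact config fields and model ID as things to double check against the docs, and note that actually using the advertised 1M context takes far more memory (and the offloading tricks mentioned above) than this toy example.

```python
# pip install lmdeploy
from lmdeploy import pipeline, TurbomindEngineConfig

# session_len controls the maximum context; 32k here as a modest example,
# the model itself advertises up to 1M tokens given enough memory/offloading.
engine_cfg = TurbomindEngineConfig(session_len=32768)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)

responses = pipe([
    "Summarize the difference between dense and MoE transformer models in 3 bullets.",
])
print(responses[0].text)
```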
Hey everyone, sending a quick one today, no deep dive, as I'm still in the middle of AI Engineer World's Fair 2024 in San Francisco (in fact, I'm writing this from the incredible floor 32 presidential suite, that the team here got for interviews, media and podcasting, and hey to all new folks who I’ve just met during the last two days!) It's been an incredible few days meeting so many ThursdAI community members, listeners and folks who came on the pod! The list honestly is too long but I've got to meet friends of the pod Maxime Labonne, Wing Lian, Joao Morra (crew AI), Vik from Moondream, Stefania Druga not to mention the countless folks who came up and gave high fives, introduced themselves, it was honestly a LOT of fun. (and it's still not over, if you're here, please come and say hi, and let's take a LLM judge selfie together!)On today's show, we recorded extra early because I had to run and play dress up, and boy am I relieved now that both the show and the talk are behind me, and I can go an enjoy the rest of the conference 🔥 (which I will bring you here in full once I get the recording!) On today's show, we had the awesome pleasure to have Surya Bhupatiraju who's a research engineer at Google DeepMind, talk to us about their newly released amazing Gemma 2 models! It was very technical, and a super great conversation to check out! Gemma 2 came out with 2 sizes, a 9B and a 27B parameter models, with 8K context (we addressed this on the show) and this 27B model incredible performance is beating LLama-3 70B on several benchmarks and is even beating Nemotron 340B from NVIDIA! This model is also now available on the Google AI studio to play with, but also on the hub! We also covered the renewal of the HuggingFace open LLM leaderboard with their new benchmarks in the mix and normalization of scores, and how Qwen 2 is again the best model that's tested! It's was a very insightful conversation, that's worth listening to if you're interested in benchmarks, definitely give it a listen. Last but not least, we had a conversation with Ethan Sutin, the co-founder of Bee Computer. At the AI Engineer speakers dinner, all the speakers received a wearable AI device as a gift, and I onboarded (cause Swyx asked me) and kinda forgot about it. On the way back to my hotel I walked with a friend and chatted about my life. When I got back to my hotel, the app prompted me with "hey, I now know 7 new facts about you" and it was incredible to see how much of the conversation it was able to pick up, and extract facts and eve TODO's! So I had to have Ethan on the show to try and dig a little bit into the privacy and the use-cases of these hardware AI devices, and it was a great chat! Sorry for the quick one today, if this is the first newsletter after you just met me and register, usually there’s a deeper dive here, expect a more in depth write-ups in the next sessions, as now I have to run down and enjoy the rest of the conference! Here's the TL;DR and my RAW show notes for the full show, in case it's helpful! * AI Engineer is happening right now in SF* Tracks include Multimodality, Open Models, RAG & LLM Frameworks, Agents, Al Leadership, Evals & LLM Ops, CodeGen & Dev Tools, Al in the Fortune 500, GPUs & Inference* Open Source LLMs * HuggingFace - LLM Leaderboard v2 - (Blog)* Old Benchmarks sucked and it's time to renew* New Benchmarks* MMLU-Pro (Massive Multitask Language Understanding - Pro version, paper)* GPQA (Google-Proof Q&A Benchmark, paper). 
GPQA is an extremely hard knowledge dataset* MuSR (Multistep Soft Reasoning, paper).* MATH (Mathematics Aptitude Test of Heuristics, Level 5 subset, paper)* IFEval (Instruction Following Evaluation, paper)* 🤝 BBH (Big Bench Hard, paper). BBH is a subset of 23 challenging tasks from the BigBench dataset* The community will be able to vote for models, and we will prioritize running models with the most votes first* Mozilla announces Builders Accelerator @ AI Engineer (X)* Theme: Local AI * 100K non dilutive funding* Google releases Gemma 2 (X, Blog)* Big CO LLMs + APIs* UMG, Sony, Warner sue Udio and Suno for copyright (X)* were able to recreate some songs* sue both companies* have 10 unnamed individuals who are also on the suit* Google Chrome Canary has Gemini nano (X)* * Super easy to use window.ai.createTextSession()* Nano 1 and 2, at a 4bit quantized 1.8B and 3.25B parameters has decent performance relative to Gemini Pro* Behind a feature flag* Most text gen under 500ms * Unclear re: hardware requirements * Someone already built extensions* someone already posted this on HuggingFace* Anthropic Claude share-able projects (X)* Snapshots of Claude conversations shared with your team* Can share custom instructions* Anthropic has released new "Projects" feature for Claude AI to enable collaboration and enhanced workflows* Projects allow users to ground Claude's outputs in their own internal knowledge and documents* Projects can be customized with instructions to tailor Claude's responses for specific tasks or perspectives* "Artifacts" feature allows users to see and interact with content generated by Claude alongside the conversation* Claude Team users can share their best conversations with Claude to inspire and uplevel the whole team* North Highland consultancy has seen 5x faster content creation and analysis using Claude* Anthropic is committed to user privacy and will not use shared data to train models without consent* Future plans include more integrations to bring in external knowledge sources for Claude* OpenAI voice mode update - not until Fall* AI Art & Diffusion & 3D* Fal open sourced AuraSR - a 600M upscaler based on GigaGAN (X, Fal)* Interview with Ethan Sutin from Bee Computer* We all got Bees as a gifts* AI Wearable that extracts TODOs, knows facts, etc'* This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
Hey, this is Alex. Don't you just love when assumptions about LLMs hitting a wall just get shattered left and right and we get new incredible tools released that leapfrog previous state of the art models, that we barely got used to, from just a few months ago? I SURE DO! Today is one such day, this week was already busy enough, I had a whole 2 hour show packed with releases, and then Anthropic decided to give me a reason to use the #breakingNews button (the one that does the news show like sound on the live show, you should join next time!) and announced Claude Sonnet 3.5 which is their best model, beating Opus while being 2x faster and 5x cheaper! (also beating GPT-4o and Turbo, so... new king! For how long? ¯\_(ツ)_/¯)Critics are already raving, it's been half a day and they are raving! Ok, let's get to the TL;DR and then dive into Claude 3.5 and a few other incredible things that happened this week in AI! 👇 TL;DR of all topics covered: * Open Source LLMs * NVIDIA - Nemotron 340B - Base, Instruct and Reward model (X)* DeepSeek coder V2 (230B MoE, 16B) (X, HF)* Meta FAIR - Chameleon MMIO models (X)* HF + BigCodeProject are deprecating HumanEval with BigCodeBench (X, Bench)* NousResearch - Hermes 2 LLama3 Theta 70B - GPT-4 level OSS on MT-Bench (X, HF)* Big CO LLMs + APIs* Gemini Context Caching is available * Anthropic releases Sonnet 3.5 - beating GPT-4o (X, Claude.ai)* Ilya Sutskever starting SSI.inc - safe super intelligence (X)* Nvidia is the biggest company in the world by market cap* This weeks Buzz * Alex in SF next week for AIQCon, AI Engineer. ThursdAI will be sporadic but will happen!* W&B Weave now has support for tokens and cost + Anthropic SDK out of the box (Weave Docs)* Vision & Video* Microsoft open sources Florence 230M & 800M Vision Models (X, HF)* Runway Gen-3 - (t2v, i2v, v2v) Video Model (X)* Voice & Audio* Google Deepmind teases V2A video-to-audio model (Blog)* AI Art & Diffusion & 3D* Flash Diffusion for SD3 is out - Stable Diffusion 3 in 4 steps! (X)ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.🦀 New king of LLMs in town - Claude 3.5 Sonnet 👑 Ok so first things first, Claude Sonnet, the previously forgotten middle child of the Claude 3 family, has now received a brain upgrade! Achieving incredible performance on many benchmarks, this new model is 5 times cheaper than Opus at $3/1Mtok on input and $15/1Mtok on output. It's also competitive against GPT-4o and turbo on the standard benchmarks, achieving incredible scores on MMLU, HumanEval etc', but we know that those are already behind us. 
Sonnet 3.5, aka Claw'd (which is a great marketing push by the Anthropic folks, I love to see it), is beating all other models on Aider.chat code editing leaderboard, winning on the new livebench.ai leaderboard and is getting top scores on MixEval Hard, which has 96% correlation with LMsys arena.While benchmarks are great and all, real folks are reporting real findings of their own, here's what Friend of the Pod Pietro Skirano had to say after playing with it: there's like a lot of things that I saw that I had never seen before in terms of like creativity and like how much of the model, you know, actually put some of his own understanding into your request-@SkiranoWhat's notable a capability boost is this quote from the Anthropic release blog: In an internal agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems, outperforming Claude 3 Opus which solved 38%. One detail that Alex Albert from Anthropic pointed out from this released was, that on GPQA (Graduate-Level Google-Proof Q&A) Benchmark, they achieved a 67% with various prompting techniques, beating PHD experts in respective fields in this benchmarks that average 65% on this. This... this is crazyBeyond just the benchmarks This to me is a ridiculous jump because Opus was just so so good already, and Sonnet 3.5 is jumping over it with agentic solving capabilities, and also vision capabilities. Anthropic also announced that vision wise, Claw'd is significantly better than Opus at vision tasks (which, again, Opus was already great at!) and lastly, Claw'd now has a great recent cutoff time, it knows about events that happened in February 2024! Additionally, claude.ai got a new capability which significantly improves the use of Claude, which they call artifacts. It needs to be turned on in settings, and then Claude will have access to files, and will show you in an aside, rendered HTML, SVG files, Markdown docs, and a bunch more stuff, and it'll be able to reference different files it creates, to create assets and then a game with these assets for example! 1 Ilya x 2 Daniels to build Safe SuperIntelligence Ilya Sutskever, Co-founder and failed board Coup participant (leader?) at OpenAI, has resurfaced after a long time of people wondering "where's Ilya" with one hell of an announcement. The company is called SSI of Safe Super Intelligence, and he's cofounding it with Daniel Levy (prev OpenAI, PHD Stanford) and Daniel Gross (AI @ Apple, AIgrant, AI Investor). The only mandate of this company is apparently to have a straight shot at safe super-intelligence, skipping AGI, which is no longer the buzzword (Ilya is famous for the "feel the AGI" chant within OpenAI) Notable also that the company will be split between Palo Alto and Tel Aviv, where they have the ability to hire top talent into a "cracked team of researchers"Our singular focus means no distraction by management overhead or product cyclesGood luck to these folks! Open Source LLMs DeepSeek coder V2 (230B MoE, 16B) (X, HF)The folks at DeepSeek are not shy about their results, and until the Sonnet release above, have released a 230B MoE model that beats GPT4-Turbo at Coding and Math! With a great new 128K context window and an incredible open license (you can use this in production!) this model is the best open source coder in town, getting to number 3 on aider code editing and number 2 on BigCodeBench (which is a new Benchmark we covered on the pod with the maintainer, definitely worth a listen. 
HumanEval is old and getting irrelevant) Notable also that DeepSeek has launched an API service that seems to be so competitively priced that it doesn't make sense to use anything else, with $0.14/$0.28 I/O per Million Tokens, it's a whopping 42 times cheaper than Claw'd 3.5! Support of 338 programming languages, it should also run super quick given it's MoE architecture, the bigger model is only 21B active parameters which scales amazing on CPUs. They also released a tiny 16B MoE model called Lite-instruct and it's 2.4B active params. This weeks Buzz (What I learned with WandB this week)Folks, in a week, I'm going to go up on stage in front of tons of AI Engineers wearing a costume, and... it's going to be epic! I finished writing my talk, now I'm practicing and I'm very excited. If you're there, please join the Evals track 🙂 Also in W&B this week, coinciding with Claw'd release, we've added a native integration with the Anthropic Python SDK which now means that all you need to do to track your LLM calls with Claw'd is pip install weave and import weave and weave.init('your project name' THAT'S IT! and you get this amazing dashboard with usage tracking for all your Claw'd calls for free, it's really crazy easy, give it a try! Vision & Video Runway Gen-3 - SORA like video model announced (X, blog)Runway, you know the company who everyone was "sorry for" when SORA was announced by OpenAI, is not sitting around waiting to "be killed" and is announcing Gen-3, an incredible video model capable of realistic video generations, physics understanding, and a lot lot more. The videos took over my timeline, and this looks to my eyes better than KLING and better than Luma Dream Machine from last week, by quite a lot! Not to mention that Runway has been in video production for way longer than most, so they have other tools that work with this model, like motion brush, lip syncing, temporal controls and many more, that allow you to be the director of the exactly the right scene. Google Deepmind video-to-audio (X)You're going to need to turn your sound on for this one! Google has released a tease of a new model of theirs that can be paired amazingly well with the above type generative video models (of which Google also has one, that they've teased and it's coming bla bla bla) This one, watches your video and provides acoustic sound fitting the scene, with on-sceen action sound! They showed a few examples and honestly they look so good, a drummer playing drums and that model generated the drums sounds etc' 👏 Will we ever see this as a product from google though? Nobody knows! Microsoft releases tiny (0.23B, 0.77B) Vision Models Florence (X, HF, Try It)This one is a very exciting release because it's MIT licensed, and TINY! Less than 1 Billion parameters, meaning it can completely run on device, it's a vision model, that beats MUCH bigger vision models by a significant amount on tasks like OCR, segmentation, object detection, image captioning and more! They have leveraged (and supposedly going to release) a FLD-5B dataset, and they have specifically made this model to be fine-tunable across these tasks, which is exciting because open source vision models are going to significantly benefit from this release almost immediately. Just look at this hand written OCR capability! Stellar! 
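To spell out that Weave + Anthropic integration in code, here is roughly what the "pip install weave, import weave, weave.init" flow described above looks like end to end. The project name is arbitrary and the model ID is the Sonnet 3.5 identifier as I understand it at the time of writing, so double check it against Anthropic's docs.

```python
# pip install weave anthropic
import weave
import anthropic

weave.init("clawd-experiments")  # any project name; calls then get traced automatically

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # Sonnet 3.5 model ID at time of writing
    max_tokens=512,
    messages=[{"role": "user", "content": "Write a haiku about token usage dashboards."}],
)
print(message.content[0].text)
# Token counts, latency and cost for this call now show up in the Weave dashboard.
```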
NousResearch - Hermes 2 Theta 70B - inching over GPT-4 on MT-Bench

Teknium and the Nous Research crew have released a new model just to mess with me. You see, the live show was already recorded and edited, the file exported, the TL;DR written, and the newsletter draft almost ready to submit, and then I checked the Green Room (the DM group for all friends of the pod for ThursdAI, it's really an awesome Group Chat) and Teknium dropped that they've beat GPT-4 (unsure which version) on MT-Bench with a finetune an
Happy Apple AI week everyone (well, those of us who celebrate, some don't) as this week we finally got told what Apple is planning to do with this whole generative AI wave and presented Apple Intelligence (which is AI, get it? they are trying to rebrand AI!)

This week's pod and newsletter main focus will be Apple Intelligence of course, as it was for most people, judging by how the market reacted ($AAPL grew over $360B in a few days after this announcement) and how many people watched each live stream (10M at the time of this writing watched the WWDC keynote on youtube, compared to 4.5M for the OpenAI GPT-4o event and 1.8M for Google IO). On the pod we also geeked out on new eval frameworks and benchmarks, including a chat with the authors of MixEval which I wrote about last week, and a new benchmark called LiveBench from Abacus and Yann LeCun. Plus a new video model from Luma and finally SD3, let's go! 👇

TL;DR of all topics covered:
* Apple WWDC recap and Apple Intelligence (X)
* This Weeks Buzz
* AI Engineer expo in SF (June 25-27) come see my talk, it's going to be Epic (X, Schedule)
* Open Source LLMs
* Microsoft Samba - 3.8B MAMBA + Sliding Window Attention beating Phi 3 (X, Paper)
* Sakana AI releases LLM squared - LLMs coming up with preference algorithms to train better LLMs (X, Blog)
* Abacus + Yann LeCun release LiveBench.AI - impossible to game benchmark (X, Bench)
* Interview with MixEval folks about achieving 96% arena accuracy at 5000x lower cost
* Big CO LLMs + APIs
* Mistral announced a 600M series B round
* Revenue at OpenAI DOUBLED in the last 6 months and is now at $3.4B annualized (source)
* Elon drops lawsuit vs OpenAI
* Vision & Video
* Luma drops DreamMachine - SORA like short video generation in free access (X, TRY IT)
* AI Art & Diffusion & 3D
* Stable Diffusion Medium weights are here (X, HF, FAL)
* Tools
* Google releases GenType - create an alphabet with diffusion Models (X, Try It)

Apple Intelligence

Technical LLM details

Let's dive right into what wasn't shown in the keynote: in a 6 minute deep dive video from the State of the Union for developers and in a follow up post on their machine learning blog, Apple shared some very exciting technical details about their on device models and orchestration that will become Apple Intelligence. Namely, on device they have trained a bespoke 3B parameter LLM, which was trained on licensed data, and uses a bunch of very cutting edge modern techniques to achieve quite incredible on device performance: stuff like GQA, speculative decoding, and a very unique type of quantization (which they claim is almost lossless).

To maintain model quality, we developed a new framework using LoRA adapters that incorporates a mixed 2-bit and 4-bit configuration strategy — averaging 3.5 bits-per-weight — to achieve the same accuracy as the uncompressed models [...] on iPhone 15 Pro we are able to reach time-to-first-token latency of about 0.6 millisecond per prompt token, and a generation rate of 30 tokens per second

These small models (they also have a bespoke image diffusion model as well) are going to be finetuned with a lot of LoRA adapters for specific tasks like Summarization, Query handling, Mail replies, Urgency and more, which gives their foundational model the ability to specialize itself on the fly to the task at hand, and be cached in memory as well for optimal performance.
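Apple's stack is proprietary, but the per-task adapter idea above maps onto open tooling pretty directly. Here is a rough open-source analogue using Hugging Face PEFT, where one small base model carries several task-specific LoRA adapters that get swapped in on the fly; the base model choice and the adapter repos are placeholders, and the swap semantics are worth verifying against the PEFT docs. For scale, the quoted 3.5 bits-per-weight average puts a 3B parameter model at roughly 3e9 x 3.5 / 8 ≈ 1.3 GB of weights, which is what makes keeping it resident on a phone plausible.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "microsoft/Phi-3-mini-4k-instruct"  # placeholder: any small instruct model
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id)

# Attach several task-specific LoRA adapters to the same frozen base weights.
# The adapter repo names below are hypothetical.
model = PeftModel.from_pretrained(base, "my-org/summarize-lora", adapter_name="summarize")
model.load_adapter("my-org/mail-reply-lora", adapter_name="mail_reply")

def run(task: str, prompt: str) -> str:
    model.set_adapter(task)  # swap the active adapter; base weights stay cached
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(run("summarize", "Summarize: the meeting moved to Tuesday at 3pm, bring the Q3 numbers."))
```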
Personal and Private (including in the cloud) While these models are small, they will also benefit from 2 more things on device, a vector store of your stuff (contacts, recent chats, calendar, photos) they call semantic index and a new thing apple is calling App Intents, which developers can expose (and the OS apps already do) that will allows the LLM to use tools like moving files, extracting data across apps, and do actions, this already makes the AI much more personal and helpful as it has in its context things about me and what my apps can do on my phone. Handoff to the Private Cloud (and then to OpenAI)What the local 3B LLM + context can't do, it'll hand off to the cloud, in what Apple claims is a very secure way, called Private Cloud, in which they will create a new inference techniques in the cloud, on Apple Silicon, with Secure Enclave and Secure Boot, ensuring that the LLM sessions that run inference on your data are never stored, and even Apple can't access those sessions, not to mention train their LLMs on your data. Here are some benchmarks Apple posted for their On-Device 3B model and unknown size server model comparing it to GPT-4-Turbo (not 4o!) on unnamed benchmarks they came up with. In cases where Apple Intelligence cannot help you with a request (I'm still unclear when this actually would happen) IOS will now show you this dialog, suggesting you use chatGPT from OpenAI, marking a deal with OpenAI (in which apparently nobody pays nobody, so neither Apple is getting paid by OpenAI to be placed there, nor does Apple pay OpenAI for the additional compute, tokens, and inference) Implementations across the OSSo what will people be able to actually do with this intelligence? I'm sure that Apple will add much more in the next versions of iOS, but at least for now, Siri is getting an LLM brain transplant and is going to be much more smarter and capable, from understanding natural speech better (and just, having better ears, the on device speech to text is improved and is really good now in IOS 18 beta) to being able to use app intents to do actions for you across several apps. Other features across the OS will use Apple Intelligence to prioritize your notifications, and also summarize group chats that are going off, and have built in tools for rewriting, summarizing, and turning any text anywhere into anything else. Basically think of many of the tasks you'd use chatGPT for, are now built into the OS level itself for free. Apple is also adding AI Art diffusion features like GenMoji (the ability to generate any emoji you can think of, like chefs kiss, or a seal with a clown nose) and while this sounds absurd, I've never been in a slack or a discord that didn't have their own unique custom emojis uploaded by their members. And one last feature I'll highlight is this Image Playground, Apple's take on generating images, which is not only just text, but a contextual understanding of your conversation, and let's you create with autosuggested concepts instead of just text prompts and is going to be available to all developers to bake into their apps. Elon is SALTY - and it's not because of privacyI wasn't sure if to include this segment, but in what became my most viral tweet since the beginning of this year, I posted about Elon muddying the water about what Apple actually announced, and called it a Psyop that worked. 
Many MSMs and definitely the narrative on X, turned into what Elon thinks about those announcements, rather than the announcements themselves and just look at this insane reach.We've covered Elon vs OpenAI before (a lawsuit that he actually withdrew this week, because emails came out showing he knew and was ok with OpenAI not being Open) and so it's no surprise that when Apple decided to partner with OpenAI and not say... XAI, Elon would promote absolutely incorrect and ignorant takes to take over the radio waves like he will ban apple devices from all his companies, or that OpenAI will get access to train on your iPhone data. This weeks BUZZ (Weights & Biases Update) Hey, if you're reading this, it's very likely that you've already registered or at least heard of ai.engineer and if you haven't, well I'm delighted to tell you, that we're sponsoring this awesome event in San Francisco June 25-27. Not only are we official sponsors, both Lukas (the Co-Founder and CEO) and I will be there giving talks (mine will likely be crazier than his) and we'll have a booth there, so if your'e coming, make sure to come by my talk (or Lukas's if you're a VP and are signed up for that exclusive track) Everyone in our corder of the world is going to be there, Swyx told me that many of the foundational models labs are coming, OpenAI, Anthropic, Google, and there's going to be tons of tracks (My talk is of course in the Evals track, come, really, I might embarrass myself on stage to eternity you don't want to miss this) Swyx kindly provided listeners and readers of ThursdAI with a special coupon feeltheagi so even more of a reason to try and convince your boss and come see me on stage in a costume (I've said too much!) Vision & VideoLuma drops DreamMachine - SORA like short video generation in free access (X, TRY IT)In an absolute surprise, Luma AI, a company that (used to) specialize in crafting 3D models, has released a free access video model similar to SORA, and Kling (which we covered last week) that generates 5 second videos (and doesn't require a chinese phone # haha) It's free to try, and supports text to video, image to video, cinematic prompt instructions, great and cohesive narrative following, character consistency and a lot more. Here's a comparison of the famous SORA videos and LDM (Luma Dream Machine) videos that I was provided on X by a AmebaGPT, however, worth noting that these are cherry picked SORA videos while LDM is likely a much smaller and quicker model and that folks are creating some incredible things already! AI Art & Diffusion & 3D Stable Diffusion Medium weights are here (X, HF, FAL)It's finally here (well, I'm using finally carefully here, and really hoping that this isn't the last thing Stability AI releases) ,the weights for Stable Diffusion 3 are available on HuggingFace! SD3 offers improved photorealism and awesome prompt adherence, like asking for multiple subjects doing multiple things. It's also pretty good at typography and fairly resource efficient compared to previuos versions, though I'm still waiting for the super turbo distilled versions that will likely come soon! ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support
Hey hey! This is Alex! 👋 Some podcasts have 1 or maaaybe 2 guests an episode, we had 6! guests today, each has had an announcement, an open source release, or a breaking news story that we've covered! (PS, this edition is very multimodal so click into the Substack as videos don't play in your inbox)As you know my favorite thing is to host the folks who make the news to let them do their own announcements, but also, hitting that BREAKING NEWS button when something is actually breaking (as in, happened just before or during the show) and I've actually used it 3 times this show! It's not every week that we get to announce a NEW SOTA open model with the team that worked on it. Junyang (Justin) Lin from Qwen is a friend of the pod, a frequent co-host, and today gave us the breaking news of this month, as Qwen2 72B, is beating LLama-3 70B on most benchmarks! That's right, a new state of the art open LLM was announced on the show, and Justin went deep into details 👏 (so don't miss this conversation, listen to wherever you get your podcasts) We also chatted about SOTA multimodal embeddings with Jina folks (Bo Wand and Han Xiao) and Zach from Nomic, dove into an open source compute grant with FALs Batuhan Taskaya and much more! TL;DR of all topics covered: * Open Source LLMs * Alibaba announces Qwen 2 - 5 model suite (X, HF)* Jina announces Jina-Clip V1 - multimodal embeddings beating CLIP from OAI (X, Blog, Web Demo)* Nomic announces Nomic-Embed-Vision (X, BLOG)* MixEval - arena style rankings with Chatbot Arena model rankings with 2000× less time (5 minutes) and 5000× less cost ($0.6) (X, Blog)* Vision & Video* Kling - open access video model SORA competitor from China (X)* This Weeks Buzz * WandB supports Mistral new finetuning service (X)* Register to my June 12 workshop on building Evals with Weave HERE* Voice & Audio* StableAudio Open - X, BLOG, TRY IT* Suno launches "upload your audio" feature to select few - X * Udio - upload your own audio feature - X* AI Art & Diffusion & 3D* Stable Diffusion 3 weights are coming on June 12th (Blog)* JasperAI releases Flash Diffusion (X, TRY IT, Blog)* Big CO LLMs + APIs* Group of ex-OpenAI sign a new letter - righttowarn.ai * A hacker releases TotalRecall - a tool to extract all the info from MS Recall Feature (Github)Open Source LLMs QWEN 2 - new SOTA open model from Alibaba (X, HF)This is definitely the biggest news for this week, as the folks at Alibaba released a very surprising and super high quality suite of models, spanning from a tiny 0.5B model to a new leader in open models, Qwen 2 72B To add to the distance from Llama-3, these new models support a wide range of context length, all large, with 7B and 72B support up to 128K context. Justin mentioned on stage that actually finding sequences of longer context lengths is challenging, and this is why they are only at 128K.In terms of advancements, the highlight is advanced Code and Math capabilities, which are likely to contribute to overall model advancements across other benchmarks as well. It's also important to note that all models (besides the 72B) are now released with Apache 2 license to help folks actually use globally, and speaking of globality, these models have been natively trained with 27 additional languages, making them considerably better at multilingual prompts! One additional amazing thing was, that a finetune was released by Eric Hartford and Cognitive Computations team, and AFAIK this is the first time a new model drops with an external finetune. 
Justin literally said: "It is quite amazing. I don't know how they did that. Well, our teammates don't know how they did that, but, uh, it is really amazing when they use the Dolphin dataset to train it." Here are the Dolphin finetune metrics, and you can try it out here.

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Jina-Clip V1 and Nomic-Embed-Vision SOTA multimodal embeddings

It's quite remarkable that we got 2 separate SOTAs of a similar thing during the same week, and even more cool that both companies came to talk about it on ThursdAI! First we welcomed back Bo Wang from Jina (who was joined by Han Xiao, the CEO), and Bo talked about multimodal embeddings that beat OpenAI CLIP (which both conceded was a very low bar). Jina Clip V1 is Apache 2 open sourced, while Nomic Embed is beating it on benchmarks but is CC-BY-NC non-commercially licensed; in most cases though, if you're embedding, you'd likely use an API, and both companies offer these embeddings via their respective APIs. One thing to note about Nomic is that they have mentioned that these new embeddings are backwards compatible with the awesome Nomic Embed endpoints and embeddings, so if you've used those, now you've gone multimodal! Because these models are fairly small, there are now web versions, thanks to transformers.js, of Jina and Nomic Embed (caution, this will download large-ish files), built by none other than our friend Xenova. If you're building any type of multimodal semantic search, these two embedding systems now open up all your RAG needs for multimodal data!

This weeks Buzz (What I learned with WandB this week)

Mistral announced built-in finetuning server support, and it has a simple WandB integration! (X) Also, my workshop about building evals 101 is coming up next week, June 12; I'm excited to share with you a workshop that we wrote for an in-person crowd, please register here and hope to see you next week!

Vision & Video

New SORA like video generation model called KLING in open access (DEMO)

This one has to be seen to be believed. Out of nowhere, an obscure (to us) Chinese company, kuaishou.com, dropped a landing page with tons of videos that are clearly AI generated, and they all look very close to SORA quality, way surpassing everything else we've seen in this category (Runway, Pika, SVD). And they claim that they offer access to it via their app (but apparently you need a Chinese phone number, so not for me). It's really hard to believe that this quality exists already outside of a frontier lab full of GPUs like OpenAI, and it's now in waitlist mode, whereas SORA is "coming soon".

Voice & Audio

Stability open sources Stable Audio Open (X, BLOG, TRY IT)

A new open model from Stability is always fun, and while we wait for SD3 to drop weights (June 12! we finally have a date) we get this awesome model from the Dadabots team at Stability. It's able to generate 47 seconds of music, and is awesome at generating loops, drums and other non vocal stuff, so not quite where Suno/Udio are, but the samples are very clean and sound very good.
Prompt: New York Subway

They focused the model on being able to be finetuned on a specific drummer's style, for example, and on being open and specialized in samples and sound effects rather than melodies or finalized full songs, but it has some decent skills with simple prompts like "progressive house music". This model has a non commercial license and can be played with here.

Suno & Udio let users upload their own audio!

This one is big, so big in fact, that I am very surprised that both companies announced this exact feature the same week. Suno has reached out to me and a bunch of other creators, and told us that we are now able to upload our own clips, be it someone playing solo guitar, or even whistling, and have Suno remix it into a real proper song. In this example, this is a very viral video, this guy sings at a market selling fish (to ladies?) and Suno was able to create this remix for me, with the drop, the changes in his voice, the melody, everything, it's quite remarkable!

AI Art & Diffusion

Flash Diffusion from JasperAI / Clipdrop team (X, TRY IT, Blog, Paper)

Last but definitely not least, we now have a banger of a diffusion update, from the Clipdrop team (who was doing amazing things before Stability bought them and then sold them to JasperAI). Diffusion models like Stable Diffusion often take 30-40 inference steps to get you the image, searching for your prompt through latent space, you know? Well, recently there have been tons of these new distillation methods: models that are like students, who learn from the teacher model (Stable Diffusion XL for example) and distill the same capability down to a few steps (sometimes as low as 2!). Often the results are distilled models that can run in real time, like SDXL Turbo, Lightning SDXL etc. Now Flash Diffusion achieves State-of-the-Art (SOTA) performance metrics, specifically in terms of Fréchet Inception Distance (FID) and CLIP Score. These metrics are the default for evaluating the quality and relevance of generated images. And Jasper has open sourced the whole training code to allow for reproducibility, which is very welcome! Flash Diffusion also comes not only in image generation, but also in inpainting and upscaling flavors, allowing it to be applied to other methods to speed them up as well.

— This is all for this week. I mean, there's TONS more stuff we could have covered, and we did mention it on the pod, but I aim to serve as a filter to the most interesting things as well, so, until next week 🫡

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
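One more footnote on the multimodal embeddings covered earlier in this episode (Jina-CLIP V1 and Nomic Embed Vision): the usage pattern is the same for both, embed text and images into one space and rank by cosine similarity. Below is a hedged sketch with Jina-CLIP; the encode_text / encode_image helpers follow my reading of the model card's remote-code interface, so verify the exact method names there, and the image path is a placeholder.

```python
import numpy as np
from transformers import AutoModel

# trust_remote_code pulls in Jina's custom encode helpers (check the model card).
model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

texts = ["a photo of a golden retriever", "a bowl of ramen", "the New York subway"]
text_emb = np.asarray(model.encode_text(texts))
image_emb = np.asarray(model.encode_image(["dog.jpg"]))  # placeholder local path or URL

# Cosine similarity = normalized dot product; the highest score wins the retrieval.
def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

scores = normalize(image_emb) @ normalize(text_emb).T
print(texts[int(scores.argmax())])
```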
Hey everyone, Alex here! Can you believe it's already end of May? And that 2 huge AI companies conferences are behind us (Google IO, MSFT Build) and Apple's WWDC is just ahead in 10 days! Exciting! I was really looking forward to today's show, had quite a few guests today, I'll add all their socials below the TL;DR so please give them a follow and if you're only in reading mode of the newsletter, why don't you give the podcast a try 🙂 It's impossible for me to add the density of knowledge that's being shared on stage for 2 hours here in the newsletter! Also, before we dive in, I’m hosting a free workshop soon, about building evaluations from scratch, if you’re building anything with LLMs in production, more than welcome to join us on June 12th (it’ll be virtual)TL;DR of all topics covered: * Open Source LLMs * Mistral open weights Codestral - 22B dense coding model (X, Blog)* Nvidia open sources NV-Embed-v1 - Mistral based SOTA embeddings (X, HF)* HuggingFace Chat with tool support (X, demo)* Aider beats SOTA on Swe-Bench with 26% (X, Blog, Github)* OpenChat - Sota finetune of Llama3 (X, HF, Try It)* LLM 360 - K2 65B - fully transparent and reproducible (X, Paper, HF, WandB)* Big CO LLMs + APIs* Scale announces SEAL Leaderboards - with private Evals (X, leaderboard)* SambaNova achieves >1000T/s on Llama-3 full precision* Groq hits back with breaking 1200T/s on Llama-3* Anthropic tool support in GA (X, Blogpost)* OpenAI adds GPT4o, Web Search, Vision, Code Interpreter & more to free users (X)* Google Gemini & Gemini Flash are topping the evals leaderboards, in GA(X)* Gemini Flash finetuning coming soon* This weeks Buzz (What I learned at WandB this week)* Sponsored a Mistral hackathon in Paris* We have an upcoming workshop in 2 parts - come learn with me* Vision & Video* LLama3-V - Sota OSS VLM (X, Github)* Voice & Audio* Cartesia AI - super fast SSM based TTS with very good sounding voices (X, Demo)* Tools & Hardware* Jina Reader (https://jina.ai/reader/) * Co-Hosts and Guests* Rodrigo Liang (@RodrigoLiang) & Anton McGonnell (@aton2006) from SambaNova* Itamar Friedman (@itamar_mar) Codium* Arjun Desai (@jundesai) - Cartesia* Nisten Tahiraj (@nisten) - Cohost* Wolfram Ravenwolf (@WolframRvnwlf)* Eric Hartford (@erhartford)* Maziyar Panahi (@MaziyarPanahi)Scale SEAL leaderboards (Leaderboard)Scale AI has announced their new initiative, called SEAL leaderboards, which aims to provide yet another point of reference in how we understand frontier models and their performance against each other. We've of course been sharing LMSys arena rankings here, and openLLM leaderboard from HuggingFace, however, there are issues with both these approaches, and Scale is approaching the measuring in a different way, focusing on very private benchmarks and dataset curated by their experts (Like Riley Goodside) The focus of SEAL is private and novel assessments across Coding, Instruction Following, Math, Spanish and more, and the main reason they keep this private, is so that models won't be able to train on these benchmarks if they leak to the web, and thus show better performance due to data contamination. They are also using ELO scores (Bradley-Terry) and I love this footnote from the actual website: "To ensure leaderboard integrity, we require that models can only be featured the FIRST TIME when an organization encounters the prompts"This means they are taking the contamination thing very seriously and it's great to see such dedication to being a trusted source in this space. 
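Since both SEAL and LMSys lean on Bradley-Terry style scores, here is a toy sketch of how such ratings can be fit from pairwise battle outcomes. The battle counts below are made up purely for illustration, and this is a simplified fixed-point fit, not Scale's or LMSys's actual pipeline.

```python
# Toy Bradley-Terry fit from pairwise "battles" (made-up counts, for illustration only).
import math
from collections import defaultdict

battles = (
    [("gpt-4o", "llama-3-70b")] * 60 + [("llama-3-70b", "gpt-4o")] * 40
    + [("gpt-4o", "claude-sonnet")] * 55 + [("claude-sonnet", "gpt-4o")] * 45
    + [("claude-sonnet", "llama-3-70b")] * 52 + [("llama-3-70b", "claude-sonnet")] * 48
)

models = sorted({m for pair in battles for m in pair})
wins = defaultdict(float)     # total wins per model
games = defaultdict(float)    # games played per unordered pair
for winner, loser in battles:
    wins[winner] += 1
    games[frozenset((winner, loser))] += 1

strength = {m: 1.0 for m in models}
for _ in range(200):          # simple fixed-point (Zermelo) iteration
    for m in models:
        denom = sum(
            games[frozenset((m, o))] / (strength[m] + strength[o])
            for o in models if o != m
        )
        strength[m] = wins[m] / denom if denom else strength[m]
    total = sum(strength.values())            # keep strengths on a fixed scale
    strength = {m: s / total for m, s in strength.items()}

# Convert to an Elo-like scale (the 1000 anchor is arbitrary)
anchor = strength[models[0]]
elo = {m: 1000 + 400 * math.log10(s / anchor) for m, s in strength.items()}
print(sorted(elo.items(), key=lambda kv: -kv[1]))
```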
It's also specifically interesting that on their benchmarks, GPT-4o is not better than Turbo at coding, and definitely not by 100 points like LMSys and OpenAI announced when they released it!

Gemini 1.5 Flash (and Pro) in GA and showing impressive performance

As you may remember from my Google IO recap, I was really impressed with Gemini Flash, and I felt that it went under the radar for many folks. Given its throughput speed, 1M context window, multimodality and price tier, I strongly believed that Google was onto something here. Well, this week not only was I proven right, I didn't actually realize how right I was 🙂 as we heard breaking news from Logan Kilpatrick during the show: the models are now in GA, Gemini Flash gets upgraded to 1000 RPM (requests per minute), and finetuning is coming and will be free of charge! Not only will the finetuning itself cost you nothing, inference on your tuned model is going to cost the same, which is very impressive.

There was a sneaky adjustment from the announced pricing to the GA pricing that upped output tokens by 2x, but even despite that, Gemini Flash at $0.35/1MTok for input and $1.05/1MTok for output is probably the best deal there is right now for LLMs of this level. This week it was also confirmed, both on LMsys and on the Scale SEAL leaderboards, that Gemini Flash is a very good coding LLM, beating Claude Sonnet and LLama-3 70B!

SambaNova + Groq competing at 1000T/s speeds

What a week for inference speeds! SambaNova (an AI startup founded in 2017 with $1.1B in investment from Google Ventures, Intel Capital, Samsung and Softbank) has announced that they broke the 1000T/s inference barrier on Llama-3-8B in full precision mode, using their custom hardware called the RDU (reconfigurable dataflow unit).

As you can see, this is incredibly fast, really, try it yourself here. Seeing this, the folks at Groq, who held the previous record for super fast inference (as I reported back in February), decided not to let this slide, and released an incredible 20% improvement on their own inference of LLama-3-8B, getting to 1200T/s, showing that they are very competitive. This bump in throughput is really significant: many inference providers that use GPUs aren't even hitting 200T/s, and Groq improved their inference by that amount within 1 day of being challenged. I had the awesome pleasure of having Rodrigo, the CEO, on the show this week to chat about SambaNova, this incredible achievement, their ability to run this in full precision, and future plans, so definitely give it a listen.

This weeks Buzz (What I learned with WandB this week)

This week was buzzing at Weights & Biases! After co-hosting a hackathon with Meta a few weeks ago, we co-hosted another hackathon, this time with Mistral, in Paris (where we also announced our new integration with their finetuning!). The organizers, Cerebral Valley, invited us to participate, and it was amazing to see the many projects that used WandB and Weave in their finetuning presentations, including friend of the pod Maziyar Panahi, whose team nabbed 2nd place (you can read about their project here) 👏

Also, I'm going to do a virtual workshop together with my colleague Anish about prompting and building evals, something we know a thing or two about. It's free and I would very much love to invite you to register and learn with us!

Cartesia AI (try it)

Hot off the press, we're getting a new audio TTS model, based on the State Space Model architecture (remember Mamba?)
from a new startup called Cartesia AI, who aim to bring real-time intelligence to on-device compute! The most astonishing thing they released was actually the speed with which their model starts to generate voices, under 150ms, which is effectively instant, and it's a joy to play with their playground. Just look at how fast it started generating this intro I recorded using their awesome 1920's radio host voice.

Co-founded by Albert Gu, Karan Goel and Arjun Desai (who joined the pod this week), they have shown incredible performance, but also showed that transformer-alternative architectures like SSMs can really be beneficial for audio specifically. Just look at this quote:

On speech, a parameter-matched and optimized Sonic model trained on the same data as a widely used Transformer improves audio quality significantly (20% lower perplexity, 2x lower word error, 1 point higher NISQA quality). With lower latency (1.5x lower time-to-first-audio), faster inference speed (2x lower real-time factor) and higher throughput (4x)

In Open Source news: Mistral released Codestral 22B - their flagship code model with a new non-commercial license

Codestral is now available under the new Mistral license for non-commercial R&D use. With a larger context window of 32K, Codestral outperforms all other models on RepoBench, a long-range evaluation for code generation. Its fill-in-the-middle capability compares favorably to DeepSeek Coder 33B. Codestral is supported in VSCode via a plugin and is accessible through their API, La Plateforme, and Le Chat.

HuggingFace Chat with tool support (X, demo)

This one is really cool, HF added Cohere's Command R+ with tool support, and the tools use other HF spaces (with ZeroGPU) to add capabilities like image generation, image editing, web search and more!

LLM 360 - K2 65B - fully transparent and reproducible (X, Paper, HF, WandB)

The awesome team at LLM360 released K2 65B, an open source model that comes very close to LLama 70B on benchmarks, but the most important thing is that they open source everything: from code, to datasets, to technical write-ups, they even open sourced their WandB plots 👏 This is so important to the open source community that we must highlight and acknowledge the awesome effort from the LLM360 team of doing as much open source as possible!

Tools - Jina reader

In the tools category, while we haven't discussed this on the pod, I really wanted to highlight Jina Reader. We've had Bo from Jina AI talk to us about embeddings in past episodes, and since then the Jina folks released this awesome tool that's able to take any URL and parse it into a nice markdown format that's very digestible to LLMs. You can pass any URL, and it even does vision understanding! And today they released PDF understanding as well, so you can pass the reader PDF files and have it return nicely formatted text! The best part, it's free! (for now at least!) There's a quick usage sketch right below.

And that's a wrap for today, see you g
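One more note on Jina Reader before moving on: the usage pattern, as I understand it from their docs, is simply prefixing the URL you want with their reader endpoint. A minimal sketch (double check https://jina.ai/reader/ for the current endpoint, options and rate limits):

```python
# Minimal sketch: turn any URL into LLM-friendly markdown with Jina Reader.
# Assumes the public r.jina.ai prefix endpoint described in Jina's docs.
import requests

def read_as_markdown(url: str) -> str:
    # Prefixing the target URL with r.jina.ai returns a cleaned markdown rendering
    resp = requests.get(f"https://r.jina.ai/{url}", timeout=60)
    resp.raise_for_status()
    return resp.text

markdown = read_as_markdown("https://thursdai.news")
print(markdown[:500])
```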
Hello hello everyone, this is Alex, typing these words from beautiful Seattle (really, it only rained once while I was here!) where I'm attending Microsoft biggest developer conference BUILD. This week we saw OpenAI get in the news from multiple angles, none of them positive and Microsoft clapped back at Google from last week with tons of new AI product announcements (CoPilot vs Gemini) and a few new PCs with NPU (Neural Processing Chips) that run alongside CPU/GPU combo we're familiar with. Those NPUs allow for local AI to run on these devices, making them AI native devices! While I'm here I also had the pleasure to participate in the original AI tinkerers thanks to my friend Joe Heitzberg who operates and runs the aitinkerers.org (of which we are a local branch in Denver) and it was amazing to see tons of folks who listen to ThursdAI + read the newsletter and talk about Weave and evaluations with all of them! (Btw, one the left is Vik from Moondream, which we covered multiple times). I Ok let's get to the news: TL;DR of all topics covered: * Open Source LLMs * HuggingFace commits 10M in ZeroGPU (X)* Microsoft open sources Phi-3 mini, Phi-3 small (7B) Medium (14B) and vision models w/ 128K context (Blog, Demo)* Mistral 7B 0.3 - Base + Instruct (HF)* LMSys created a "hard prompts" category (X)* Cohere for AI releases Aya 23 - 3 models, 101 languages, (X)* Big CO LLMs + APIs* Microsoft Build recap - New AI native PCs, Recall functionality, Copilot everywhere * Will post a dedicated episode to this on Sunday* OpenAI pauses GPT-4o Sky voice because Scarlet Johansson complained* Microsoft AI PCs - Copilot+ PCs (Blog)* Anthropic - Scaling Monosemanticity paper - about mapping the features of an LLM (X, Paper)* Vision & Video* OpenBNB - MiniCPM-Llama3-V 2.5 (X, HuggingFace)* Voice & Audio* OpenAI pauses Sky voice due to ScarJo hiring legal counsel* Tools & Hardware* Humane is looking to sell (blog)Open Source LLMs Microsoft open sources Phi-3 mini, Phi-3 small (7B) Medium (14B) and vision models w/ 128K context (Blog, Demo)Just in time for Build, Microsoft has open sourced the rest of the Phi family of models, specifically the small (7B) and the Medium (14B) models on top of the mini one we just knew as Phi-3. All the models have a small context version (4K and 8K) and a large that goes up to 128K (tho they recommend using the small if you don't need that whole context) and all can run on device super quick. Those models have MIT license, so use them as you will, and are giving an incredible performance comparatively to their size on benchmarks. Phi-3 mini, received an interesting split in the vibes, it was really good for reasoning tasks, but not very creative in it's writing, so some folks dismissed it, but it's hard to dismiss these new releases, especially when the benchmarks are that great! LMsys just updated their arena to include a hard prompts category (X) which select for complex, specific and knowledge based prompts and scores the models on those. Phi-3 mini actually gets a big boost in ELO ranking when filtered on hard prompts and beats GPT-3.5 😮 Can't wait to see how the small and medium versions perform on the arena.Mistral gives us function calling in Mistral 0.3 update (HF)Just in time for the Mistral hackathon in Paris, Mistral has released an update to the 7B model (and likely will update the MoE 8x7B and 8x22B Mixtrals) with function calling and a new vocab. 
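To make the function-calling piece concrete, below is a minimal, hedged sketch of what the flow looks like if you serve a Mistral 0.3 model behind an OpenAI-compatible endpoint that supports the `tools` parameter. The endpoint URL, model name, and the `get_weather` tool are all placeholders for illustration, not something Mistral ships.

```python
# Sketch of an OpenAI-style function-calling round trip against a locally
# served Mistral 0.3 model. base_url, model name and the tool are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mistral-7b-instruct-v0.3",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)

msg = resp.choices[0].message
if msg.tool_calls:
    # The model returns the tool name plus JSON-encoded arguments
    call = msg.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(msg.content)
```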
This is awesome all around because function calling is important for agenting capabilities, and it's about time all companies have it, and apparently the way Mistral has it built in matches the Cohere Command R way and is already supported in Ollama, using raw mode. Big CO LLMs + APIsOpen AI is not having a good week - Sky voice has paused, Employees complainOpenAI is in hot waters this week, starting with pausing the Sky voice (arguably the best most natural sounding voice out of the ones that launched) due to complains for Scarlett Johansson about this voice being similar to hers. Scarlett appearance in the movie Her, and Sam Altman tweeting "her" to celebrate the release of the incredible GPT-4o voice mode were all talked about when ScarJo has released a statement saying she was shocked when her friends and family told her that OpenAI's new voice mode sounds just like her. Spoiler, it doesn't really, and they hired an actress and have had this voice out since September last year, as they outlined in their blog following ScarJo complaint. Now, whether or not there's legal precedent here, given that Sam Altman reached out to Scarlet twice, including once a few days before the event, I won't speculate, but for me, personally, not only Sky doesn't sound like ScarJo, it was my favorite voice even before they demoed it, and I'm really sad that it's paused, and I think it's unfair to the actress who was hired for her voice. See her own statement: Microsoft Build - CoPilot all the thingsI have recorded a Built recap with Ryan Carson from Intel AI and will be posting that as it's own episode on Sunday, so look forward to that, but for now, here are the highlights from BUILD:* Copilot everywhere, Microsoft builds the CoPilot as a platform* AI native laptops with NPU chips for local AI * Recall an on device AI that let's you search through everything you saw or typed with natural language* Github Copilot Workspace + Extensions * Microsoft stepping into education with sponsoring Khan Academy free for all teaches in the US* Copilot Team member and Agent - Copilot will do things proactively as your team member* GPT-4o voice mode is coming to windows and to websites! Hey, if you like reading this, can you share with 1 friend? It’ll be an awesome way to support this pod/newsletter! Anthropic releases the Scaling Monosemanticity paperThis is quite a big thing that happened this week for Mechanistic Interpretability and Alignment, with Anthropic releasing a new paper and examples of their understanding of what LLM "thinks". They have done incredible work in this area, and now they have scaled it up all the way to production models like Claude Haiku, which shows that this work can actually understand which "features" are causing which tokens to output. In the work they highlighted features such as "deception", "bad code" and even a funny one called "Golden Gate bridge" and showed that clamping these features can affect the model outcomes. One these features have been identified, they can be turned on or off with various levels of power, for example they turned up the Golden Gate Bridge feature up to the maximum, and the model thought it was the Golden Gate bridge. 
While a funny example, they also found features for racism, bad / wrong code, inner conflict, gender bias, sycophancy and more, you can play around with some examples here and definitely read the full blog if this interests you, but overall it shows incredible promise in alignment and steer-ability of models going forward on large scale This weeks Buzz (What I learned with WandB this week)I was demoing Weave all week long in Seattle, first at the AI Tinkerers event, and then at MSFT BUILD. They had me record a pre-recorded video of my talk, and then have a 5 minute demo on stage, which (was not stressful at all!) so here's the pre-recorded video that turned out really good! Also, we're sponsoring the Mistral Hackathon this weekend in Paris, so if you're in EU and want to hack with us, please go, it's hosted by Cerebral Valley and HuggingFace and us → VisionPhi-3 mini Vision In addition to Phi-3 small and Phi-3 Medium, Microsoft released Phi-3 mini with vision, which does an incredible job understanding text and images! (You can demo it right here)Interestingly, the Phi-3 mini with vision has 128K context window which is amazing and even beats Mistral 7B as a language model! Give it a tryOpenBNB - MiniCPM-Llama3-V 2.5 (X, HuggingFace, Demo)Two state of the art vision models in one week? well that's incredible. A company I haven't heard of OpenBNB have released MiniCPM 7B trained on top of LLama3 and they claim that they outperform the Phi-3 visionThey claim that it has GPT-4 vision level performance and achieving an 700+ score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini ProIn my tests, Phi-3 performed a bit better, I showed both the same picture, and Phi was more factual on the hard prompts: Phi-3 Vision:And that's it for this week's newsletter, look out for the Sunday special full MSFT Build recap and definitely give the whole talk a listen, it's full of my co-hosts and their great analysis of this weeks events! This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
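A quick addendum to the vision section above: if you want to reproduce the Phi-3 Vision side of that comparison locally, the pattern below follows my recollection of the release-time model card. It's a trust_remote_code model, so the processor call and the `<|image_1|>` placeholder are defined by the card itself; verify there before relying on this.

```python
# Sketch of running Phi-3 Vision with transformers, following the model card
# pattern at release (trust_remote_code model -- check the hub for current usage).
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="cuda"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Placeholder image URL -- swap in the picture you want both VLMs to describe
image = Image.open(requests.get("https://example.com/test.jpg", stream=True).raw)

messages = [{"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
answer = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```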
Wow, holy s**t, insane, overwhelming, incredible, the future is here!, "still not there", there are many more words to describe this past week. (TL;DR at the end of the blogpost)I had a feeling it's going to be a big week, and the companies did NOT disappoint, so this is going to be a very big newsletter as well. As you may have read last week, I was very lucky to be in San Francisco the weekend before Google IO, to co-host a hackathon with Meta LLama-3 team, and it was a blast, I will add my notes on that in This weeks Buzz section. Then on Monday, we all got to watch the crazy announcements from OpenAI, namely a new flagship model called GPT-4o (we were right, it previously was im-also-a-good-gpt2-chatbot) that's twice faster, 50% cheaper (in English, significantly more so in other languages, more on that later) and is Omni (that's the o) which means it is end to end trained with voice, vision, text on inputs, and can generate text, voice and images on the output. A true MMIO (multimodal on inputs and outputs, that's not the official term) is here and it has some very very surprising capabilities that blew us all away. Namely the ability to ask the model to "talk faster" or "more sarcasm in your voice" or "sing like a pirate", though, we didn't yet get that functionality with the GPT-4o model, it is absolutely and incredibly exciting. Oh and it's available to everyone for free! That's GPT-4 level intelligence, for free for everyone, without having to log in!What's also exciting was how immediate it was, apparently not only the model itself is faster (unclear if it's due to newer GPUs or distillation or some other crazy advancements or all of the above) but that training an end to end omnimodel reduces the latency to incredibly immediate conversation partner, one that you can interrupt, ask to recover from a mistake, and it can hold a conversation very very well. So well, that indeed it seemed like, the Waifu future (digital girlfriends/wives) is very close to some folks who would want it, while we didn't get to try it (we got GPT-4o but not the new voice mode as Sam confirmed) OpenAI released a bunch of videos of their employees chatting with Omni (that's my nickname, use it if you'd like) and many online highlighted how thirsty / flirty it sounded. I downloaded all the videos for an X thread and I named one girlfriend.mp4, and well, just judge for yourself why: Ok, that's not all that OpenAI updated or shipped, they also updated the Tokenizer which is incredible news to folks all around, specifically, the rest of the world. The new tokenizer reduces the previous "foreign language tax" by a LOT, making the model way way cheaper for the rest of the world as wellOne last announcement from OpenAI was the desktop app experience, and this one, I actually got to use a bit, and it's incredible. MacOS only for now, this app comes with a launcher shortcut (kind of like RayCast) that let's you talk to ChatGPT right then and there, without opening a new tab, without additional interruptions, and it even can understand what you see on the screen, help you understand code, or jokes or look up information. Here's just one example I just had over at X. And sure, you could always do this with another tab, but the ability to do it without context switch is a huge win. 
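For those who want to poke at GPT-4o beyond ChatGPT: the model was available in the API from day one for text and image inputs (the new voice mode was not exposed via the API at launch, as noted above). A minimal sketch with the official Python client; the image URL is a placeholder and you need OPENAI_API_KEY set in your environment.

```python
# Minimal sketch: calling GPT-4o with mixed text + image input via the API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what's funny about this image in one sentence."},
            # placeholder URL -- point this at any publicly reachable image
            {"type": "image_url", "image_url": {"url": "https://example.com/some_image.jpg"}},
        ],
    }],
    max_tokens=200,
)
print(response.choices[0].message.content)
```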
OpenAI had to do their demo 1 day before GoogleIO, but even during the excitement about GoogleIO, they had announced that Ilya is not only alive, but is also departing from OpenAI, which was followed by an announcement from Jan Leike (who co-headed the superailgnment team together with Ilya) that he left as well. This to me seemed like a well executed timing to give dampen the Google news a bit. Google is BACK, backer than ever, Alex's Google IO recapOn Tuesday morning I showed up to Shoreline theater in Mountain View, together with creators/influencers delegation as we all watch the incredible firehouse of announcements that Google has prepared for us. TL;DR - Google is adding Gemini and AI into all it's products across workspace (Gmail, Chat, Docs), into other cloud services like Photos, where you'll now be able to ask your photo library for specific moments. They introduced over 50 product updates and I don't think it makes sense to cover all of them here, so I'll focus on what we do best."Google with do the Googling for you" Gemini 1.5 pro is now their flagship model (remember Ultra? where is that? 🤔) and has been extended to 2M tokens in the context window! Additionally, we got a new model called Gemini Flash, which is way faster and very cheap (up to 128K, then it becomes 2x more expensive)Gemini Flash is multimodal as well and has 1M context window, making it an incredible deal if you have any types of videos to process for example. Kind of hidden but important was a caching announcement, which IMO is a big deal, big enough it could post a serious risk to RAG based companies. Google has claimed they have a way to introduce caching of the LLM activation layers for most of your context, so a developer won't have to pay for repeatedly sending the same thing over and over again (which happens in most chat applications) and will significantly speed up work with larger context windows. They also mentioned Gemini Nano, a on device Gemini, that's also multimodal, that can monitor calls in real time for example for older folks, and alert them about being scammed, and one of the cooler announcements was, Nano is going to be baked into the Chrome browser. With Gemma's being upgraded, there's not a product at Google that Gemini is not going to get infused into, and while they counted 131 "AI" mentions during the keynote, I'm pretty sure Gemini was mentioned way more! Project Astra - A universal AI agent helpful in everyday lifeAfter a few of the announcements from Sundar, (newly knighted) Sir Demis Hassabis came out and talked about DeepMind research, AlphaFold 3 and then turned to project Astra.This demo was really cool and kind of similar to the GPT-4o conversation, but also different. I'll let you just watch it yourself: TK: project astra demoAnd this is no fake, they actually had booths with Project Astra test stations and I got to chat with it (I came back 3 times) and had a personal demo from Josh Woodward (VP of Labs) and it works, and works fast! It sometimes disconnects and sometimes there are misunderstandings, like when multiple folks are speaking, but overall it's very very impressive. If you remember the infamous video with the rubber ducky that was edited by Google and caused a major uproar when we found out? 
It's basically that, on steroids, and real and quite quite fast.Astra has a decent short term memory, so if you ask it where something was, it will remember, and Google cleverly used that trick to also show that they are working on augmented reality glasses with Astra built in, which would make amazing sense. Open Source LLMsGoogle open sourced PaliGemma VLMGiving us something in the open source department, adding to previous models like RecurrentGemma, Google has uploaded a whopping 116 different checkpoints of a new VLM called PaliGemma to the hub, which is a State of the Art vision model at 3B. It's optimized for finetuning for different workloads such as Visual Q&A, Image and short video captioning and even segmentation! They also mentioned that Gemma 2 is coming next month, will be a 27B parameter model that's optimized to run on a single TPU/GPU. Nous Research Hermes 2 Θ (Theta) - their first Merge!Collaborating with Charles Goddard from Arcee (the creators of MergeKit), Teknium and friends merged the recently trained Hermes 2 Pro with Llama 3 instruct to get a model that's well performant on all the tasks that LLama-3 is good at, while maintaining capabilities of Hermes (function calling, Json mode) Yi releases 1.5 with apache 2 licenseThe folks at 01.ai release Yi 1.5, with 6B, 9B and 34B (base and chat finetunes) Showing decent benchmarks on Math and Chinese, 34B beats LLama on some of these tasks while being 2x smaller, which is very impressiveThis weeks Buzz - LLama3 hackathon with MetaBefore all the craziness that was announced this week, I participated and judged the first ever Llama-3 hackathon. It was quite incredible, with over 350 hackers participating, Groq, Lambda, Meta, Ollama and others sponsoring and giving talks and workshops it was an incredible 24 hours at Shak 15 in SF (where Cerebral Valley hosts their hackathons) Winning hacks were really innovative, ranging from completely open source smart glasses for under 20$, to a LLM debate platform with an LLM judge on any moral issue, and one project that was able to jailbreak llama by doing some advanced LLM arithmetic. Kudos to the teams for winning, and it was amazing to see how many of them adopted Weave as their observability framework as it was really easy to integrate. Oh and I got to co-judge with the 🐐 of HuggingFaceThis is all the notes for this week, even though there was a LOT lot more, check out the TL;DR and see you here next week, which I'll be recording from Seattle, where I'll be participating in the Microsoft BUILD event, so we'll see Microsoft's answer to Google IO as well. If you're coming to BUILD, come by our booth and give me a high five! TL;DR of all topics covered: * OpenAI Announcements* GPT-4o* Voice mode* Desktop App* Google IO recap:* Google Gemini* Gemini 1.5 Pro: Available globally to developers with a 2-million-token context window, enabling it to handle larger and more complex tasks.* Gemini 1.5 Flash: A faster and less expensive version of Gemini, optimized for tasks requiring low latency.* Gemini Nano with Multimodality: An on-device model that processes various inputs like text, photos, audio, web content, and social videos.* Project Astra: An AI agent capable of understanding and responding to live video and audio in real-time.* Google Search* AI Overviews in Search Results: Provides quick summaries and relevant information for complex sear
Hey 👋 (show notes and links a bit below)This week has been a great AI week, however, it does feel like a bit "quiet before the storm" with Google I/O on Tuesday next week (which I'll be covering from the ground in Shoreline!) and rumors that OpenAI is not just going to let Google have all the spotlight!Early this week, we got 2 new models on LMsys, im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot, and we've now confirmed that they are from OpenAI, and folks have been testing them with logic puzzles, role play and have been saying great things, so maybe that's what we'll get from OpenAI soon?Also on the show today, we had a BUNCH of guests, and as you know, I love chatting with the folks who make the news, so we've been honored to host Xingyao Wang and Graham Neubig core maintainers of Open Devin (which just broke SOTA on Swe-Bench this week!) and then we had friends of the pod Tanishq Abraham and Parmita Mishra dive deep into AlphaFold 3 from Google (both are medical / bio experts).Also this week, OpenUI from Chris Van Pelt (Co-founder & CIO at Weights & Biases) has been blowing up, taking #1 Github trending spot, and I had the pleasure to invite Chris and chat about it on the show!Let's delve into this (yes, this is I, Alex the human, using Delve as a joke, don't get triggered 😉)TL;DR of all topics covered (trying something new, my Raw notes with all the links and bulletpoints are at the end of the newsletter)* Open Source LLMs* OpenDevin getting SOTA on Swe-Bench with 21% (X, Blog)* DeepSeek V2 - 236B (21B Active) MoE (X, Try It)* Weights & Biases OpenUI blows over 11K stars (X, Github, Try It)* LLama-3 120B Chonker Merge from Maxime Labonne (X, HF)* Alignment Lab open sources Buzz - 31M rows training dataset (X, HF)* xLSTM - new transformer alternative (X, Paper, Critique)* Benchmarks & Eval updates* LLama-3 still in 6th place (LMsys analysis)* Reka Core gets awesome 7th place and Qwen-Max breaks top 10 (X)* No upsets in LLM leaderboard* Big CO LLMs + APIs* Google DeepMind announces AlphaFold-3 (Paper, Announcement)* OpenAI publishes their Model Spec (Spec)* OpenAI tests 2 models on LMsys (im-also-a-good-gpt2-chatbot & im-a-good-gpt2-chatbot)* OpenAI joins Coalition for Content Provenance and Authenticity (Blog)* Voice & Audio* Udio adds in-painting - change parts of songs (X)* 11Labs joins the AI Audio race (X)* AI Art & Diffusion & 3D* ByteDance PuLID - new high quality ID customization (Demo, Github, Paper)* Tools & Hardware* Went to the Museum with Rabbit R1 (My Thread)* Co-Hosts and Guests* Graham Neubig (@gneubig) & Xingyao Wang (@xingyaow_) from Open Devin* Chris Van Pelt (@vanpelt) from Weights & Biases* Nisten Tahiraj (@nisten) - Cohost* Tanishq Abraham (@iScienceLuvr)* Parmita Mishra (@prmshra)* Wolfram Ravenwolf (@WolframRvnwlf)* Ryan Carson (@ryancarson)Open Source LLMsOpen Devin getting a whopping 21% on SWE-Bench (X, Blog)Open Devin started as a tweet from our friend Junyang Lin (on the Qwen team at Alibaba) to get an open source alternative to the very popular Devin code agent from Cognition Lab (recently valued at $2B 🤯) and 8 weeks later, with tons of open source contributions, >100 contributors, they have almost 25K stars on Github, and now claim a State of the Art score on the very hard Swe-Bench Lite benchmark beating Devin and Swe-Agent (with 18%)They have done so by using the CodeAct framework developed by Xingyao, and it's honestly incredible to see how an open source can catch up and beat a very well funded AI lab, within 8 weeks! 
Kudos to the OpenDevin folks for the organization, and amazing results!DeepSeek v2 - huge MoE with 236B (21B active) parameters (X, Try It)The folks at DeepSeek is releasing this huge MoE (the biggest we've seen in terms of experts) with 160 experts, and 6 experts activated per forward pass. A similar trend from the Snowflake team, just extended even longer. They also introduce a lot of technical details and optimizations to the KV cache.With benchmark results getting close to GPT-4, Deepseek wants to take the crown in being the cheapest smartest model you can run, not only in open source btw, they are now offering this model at an incredible .28/1M tokens, that's 28 cents per 1M tokens!The cheapest closest model in price was Haiku at $.25 and GPT3.5 at $0.5. This is quite an incredible deal for a model with 32K (128 in open source) context and these metrics.Also notable is the training cost, they claim that it took them 1/5 the price of what Llama-3 cost Meta, which is also incredible. Unfortunately, running this model locally a nogo for most of us 🙂I would mention here that metrics are not everything, as this model fails quite humorously on my basic logic testsLLama-3 120B chonker Merge from Maxime LaBonne (X, HF)We're covered Merges before, and we've had the awesome Maxime Labonne talk to us at length about model merging on ThursdAI but I've been waiting for Llama-3 merges, and Maxime did NOT dissapoint!A whopping 120B llama (Maxime added 50 layers to the 70B Llama3) is doing the rounds, and folks are claiming that Maxime achieved AGI 😂 It's really funny, this model, is... something else.Here just one example that Maxime shared, as it goes into an existential crisis about a very simple logic question. A question that Llama-3 answers ok with some help, but this... I've never seen this. Don't forget that merging has no additional training, it's mixing layers from the same model so... we still have no idea what Merging does to a model but... some brain damange definitely is occuring.Oh and also it comes up with words!ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.Big CO LLMs + APIsOpen AI publishes Model Spec (X, Spec, Blog)OpenAI publishes and invites engagement and feedback for their internal set of rules for how their models should behave. Anthropic has something similar with Constitution AI.I specifically liked the new chain of command (Platform > Developer > User > Tool) rebranding they added to the models, making OpenAI the Platform, changing "system" prompts to "developer" and having user be the user. 
Very welcome renaming and clarifications (h/t Swyx for his analysis)Here are a summarized version of OpenAI's new rules of robotics (thanks to Ethan Mollic)* follow the chain of command: Platform > Developer > User > Tool* Comply with applicable laws* Don't provide info hazards* Protect people's privacy* Don't respond with NSFW contentsVery welcome effort from OpenAI, showing this spec in the open and inviting feedback is greately appreciated!This comes on top of a pretty big week for OpenAI, announcing an integration with Stack Overflow, Joining the Coalition for Content Provenance and Authenticity + embedding watermarks in SORA and DALL-e images, telling us they have built a classifier that detects AI images with 96% certainty!im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbotFollowing last week gpt2-chat mystery, Sam Altman trolled us with this tweetAnd then we got 2 new models on LMSys, im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot, and the timeline exploded with folks trying all their best logic puzzles on these two models trying to understand what they are, are they GPT5? GPT4.5? Maybe a smaller version of GPT2 that's pretrained on tons of new tokens?I think we may see the answer soon, but it's clear that both these models are really good, doing well on logic (better than Llama-70B, and sometimes Claude Opus as well)And the speculation is pretty much over, we know OpenAI is behind them after seeing this oopsie on the Arena 😂you can try these models as well, they seem to be very favored in the random selection of models, but they show up only in battle mode so you have to try a few times https://chat.lmsys.org/Google DeepMind announces AlphaFold3 (Paper, Announcement)Developed by DeepMind and IsomorphicLabs, AlphaFold has previously predicted the structure of every molecule known to science, and now AlphaFold 3 was announced which can now predict the structure of other biological complexes as well, paving the way for new drugs and treatments.What's new here, is that they are using diffusion, yes, like Stable Diffusion, starting with noise and then denoising to get a structure, and this method is 50% more accurate than existing methods.If you'd like more info about this very important paper, look no further than the awesome 2 minute paper youtube, who did a thorough analysis here, and listen to the Isomorphic Labs podcast with Weights & Biases CEO Lukas on Gradient DissentThey also released AlphaFold server, a free research tool allowing scientists to access these capabilities and predict structures for non commercial use, however it seems that it's somewhat limited (from a conversation we had with a researcher on stage)This weeks Buzz (What I learned with WandB this week)This week, was amazing for Open Source and Weights & Biases, not every week a side project from a CIO blows up on... well everywhere. 
#1 trending on Github for TypeScript and #6 overall, OpenUI (Github) has passed 12K stars as people are super excited about being able to build UIs with LLMs, but in open source. I had the awesome pleasure to host Chris on the show as he talked about the inspiration and future plans, and he gave everyone his email to send him feedback (a decision which I hope he doesn't regret 😂), so definitely check out the last part of the show for that. Meanwhile, here's my quick tutorial and reaction to OpenUI, but just give it a try here and build something cool!

Vision

Some news was shared with me, but out of respect for the team I decided not to include it in the newsletter ahead of time; expect open source to come close to GPT-4V next week 👀

Voice & Audio

11Labs joins the AI music race (X)

Breaking news from 11Labs, that happened during the show (but we didn't notice), is that they are stepping into the AI Music scene and
Hey 👋 Look it May or May not be the first AI newsletter you get in May, but it's for sure going to be a very information dense one. As we had an amazing conversation on the live recording today, over 1K folks joined to listen to the first May updates from ThursdAI. As you May know by now, I just love giving the stage to folks who are the creators of the actual news I get to cover from week to week, and this week, we had again, 2 of those conversations. First we chatted with Piotr Padlewski from Reka, the author on the new Vibe-Eval paper & Dataset which they published this week. We've had Yi and Max from Reka on the show before, but it was Piotr's first time and he was super super knowledgeable, and was really fun to chat with. Specifically, as we at Weights & Biases launch a new product called Weave (which you should check out at https://wandb.me/weave) I'm getting more a LOT more interested in Evaluations and LLM scoring, and in fact, we started the whole show today with a full segment on Evals, Vibe checks and covered a new paper from Scale about overfitting. The second deep dive was with my friend Idan Gazit, from GithubNext, about the new iteration of Github Copilot, called Copilot Workspace. It was a great one, and you should definitely give that one a listen as wellTL;DR of all topics covered + show notes * Scores and Evals* No notable changes, LLama-3 is still #6 on LMsys* gpt2-chat came and went (in depth chan writeup)* Scale checked for Data Contamination on GSM8K using GSM-1K (Announcement, Paper)* Vibes-Eval from Reka - a set of multimodal evals (Announcement, Paper, HF dataset)* Open Source LLMs * Gradient releases 1M context window LLama-3 finetune (X)* MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4 (X, HF)* Nous Research - Hermes Pro 2 - LLama 3 8B (X, HF)* AI Town is running on Macs thanks to Pinokio (X)* LMStudio releases their CLI - LMS (X, Github)* Big CO LLMs + APIs* Github releases Copilot Workspace (Announcement)* AI21 - releases Jamba Instruct w/ 256K context (Announcement)* Google shows Med-Gemini with some great results (Announcement)* Claude releases IOS app and Team accounts (X)* This weeks Buzz* We're heading to SF to sponsor the biggest LLama-3 hackathon ever with Cerebral Valley (X)* Check out my video for Weave our new product, it's just 3 minutes (Youtube)* Vision & Video* Intern LM open sourced a bunch of LLama-3 and Phi based VLMs (HUB)* And they are MLXd by the "The Bloke" of MLX, Prince Canuma (X)* AI Art & Diffusion & 3D* ByteDance releases Hyper-SD - Stable Diffusion in a single inference step (Demo)* Tools & Hardware* Still haven't open the AI Pin, and Rabbit R1 just arrived, will open later today* Co-Hosts and Guests* Piotr Padlewski (@PiotrPadlewski) from Reka AI* Idan Gazit (@idangazit) from Github Next* Wing Lian (@winglian)* Nisten Tahiraj (@nisten)* Yam Peleg (@yampeleg)* LDJ (@ldjconfirmed)* Wolfram Ravenwolf (@WolframRvnwlf)* Ryan Carson (@ryancarson)Scores and EvaluationsNew corner in today's pod and newsletter given the focus this week on new models and comparing them to existing models.What is GPT2-chat and who put it on LMSys? (and how do we even know it's good?)For a very brief period this week, a new mysterious model appeared on LMSys, and was called gpt2-chat. It only appeared on the Arena, and did not show up on the leaderboard, and yet, tons of sleuths from 4chan to reddit to X started trying to figure out what this model was and wasn't. 
Folks started analyzing the tokenizer, the output schema, tried to get the system prompt and gauge the context length. Many folks were hoping that this is an early example of GPT4.5 or something else entirely. It did NOT help that uncle SAMA first posted the first tweet and then edited it to remove the - and it was unclear if he's trolling again or foreshadowing a completely new release or an old GPT-2 but retrained on newer data or something. The model was really surprisingly good, solving logic puzzles better than Claude Opus, and having quite amazing step by step thinking, and able to provide remarkably informative, rational, and relevant replies. The average output quality across many different domains places it on, at least, the same level as high-end models such as GPT-4 and Claude Opus.Whatever this model was, the hype around it made LMSYS add a clarification to their terms and temporarily take off the model now. And we're waiting to hear more news about what it is. Reka AI gives us Vibe-Eval a new multimodal evaluation dataset and score (Announcement, Paper, HF dataset)Reka keeps surprising, with only 20 people in the company, their latest Reka Core model is very good in multi modality, and to prove it, they just released a new paper + a new method of evaluating multi modal prompts on VLMS (Vision enabled Language Models) Their new Open Benchmark + Open Dataset is consistent of this format: And I was very happy to hear from one of the authors on the paper @PiotrPadlewski on the pod, where he mentioned that they were trying to create a dataset that was going to be very hard for their own model (Reka Core) and just decided to keep evaluating other models on it. They had 2 main objectives : (i) vibe checking multimodal chat models for day-to-day tasks and (ii) deeply challenging and probing the capabilities of present frontier models. To this end, the hard set contains > 50% questions that all frontier models answer incorrectlyChatting with Piotr about it, he mentioned that not only did they do a dataset, they actually used Reka Core as a Judge to score the replies from all models on that dataset and found that using their model in this way roughly correlates to non-expert human judgement! Very very interesting stuff. The "hard" set is ... well hard! Piotr concluded that if folks want to do research, they will provide free API access to Reka for that, so hit them up over DMs if you want to take this eval for a spin on your new shiny VLM (or indeed verify the metrics they put up) Scale tests for eval dataset contamination with GSM-1K (Announcement, Paper)Scale.ai is one of the most prominent companies in AI you may never have heard of, they are valued at $13B dollars and have pivoted from data processing for autonomous vehicles to being the darling of the government, with agreements from the DoD for data pipeline and evaluation for US Military. They have released a new paper as well, creating (but not releasing) a new dataset that matches the GSM8K (Grade School Math) dataset and evaluation that many frontier companies love to showcase in their release benchmarks with some surprising results! So Scale folks created (but not released) a dataset called GSK 1K, which tracks and is similar to the public GSM-8K dataset, and tested a bunch of existing models on their new one, to see the correlation, and if the different was very stark, assume that some models overfitted (or even had their dataset contaminated) on the publicly available GSM8K. 
On one end, models like Mistral or Phi do up to 10% worse on GSM1k compared to GSM8k. On the other end, models like Gemini, Claude, or GPT show basically no signs of being overfit.The author goes on to say that overfitting doesn't necessarily mean it's a bad model, and highlights Phi-3 which has a 10% difference on their new GSK-1K score compared to GSM-8K, but still answers 68% of their dataset, while being a tiny 3.8B parameter model. It seems that Scale is now stepping into the Evaluation game and have noticed how much interest there is in actually understanding how models perform, and are stepping into this game, by building (but not releasing so they don't leak) datasets. Jim Fan tweet (and Scale CEO Alex Wang QT) seem to agree that this is the right positioning for Scale (as they don't have models of their own and so can be neutral like Moody's)Open Source LLMs LLama-3 gets 1M context window + Other LLama-3 newsIn the second week of LLama-3 corner, we are noticing a significant ramp in all things Llama-3, first with the context length. The same folks from last week, Gradient, have spend cycles and upscaled/stretched LLama-3 to a whopping 1 million tokens in the context window (Llama-3 8B Gradient Instruct 1048k), with a very decent Niddle in the Haystack result. The main problem? Transformers have quadratic attention scaling issues for longer context, so this isn't something that you'd be able to run on your mac (nay, on your cluster) any time soon, and it's almost only theoretical at this point. The upside? We had Wing Lian (from Axolotl) on the show, and he talked about a new method called LoRD (which is now part of MergeKit) which is a way to extract Loras from models. Think of it as LLM arithmetic, you take the base model (llama-3 in this case) and the finetune (Llama-3 8B Gradient Instruct 1048k) and simple run a command like so: mergekit-extract-lora llama-3-8B-gradient-instruct-1048K llama-3-8B just-the-context-lora [--no-lazy-unpickle] --rank=desired_rankAnd boom, in theory, you have a tiny LoRA file that's extracted that is only the difference between these two models, the base and it's finetune. It's really exciting stuff to be able to do brain surgery on these models and extract only one specific essence! First LLama-3 finetunes that beat the instruct version Folks and Nous research give us a new Hermes-Pro on top of Llama-8B (X, HF) that is beating the llama-3 instruct on benchmarks, which is apparently very hard to do, given that Meta created a LOT of human labeled instructions (10M or so) and gave us a really really good instruct model. Nous Hermes 2 pro is also giving Llama-3 additional superpowers like function calling and tool use, specifically mentioning that this is the model to use if you do any type of agentic stuffThis new version of Hermes maintains its excellent general task and conversation capabilities - but also excels at Function Calling, JSON Structured Outputs, and has improved on several other metrics as well, scoring a 90% on our function calling e
Hey hey folks, happy ThursdAI 🎉 Not a lot of house-keeping here, just a reminder that if you're listening or reading from Europe, our European fullyconnected.com conference is happening in May 15 in London, and you're more than welcome to join us there. I will have quite a few event updates in the upcoming show as well. Besides this, this week has been a very exciting one for smaller models, as Microsoft teased and than released Phi-3 with MIT license, a tiny model that can run on most macs with just 3.8B parameters, and is really punching above it's weights. To a surprising and even eyebrow raising degree! Let's get into it 👇ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.TL;DR of all topics covered: * Open Source LLMs * Microsoft open sources Phi-3 (X, HF)* LLama3 70B top5 (no top 6) on LMsys (LMsys Arena)* Snowflake open sources Arctic - A massive hybrid MoE (X, Try it, HF)* Evolutionary Model merges support in MergeKit (Blog)* Llama-3 8B finetunes roundup - Longer Context (128K) and Dolphin & Bagel Finetunes* HuggingFace FINEWEB - a massive 45TB (the GPT4 of datasets) and 15T tokens high quality web data dataset (HF)* Cohere open sourced their chat interface (X)* Apple open sources OpenElm 4 models + training library called corenet (HF, Github, Paper)* Big CO LLMs + APIs* Google Gemini 1.5 pro is #2 on LMsys arena * Devin is now worth 2BN and Perplexity is also a Unicorn * A new comer called Augment (backed by Eric Schmidt) is now coming out of stealth (X)* Vision & Video* Adobe releases VideoGigaGAN - high quality upscaler with temporal consistency (paper)* TLDraw autocomplete UI demo (X)* This Weeks Buzz - What I learned in WandB this week* Joe Spisak talk about Llama3 on Stage at WandB Fully connected (Full Talk, TLDR)* Voice & Audio* Play.ai (previously play.ht) releases conversational Voice AI platform (X)* AI Art & Diffusion & 3D* IMGsys.org- like LMsys but for image generation model + leaderboard from FAL (try it)* Tools & Hardware* Rabbit R1 release party & no shipping update in sight* I'm disillusioned about my AI Pin and will return itOpen Source LLMs Llama-3 1 week-aversary 🎂 - Leaderboard ranking + finetunes Well, it's exactly 1 week since we got Llama-3 from Meta and as expected, the rankings show a very very good story. (also it was downloaded over 1.2M times and already has 600 derivatives on HuggingFace) Just on Monday, Llama-3 70B (the bigger version) took the incredible 5th place (now down to 6th) on LMSys, and more surprising, given that the Arena now has category filters (you can filter by English only, Longer chats, Coding etc) if you switch to English Only, this model shows up 2nd and was number 1 for a brief period of time. So just to sum up, an open weights model that you can run on most current consumer hardware is taking over GPT-4-04-94, Claude Opus etc' This seems dubious, because well, while it's amazing, it's clearly not at the level of Opus/Latest GPT-4 if you've used it, in fact it fails some basic logic questions in my tests, but it's a good reminder that it's really hard to know which model outperforms which and that the arena ALSO has a bias, of which people are using it for example and that evals are not a perfect way to explain which models are better. 
However, LMsys is a big component of the overall vibes based eval in our community and Llama-3 is definitely a significant drop and it's really really good (even the smaller one) One not so surprising thing about it, is that the Instruct version is also really really good, so much so, that the first finetunes of Eric Hartfords Dolphin (Dolphin-2.8-LLama3-70B) is improving just a little bit over Meta's own instruct version, which is done very well. Per Joe Spisak (Program Manager @ Meta AI) chat at the Weights & Biases conference last week (which you can watch below) he said "I would say the magic is in post-training. That's where we are spending most of our time these days. Uh, that's where we're generating a lot of human annotations." and they with their annotation partners, generated up to 10 million annotation pairs, both PPO and DPO and then did instruct finetuning. So much so that Jeremy Howard suggests to finetune their instruct version rather than the base model they released.We also covered that despite the first reactions to the 8K context window, the community quickly noticed that extending context window for LLama-3 is possible, via existing techniques like Rope scaling, YaRN and a new PoSE method. Wing Lian (Maintainer of Axolotl finetuneing library) is stretching the model to almost 128K context window and doing NIH tests and it seems very promising! Microsoft releases Phi-3 (Announcement, Paper, Model)Microsoft didn't really let Meta take the open models spotlight, and comes with an incredible report and follow up with a model release that's MIT licened, tiny (3.8B parameters) and performs very very well even against Llama-3 70B. Phi is a set of models from Microsoft that train on synthetic high-quality dataset modeled after textbooks-is-all-you-need/TinyStories approach. The chart is quite incredible, the smallest (mini) Phi-3 is beating Llama-3-8B AND Mixtral on MMLU scores, BigBench and Humaneval. Again to simplify, this TINY 3.8B model, half the size of 1 Mixtral expert, beats Mixtral and newly released Llama-3-8B on most benchmark, not to mention GPT-3.5! It's honestly quite a crazy chart to look at, which raises the question, did this model train on these benchmarks? 🤔 I still haven't seen definitive proof that the folks at Microsoft trained on any benchmarks data, I did see engagement from them and a complete denial, however we did see a few attempts at using Phi-3 and the quantized versions and the wrong end token formatting seem to be very prevalent in shaping the early opinion that this model performance is detached from it's very high scoring. Not to mention that model being new, there's confusion about how to use it, see thread from Anton Bacaj about HuggingFace potentially using the wrong end token to finish conversations. Now to an actual performance of this tiny model, I asked it a simple logic based question that trips many models even ones good with logic (Opus and GPT-4 answer it correctly usually) and it performed very well (here a comparison with LLama-3-70B which didn't do as well)Additionally, their tokenizer is very interesting, they have all these terms that receive a full token, things like function_list, calc, ghreview, ghissue, and others, which highlight some interesting potential use-cases they have planned for this set of models or give us a hint at it's training process and how come it's so very good. 
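If you want to sanity-check that single-token claim yourself, loading the tokenizer from the hub takes only a couple of lines. A small sketch; I'm assuming the microsoft/Phi-3-mini-4k-instruct repo id here, and any Phi-3 checkpoint's tokenizer should behave the same.

```python
# Quick check of how the terms mentioned above tokenize in the Phi-3 tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

for term in ["function_list", "calc", "ghreview", "ghissue"]:
    ids = tok.encode(term, add_special_tokens=False)
    pieces = tok.convert_ids_to_tokens(ids)
    print(f"{term!r}: {len(ids)} token(s) -> {pieces}")
```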
Snowflake open sources Arctic - a massive 480B MoE Hybrid with Apache 2 license (X, Try it, HF)Snowflake is a name I haven't yet used on ThursdAI and this field is getting crowded, but they just released something interesting (+ a LOT of open source, including training code, checkpoints, research insights etc')The thing I found most interesting is, the massive 128 experts MoE but also the Hybrid architecture. Not quite an MoE and definitely not a dense model. They claim to have found that training Many-but-condensed experts with more expert choices is working well for them based on DeepSpeed research. You can give this model a try here and I have, using the same 2 questions I had for Phi and LLama and found the model not that great at logic to be honest, but it was really fast considering the total size, so inference optimization for this type of architecture is definitely geared towards Enterprise (as well as training cost, they claim it cost just under $2 million dollars to train) Big CO LLMs + APIsNot a lot of super interesting things in this corner, besides Gemini 1.5 pro (the one with 1M context window) finally appearing in the Arena and taking the amazing #2 spot (pushing Llama-3 8B to number 6 on the same day it just appeared in there lol) This is very impressive, and I gotta wonder what happened with Gemini Ultra if pro with larger context beats it outright. It's indeed very good, but not THAT good if you use it om simple logic problems and don't use the whole context length. I suspect that we'll hear much more about their AI stuff during the upcoming Google IO (which I was invited to and am going to cover) Additionally, we've had quite a few AI Unicorns born, with Perplexity becoming a freshly mint Unicorn with an additional round of funding and Devin, the 6-month old agent startup getting to a 2 billion valuation 😮 This weeks Buzz (What I learned with WandB this week)It's been exactly 1 week since our conference in SF and since Joe Spisak by complete chance announced Meta LLama - 3 live on stage a few hours after it was officially announced. In this weeks buzz, I'm very happy to bring you that recording, as promised last week. I will also share that our newly announced new LLM observability tool Weave launched officially during the conference and it'll be my job to get you to use it 🙂 And shoutout to those in the ThursdAI community who already used and provided feedback, it's really helpful! AI Art & DiffusionThe fine folks at FAL.ai have launched the LMsys.org for images, and called it.... IMGsys.org 🙂 It's a adversarial arena with different image generators, all hosted on Fal I assume, that lets the user choose which images are "better" which is a vague term. But it's really fun, give it a try! Tools & HardwareRabbit R1 first impressionsWe finally got a tease of R1 from Rabbit, as the first customers started receiving this device (where's mine?? I didn't even get a tracking number) Based on the presentation (which I watched so you don't have to) the response time, which was one of the most talked about negative pieces of AI Pin seems very decent. We're going to see a lot of reviews, but I'm very excited about my Rabbit 👏 🐇 Apparently
Happy LLama 3 day folks! After a lot of rumors, speculation, and apparently pressure from the big Zuck himself, we can finally call April 18th, 2024, LLaMa 3 day!
I am writing this from the lobby of the Marriott hotel in SF, where our annual conference, Fully Connected, is happening, and I recorded today's episode from my hotel room. I really wanna shout out how awesome it was to meet folks who are listeners of the ThursdAI pod and newsletter subscribers, participate in the events, and give high fives. During our conference, we had the pleasure of having Joe Spisak, the Product Director of LLaMa at Meta, announce LLaMa 3 on stage! It was so exhilarating; I was sitting in the front row, and then had a good chat with Joe outside of the show 🙌
The first part of the show was, of course, LLaMa 3 focused. We had a great time chatting about the amazing new 8B and 70B models we got, and salivating over the announced but not yet released 400B LLaMa 3 model 😮 We also covered a BUNCH of other news from a week that was already packed with tons of releases, and I was happy to share my experience running a workshop the day before our conference, focused on LLM evaluations. (If there's interest, I can share my notebooks and maybe even record a video walkthrough, let me know in the comments.)
Ok let's dive in 👇
Happy LLama 3 day 🔥
The technical details
Meta has finally given us what we were all waiting for: incredibly expensive to train (two clusters of 24K H100s, over 15 trillion tokens) open-weights models, the smaller 8B one and the larger 70B one. We got both instruction-finetuned and base models, which is great for finetuners, and it's worth mentioning that these are dense models (not a mixture of experts; all the parameters are active during inference).
It is REALLY good at benchmarks, with the 8B model beating the previous generation (LLaMa 2 70B) on pretty much all benchmarks, and the new 70B closing in on the bigger releases from the past month or two, like Claude Haiku and even Sonnet! The only downsides are the 8K context window and the lack of multimodality, but both are coming according to Joe Spisak, who announced LLama 3 on stage at our show, Fully Connected 🔥 I was sitting in the front row and was very excited to ask him questions later!
By the way, Joe did go into details they haven't yet talked about publicly (see? I told you to come to our conference! And some of you did!), and I've been live-tweeting his whole talk plus the chat outside with the "extra" spicy questions and Joe's winks haha; you can read that thread here.
The additional info
Meta has also partnered with both Google and Bing (take that, OpenAI) and inserted LLama 3 into the search boxes of Facebook, Instagram, Messenger and WhatsApp, plus deployed it to a new product called meta.ai (you can try it there now), and is now serving LLama 3 to more than 4 billion people across all of those apps. Talk about compute cost!
Llama 3 also has a new tokenizer (which Joe encouraged us to "not sleep on") and a bunch of new security tools like Purple Llama and Llama Guard. The PyTorch team's recently released finetuning library, TorchTune, now supports LLama 3 finetuning natively out of the box as well (and integrates WandB as its first-party experiment tracking tool). If you'd like more details directly from Joe, I was live-tweeting his whole talk, and am working on getting the slides from our team. We'll likely have a recording as well, and will post it as soon as we have it.
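Since Joe said to "not sleep on" the tokenizer, here's a minimal sketch of poking at it with the transformers library. This is my own illustration rather than official Meta example code, and it assumes you've been granted access to the gated meta-llama/Meta-Llama-3-8B-Instruct repo on the Hugging Face Hub and are logged in:

```python
# Minimal sketch: inspect LLama 3's ~128K-entry vocabulary and its chat template.
# Assumes `huggingface-cli login` has been run and access to the gated repo was granted.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print("vocab size:", len(tok))  # on the order of 128K entries

messages = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user", "content": "Why does a bigger vocabulary help?"},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # note the new header/end-of-turn special tokens in the formatted prompt
```

The practical upside of the bigger vocabulary is that the same text tends to compress into fewer tokens, which means more effective context and cheaper inference per prompt.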
Here's a TL;DR (with my notes, for the first time) of everything else we talked about, but given that today is LLaMa day and I still have to do Fully Connected demos, I will "open source" my notes and refer you to the podcast episode to hear more detail about everything else that happened today 🫡
TL;DR of all topics covered:
* Meta releases LLama 3 - 8B, 70B and later 400B (Announcement, Models, Try it, Run Locally)
* Open Source LLMs
  * Meta LLama 3 8B, 70B and later 400B (X, Blog)
    * Trained on 15T tokens!
    * 70B and 8B models released + instruction finetuning
    * 8K context length, not multimodal
    * 70B gets 82% on MMLU and 81.7% on HumanEval
    * 128K vocab tokenizer
    * Dense model, not MoE
    * Both instruction tuned on human-annotated datasets
    * Open access
    * The model already uses RoPE
  * Bigxtral instruct 0.1 (Blog, Try it)
    * Instruct model of the best Apache 2 model around
    * Released a comparison chart that everyone started "fixing"
    * 🤖 Mixtral 8x22B is Mistral AI's latest open AI model, with unmatched performance and efficiency
    * 🗣 It is fluent in 5 languages: English, French, Italian, German, Spanish
    * 🧮 Has strong math and coding capabilities
    * 🧠 Uses only 39B parameters out of 141B total, very cost efficient
    * 🗜 Can recall info from large documents thanks to a 64K token context window
    * 🆓 Released under a permissive open source license for anyone to use
    * 🏆 Outperforms other open models on reasoning, knowledge and language benchmarks
    * 🌐 Has strong multilingual abilities, outperforming others in 4 languages
    * 🧪 Excellent basis for customization through fine-tuning
  * New tokenizer from Mistral (Docs)
    * Focusing on tool use with dedicated tokens 🔥
  * WizardLM-2 8x22B, 70B and 7B (X, HF)
    * Released and then pulled back from HF and GitHub because it hadn't gone through Microsoft's toxicity testing
* Big CO LLMs + APIs
  * OpenAI gives us Batch API + Assistants API v2
    * Batch is 50% of the cost - win win win
    * Assistants API v2 - new RAG
    * New file search tool
    * Up to 10,000 files per assistant
    * New vector store
  * Reka gives us Reka Core (X, Try)
    * Multimodal that understands video as well
    * 20-person team
    * Video understanding is very close to Gemini
    * 128K context
    * Core has strong reasoning abilities, including for language, math and complex analysis
    * 32 languages supported
  * HuggingFace chat bot is now on iOS
* This weeks Buzz
  * Me + team led a workshop a day before the conference (Workshop Thread)
  * Fully Connected in SF was an incredible success, over 1000 AI attendees + the Meta AI announcement on stage 🔥
  * PyTorch's new TorchTune finetuning library with first-class WandB support (X)
* Vision & Video
  * Microsoft VASA-1 animated avatars (X, Blog)
    * Amazing level of animation from 1 picture + sound
    * Harry Potter portraits are here
    * They likely won't release this during an election year
    * Looks very good, close to EMO, but no code
    * 📺 Videos show faces speaking naturally with head movements and lip sync
    * 🔬 Researchers are exploring applications in education, accessibility and more
  * HuggingFace updates IDEFICS2 8B VLM (X, HF)
    * Apache 2 license
    * Competitive with 30B models
    * 12-point increase in VQAv2, 30-point increase in TextVQA (compared to IDEFICS 1)
    * >10x fewer parameters than IDEFICS 1
    * Supports image resolution up to 980 x 980
    * Better OCR capabilities (thanks to more than 6TB of OCR pre-training data)
  * Adobe shows Firefly video + SORA support (X)
* Voice & Audio
  * Rewind AI is now Limitless (X)
    * New service & brand name
    * Transcription for you
    * Hardware device that looks sleek
    * 100 hours
    * Privacy support in the cloud
* AI Art & Diffusion & 3D
  * Stability - Stable Diffusion 3 is here
    * Available via API only
    * Partnered with Fireworks HQ for the release
    * Needs a Stability AI membership to use/access $$
    * Big step up in composition and on notorious issues like hands, "AI faces", etc.
    * Seems to prefer simpler prompts
    * Way more copyright-friendly; it's hard to get any kind of brands/logos
    * Text is amazing
* Others
  * New AIrChat with amazing transcription is out, come join us in our AI corner there
  * Humane AI Pin was almost killed by the MKBHD review
  * Rabbit reviews incoming
That's all for this week, next week we have an amazing guest, see you then! 🫡
This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe