📅 ThursdAI - Aug 29 - AI Plays DOOM, Cerebras breaks inference records, Google gives new Geminis,...

Published: Aug 29, 2024 Duration: 01:35:04 Category: Science & Technology

I think it's time for us to get [Music] [Applause] [Music] started. Woo, I paused it in the middle, but that's fine. Welcome to ThursdAI everyone, this is August 29th, my name is Alex Volkov, and today is Thursday, which means this show comes to you live on Twitter as it has every Thursday for the past year and a half plus. Welcome to ThursdAI everyone. I see a bunch of new faces in the audience, but I also see many returning friends, and I also want to say hi to my friends here on stage: welcome Ryan, welcome Nisten. All right, we have a very interesting show today. I was preparing my notes yesterday, as I always do, trying to go through a very full AI news week, and for the first time my open source section, which is the main category of news that I love on ThursdAI, was not full of news — which was new to me; I sent it in the group DM like, what, this is new. But then we did get a few items in open source, we do have a bunch of news in the big companies and APIs, there have been quite a few changes in the metrics and everything, and the vision and video category is going to blow up this week: there's a new video model, and there's also a breaking news thing that I got some insight into. We also have a chat today with the folks who broke the record for the fastest inference in the world, again. If you guys remember, we've been tracking this: we talked with the folks from Groq when they came out with the fastest inference engine in the world, we then covered SambaNova, who beat Groq, then Groq hit back again, and this week Cerebras came out with their inference on their crazy big wafer-scale chips. So we're going to talk with Ian and James from Cerebras in the second hour of the show, and that's going to be interesting: they're going to tell us how the hell they broke whatever crazy speed Groq and SambaNova already had — a deep conversation into how the hell they're running Llama 3.1 at over 1,700 tokens per second, something insane. So that's going to be a great conversation, plus whatever breaking news comes in on Thursday as well. So I think, yeah, let's start with the TL;DR, then we'll talk about open source and then everything else. Let's go, let's do it. [Music]

This is the TL;DR — everything that we covered on ThursdAI, August 29th. We had quite a show, a lot of guest speakers, and some breaking news as well. We started the show talking about breaking news from Nous Research. First, DisTrO, the distributed training algorithm they released, from Emozilla and a bunch of other folks. Distributed training allows companies to train LLMs across different data centers in a distributed way, and the new research shows a significant reduction in the required bandwidth, which is essentially what makes that kind of training possible. It's called DisTrO and it's really worth checking out. Additionally, Nous also released a new dataset: the Hermes function-calling dataset is now open source. We had Interstellar Ninja from Nous Research, who worked on this dataset together with Teknium, talk about how they took previous work in function calling, plus work of their own, and put it into this open-source dataset. It's now starting to get used by companies like Ollama, for example, and this is hopefully going to align the open source industry on a standardized format for function calling. This dataset is going to be a huge help for
open-source fine-tuning communities, giving their models function-calling abilities. We also covered — and open source is usually my favorite part of the show — that LinkedIn, yes LinkedIn, also has a research arm, and they released something quite incredible. It's called Liger Kernel. It's basically a one-liner that makes LLM training 20% faster while also reducing memory requirements by 60%. I will repeat this because it's quite crazy: a collection of Triton kernels now improves training by 20% while reducing the memory impact by 60%. It supports Flash Attention and PyTorch, and it's already included in Wing Lian's Axolotl training and fine-tuning library, which is incredible and is how I found out about it, so shout out Wing Lian. We also covered, as breaking news today, Junyang Lin from the Qwen team — as always, when they release something, it's big. Junyang announced that they have state-of-the-art vision results on multiple benchmarks with Qwen2-VL; they are beating even GPT-4o on several benchmarks, and the smaller versions of Qwen2-VL are now released under Apache 2.0. They now have new capabilities like up to 20 minutes of video understanding. Junyang went deep into what it takes for these models to not only understand images but also understand video — it takes more than just a frame-by-frame sequence of images — and they also have agentic capabilities there for potentially new applications, and it's now multilingual as well. So Qwen2-VL is now open-weighted, and it's definitely worth listening to this conversation with Junyang, always fun.

Then, in the big companies, LLMs and APIs section, we had a conversation with James Wang and Ian Milton from Cerebras. Cerebras launched the fastest AI inference in the world this week, on top of their wafer-scale chip, which is not a GPU but a purpose-built chip. The inference service they launched comes after they have had a training service on those chips for a while; they rebuilt an inference system on it from scratch, and they're topping out as the fastest inference for Llama in the world: around 450 tokens per second for Llama 70B and a crazy 1,870 tokens per second for Llama 3.1 8B. Again, that's 1,870 tokens per second — it's basically instantaneous. I really had fun having Ian and James on the show. We also had early access at Weights & Biases to these APIs and did an independent analysis, which I'll add to the show notes, and we verified that these crazy speeds are in fact what they say they are, as did other independent verifications we saw.

We also covered that Google released a few updates to their Gemini offerings: new Gemini 1.5 Pro and Flash models were released. The new Gemini 1.5 Pro is now the second-best LLM according to the LMSYS Chatbot Arena; it's number one in math and longer queries, up there together with ChatGPT-4o, and it's number two in basically everything else, beating all other models, and it's backed up by other metrics we saw, including Aider and BigCodeBench. I'm very excited to try this out personally. Their new Flash model, which we liked before, is now climbing the charts from number 23 to number six with a new experimental version, and they also announced a new Flash 8-billion-parameter model. So now we're getting Gemini 1.5 Pro, 1.5 Flash, and 1.5 Flash-8B. Then we talked about Anthropic. Anthropic started
publishing their system prompts with their model releases, which is a very welcome thing, and we talked about this. Also, one of the bigger things we were very happy to cover is that Anthropic's Artifacts are now available to all users. Artifacts is the feature in Claude where you ask Claude to generate something, Claude pops up a sidebar, generates a web view, and basically builds an app and shows it to you as it writes it out, so you can build mini apps. This Artifacts panel now comes to the mobile app too, so you can build those mini apps on the go: apps like "hey, build me a tip calculator that splits the bill between seven people," or "build me a mini game of Snake," or build me whatever. Folks on stage here talked about the different apps they've built. This is now available to all users — you don't have to be a paid Claude user to use Artifacts — and the fact that it's on the mobile app as well makes it very accessible and really fun to play with, so definitely play with this.

We had some announcements from OpenAI: The Information leaked a story that OpenAI is racing to launch "Strawberry" reasoning, and a product called Orion based on Strawberry reasoning, this fall at some point. We also had a tweet, breaking news from Sam Altman, that they've reached an agreement with the US AI Safety Institute and will share some models with them before release, for government safety testing. And now we have some more breaking news: OpenAI said that ChatGPT has more than 200 million weekly active users and usage has doubled since last year.

In this week's buzz, the category where I update you about everything that happens in Weights & Biases: besides telling you about different features we built into Weave, our product for LLM observability, I mentioned that our hackathon is moving forward, September 21st and 22nd in San Francisco. It's going to be about LLM-as-a-judge; we're going to build tooling and different judging approaches for LLM-as-a-judge. Feel free to join us — I'm going to be there, I'm going to MC, it's going to be a lot of fun. We also have an upcoming course about advanced RAG techniques with Cohere and Weaviate, and we had some hints about upcoming releases from Cohere that are probably going to be very relevant for this course — somebody in the comments hinted that some updates from Cohere are coming. The course is free and I'm going to add it to the show notes as well.

Then, in the vision and video category, we barely covered this, but there's a new video model called CogVideoX from Zhipu AI. It's a 5-billion-parameter video model that runs in less than 10 GB of VRAM, so it can run on most consumer video cards, and it generates video — not quite Sora, but it generates video with diffusion and it's quite cool. One of the coolest things this week from AI art, video and diffusion was GameNGen, research from Google DeepMind where they took Doom and trained Stable Diffusion 1.4 — yeah, the old one, the original one — on Doom imagery, and then basically generated video that looks like somebody is playing Doom. But none of the frames are from the Doom engine; they're generated with Stable Diffusion fine-tuned on Doom, and they look coherent and consistent, and it looks like the game is actually being played as though it's Doom — but it's not, it's Stable Diffusion generating Doom. And it plays at 20 frames per
second, and it looks a bit like lossy JPEG compression, and they said that the human raters were only slightly better than random at telling whether it's an actual Doom game or a rendered one. Something crazy is going on in AI — which is absolutely crazy.

Also, you know that I love talking about Flux and how I'm putting my face into a LoRA in Flux. I do this via fal, the service that offers to do this very cheaply, and our friends from fal came on the show and said that they have a new LoRA trainer, and they also offered a coupon code for listeners of the show — you can listen to the show and get it. The new trainer they released is significantly faster; it literally does not take much time at all to train anything with Flux. It could be your face, it could be your cat's face (probably better to use your cat's face), it could be a handbag that you love, it could be anything. So definitely train some LoRAs and post them in the comments to ThursdAI, like I see some people do, and it's awesome.

We also had a last-minute addition: Magic.dev. Magic.dev is a company that trained a groundbreaking model with 100 million tokens in the context window. Nat Friedman says it performed far better in evals than anything they tried before, and they're using it to build an advanced AI programmer that can reason over an entire codebase and have long-term memory. It sounds like magic, because it's crazy. Basically, Nat Friedman and Daniel Gross invested 100 million in the company a while ago, and now Magic says the context length they have is 100 million tokens. They announced a new partnership with Google as well. Still no product, but we're very excited to see where this is going. But I think maybe, yeah, maybe it's time to get started with our favorite corner: open source. [Music] Open source AI, let's get it started, let's get it started with open source.

Interstellar, welcome to the show. — Hi Alex, how are you? — Good, how are you, man? Welcome, Interstellar from Nous Research, first time on the show, I believe. — Yeah, this is my first time, but I've been listening to the podcast for a while now. — I appreciate you coming here, and you're coming with news. I will not delay; I will just say that you're from Nous Research, dear friends of the pod for a long time, way before they became a company, actually, back when it was just a ragtag group of people on a Discord. I chatted with Teknium yesterday and he was hinting about something that was going to get released today, and you were involved in this effort, so how about you tell us what you guys just released today? — Yeah, so we released the function-calling V1 dataset today. It was supposed to be released earlier this week, but we had DisTrO, obviously, which I think was the major news from Nous. About the function-calling data: we used it for the Hermes 2 Pro models, and even the Theta models are trained with this dataset. In the beginning we had this void of not having an OpenAI-compatible function-calling model, with Hermes being one of the preferred models, so I collaborated with Teknium: let's make a natively agentic model with structured output, like JSON mode and everything — all the Instructor library stuff that Jason Liu has been promoting with Pydantic is all we need. So we also serialize the Pydantic JSON schema into the system prompt.
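To make that structured-output idea concrete, here is a minimal sketch of the Instructor-style pattern being described: dump a Pydantic model's JSON schema and embed it in the system prompt so the model answers in JSON that validates against it. The model class, prompt wording, and `<schema>` tag below are illustrative, not the dataset's actual template.

```python
import json
from pydantic import BaseModel

# Illustrative response schema; any Pydantic model works the same way.
class WeatherReport(BaseModel):
    city: str
    temperature_c: float
    conditions: str

# Serialize the Pydantic JSON schema and embed it in the system prompt,
# asking the model to reply only with JSON that conforms to it.
schema = json.dumps(WeatherReport.model_json_schema(), indent=2)
system_prompt = (
    "You are a helpful assistant that answers in JSON. "
    "Your answer must conform to the following JSON schema:\n"
    f"<schema>\n{schema}\n</schema>"
)
print(system_prompt)
```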
From that we get JSON output, and we also have function calling in the OpenAI API standard format. The other beautiful thing about this dataset is that we use XML prompting, XML tags. At the time we developed this dataset, Claude wasn't even a SOTA model and XML prompting wasn't popular yet, but I was looking into prompting schemes to imitate for function calling and decided to use XML tags for the tool-call tags, while sticking with OpenAI's JSON schema so it would stay compatible later. That's pretty much the story behind this dataset. — That's so awesome. First of all, shout out to you for working on this for, I know, a long time. We've discussed before how some amount of function-calling understanding in these models came from, I believe, Jon Durbin with his Bagel dataset, and Airoboros that I mentioned before, and it was very interesting to see this almost emergent understanding of function calling without necessarily training on it or cleaning for it. I believe some credit goes to Glaive as well — shout out to Sahil, right? Tell me about Glaive's involvement, because I believe some of their work was related to this as well. — So we were looking for open-source datasets to augment if possible, and Glaive had an open-source dataset at the time in the OpenAI format, but they weren't really using certain tags that would make sense for us to include in the training data, so we augmented the Glaive dataset. Gorilla also had some datasets or models, I think there was Functionary, maybe Nexus Raven was there, but not many datasets were available. Fireworks had an eval dataset that we used for evaluating our models in the beginning; we also created an evaluation framework for function calling and JSON mode and sampled from Glaive and Fireworks for the evaluation set. Somebody from Lilac AI — Nikhil Thorat, I think — helped us with clustering the Glaive dataset, because Glaive's data is large but we wanted to sample the best and most diverse subset to mix with our own.

I haven't talked about how our own dataset was generated: it's based on a curriculum. We use a diverse curriculum of mobile apps and also industry SaaS applications, enterprise software — basically we use the GICS classification of industries to create a function-calling curriculum. Given this curriculum, we synthetically generate the function signatures in the OpenAI schema, and we actually use OpenAI models to call these functions first, then append the tags so we can train models with those tags. These tags are now part of the tokenizer config — they're added tokens, and they're single tokens. We also make the function calls parallel by using multiple tags rather than a single tag like some others do; we wanted to focus on parallelization and the natively agentic nature of the model. So we add these tokens that we use in the system prompt — for example, we use a tools token to indicate the available tools for the model to use, and then tool-response tags for passing the tool results back to the model as part of the tool role, which follows the OpenAI standard completely, but the tags are native to Hermes models now. Other open-source models can be trained with this format too, and the advantage of training with this format is that it's now integrated into Ollama and is being integrated into vLLM as we speak — the PR is about to get merged.
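Here is a rough sketch of the turn structure being described: tool definitions in OpenAI JSON schema inside `<tools>` tags in the system prompt, the model's call wrapped in tool-call tags, and the result passed back under the tool role in `<tool_response>` tags. The exact tag spellings and the tool definition below are my illustration; check the released dataset and chat template for the canonical format.

```python
import json

# Illustrative tool definition in the OpenAI function schema.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

messages = [
    {   # Available tools are serialized into the system prompt inside <tools> tags.
        "role": "system",
        "content": "You are a function-calling AI model. Here are the available tools:\n"
                   f"<tools>{json.dumps([get_weather])}</tools>",
    },
    {"role": "user", "content": "What's the weather in Denver?"},
    {   # The model emits an XML-wrapped JSON tool call.
        "role": "assistant",
        "content": '<tool_call>{"name": "get_weather", "arguments": {"city": "Denver"}}</tool_call>',
    },
    {   # The tool result goes back under the tool role, wrapped in <tool_response> tags.
        "role": "tool",
        "content": '<tool_response>{"temperature_c": 31, "conditions": "sunny"}</tool_response>',
    },
]
print(json.dumps(messages, indent=2))
```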
I think by the end of this week the PR will be merged, so you can use OpenAI-style function calling with Hermes models on vLLM. On Hugging Face Transformers you can use apply_chat_template — Hugging Face actually pushed a chat template, which is a Jinja template, to our models, so you can use it with them. LocalAI and other local model hosting services have also integrated our format, so the advantage is there for using our format in open-source training. — That's so dope. I love the fact that, first of all, because it's open source the industry can adopt this, and once the industry adopts it, everybody basically wins. And I love that you guys are also building on top of other open-source tools. I know Lilac is, I believe, no longer open source — I think they've been absorbed into Databricks — but at least some of it was open source, and they definitely helped get from one open-source project to another. Shout out to Glaive AI, by the way; for folks who don't know, Glaive will help you build custom models for yourself, so definitely shout out to Glaive for building and releasing their function-calling dataset as well. So now there's work that distills Glaive into your dataset — but it's also not only a dataset you released, you also released a kind of standardization in open source for how new models will use tools, and that's quite incredible. Thank you for working on this and for coming to explain the work. While I've got you here, could you give us five minutes on DisTrO and what that means? Because you've probably been involved in at least some of the chats. — Yeah, I wasn't directly involved in DisTrO, but I'm a proponent of federated model training, and we've been talking about distributed model training for a while. For open-source folks who are trying to train models, accumulating resources is difficult, especially given the sizes of the models now — a 405B model is really hard to train, very few people can do it. One of our friends, Edward, is here in the podcast; he helped me get GPUs in the beginning for testing function calling. So it's really awesome that now we can train models over the internet. I can't really talk much about the technical details — I think Emozilla can tell you more — but I think we optimize the intercommunication between GPUs, and the idea of being able to train models over the internet, with all this passing of gradients and all these updates happening over the internet, is pretty crazy. — Yeah, and maybe Nisten you can chime in here: the networking aspect, just the data communication aspect of training models, is so important — the bandwidth of the network. There's the whole NVLink thing, there's the network link, but there's also the Ethernet question people talk about, which one they need to choose within the data center itself. Maybe Ryan, you can talk about this too: when people build data centers — I think I heard, and maybe you can't talk about this, that they switch from NVLink to Ethernet or something to improve things within the same data center — it's still important to have very high data throughput within the same data center. The folks at Nous Research are saying they've cracked the potential of doing training across the internet by reducing the network requirements by about 857x, from 74 GB required down to around 86 MB, something crazy like that.
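To put those numbers in perspective, here is a quick back-of-envelope check using only the figures quoted above; the 1.2B parameter count in the second part is an assumed, illustrative model size, and the real accounting depends on the all-reduce scheme, node count, and precision used.

```python
# Ratio of the quoted per-step traffic figures: ~74 GB down to ~86 MB.
naive_bytes = 74e9    # ~74 GB per step, as quoted above
distro_bytes = 86e6   # ~86 MB per step, as quoted above
print(f"reduction factor ~ {naive_bytes / distro_bytes:.0f}x")  # in the ballpark of the ~857x quoted

# For intuition on why plain data-parallel training is so bandwidth-hungry:
# every step exchanges on the order of one full set of gradients,
# i.e. 4 bytes per parameter in fp32 (illustrative assumption only).
params = 1.2e9  # assumed model size for illustration
print(f"fp32 gradients per exchange ~ {4 * params / 1e9:.1f} GB")
```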
Ryan, have you seen the DisTrO stuff? What's your take on this? — Yeah, it's awesome, and I love seeing them push this forward. At Intel we do our best to support open standards, and obviously we're fans of Ethernet versus single providers — I won't mention names — but I think it's super cool that we're starting to see this happen over the wire, so I love it, I'm excited. — Nisten, go ahead. — Yeah, so you can write very good optimizers, and people can write code for training models on their own, but doing it in a distributed way, that's a whole other game; you need an entirely different set of skills and it's pretty hard. Now, distributed training has been done before — there was the Petals project, and Alexander Borzunov, who is now in Georgia and works on, I think, the Hivemind thing, which is also what Petals built on — so that has been done before, but it was just slow, and a lot of that work was geared more towards pre-training, or towards data-center or multi-data-center use. It wasn't really geared towards, I guess, people like us — open-source researchers or developers here and there, or small teams; it was more geared towards big projects. So this work is very much needed, because the bottleneck at the end of the day becomes the memory. Most of the chips that you see, even the super efficient ones — look at the Google TPUs, where they put 4,000 or 8,000 chips in a single rack and it's supposed to be the most efficient thing when you look at energy use — about 90% of that energy is still used for the interconnects, and that's the most efficient setup you can get. Figuring out how to reduce network bandwidth in general is extremely important, and it has exponential improvements when you're training over the network. So these kinds of efforts are very important for democratizing the technology and just allowing communities to make their own bot or whatever the heck they want to do. It's pretty hard, but very needed. — Very exciting, I love breakthroughs like this in open source. Speaking of breakthroughs like this in open source — you definitely remember this — RoPE scaling was one; that was crazy research, and definitely shout out to Emozilla, I think the chief scientist of Nous Research; he was here on the show when you guys announced Hermes. — Oh yeah, and he wrote the YaRN paper too. — Yeah, I literally asked him in the group chat for this show back then, and was able to extend Llama 3 8B to 32k just based on his work — I just used a config — and then other people managed to extend it way further, to 128k, and that just happened organically too. So I definitely love those breakthroughs; DisTrO is definitely a lot of work that went into different things.

But also, speaking of a lot of work: LinkedIn, out of nowhere, this week also released something that is a complete breakthrough in training. I just wanted to shout them out because I saw Wing Lian post about this — Wing is the author of Axolotl, the library that many folks use for fine-tuning, maybe most folks — and he posted that, hey, I already integrated this into Axolotl. It's called Liger, the LinkedIn GPU Efficient Runtime kernel: a collection of Triton kernels designed for LLM training that effectively increases multi-GPU training throughput by 20% and reduces memory usage by 60%. This is one line of code that improves the training you do by 20% while cutting memory usage by 60%.
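Here is a minimal sketch of what that one line looks like in practice, assuming the liger-kernel package's Llama patch entry point (`apply_liger_kernel_to_llama`); check the repo for the exact API for your model family before relying on it.

```python
import torch
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama  # pip install liger-kernel

# The advertised "one line": monkey-patch the Llama modules (RMSNorm, RoPE,
# SwiGLU, cross-entropy) with Liger's Triton kernels before loading the model.
apply_liger_kernel_to_llama()

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",  # any Llama-architecture checkpoint
    torch_dtype=torch.bfloat16,
)
# Then train exactly as before (Trainer, Axolotl, etc.); these are the kernels
# reported to cut memory use by ~60% and add ~20% throughput.
```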
When folks talk about "we accelerate," this is what I think about, because these jumps happen often on the show — they happen quite often. We've been doing this for a year and a half, and I remember multiple hops like this while we've been doing the show: one line here can extend the context window from like 4,000 tokens to 28,000 tokens, then it becomes a paper, then it becomes an industry standard. Things like this happen, and now LinkedIn, of everyone — I just really wanted to shout them out — releases a one-line kernel that improves training, and it's now been implemented into the Hugging Face trainer; it works with Flash Attention, with PyTorch, with DeepSpeed, with all of these things, and it improves all your training by 20%, which is quite crazy. I just wanted to shout them out and highlight how important open-source development is, because if the only folks who trained models were the closed-source labs — let's say the whole world were just the OpenAIs and Anthropics — this research wouldn't have been possible. But if Meta releases something like Llama, then LinkedIn can do all kinds of crazy research like this, test it on Llama, and see whether or not it works. So shout out to LinkedIn for this.

And now I think we have breaking news — [AI breaking news, coming at you, only on ThursdAI] — you guys know that I love pressing this button; folks who come to ThursdAI are often waiting for this button. I think we now have two breaking news items; they're not necessarily in the LLM area, but definitely in the open-source area, and as always the best way to deliver breaking news is to host the folks who are actually in charge of making the news themselves. A friend of the pod and frequent co-host as well, Junyang Lin, is here with us. You guys just posted something — why don't you tell the folks what breaking news you bring with you this week, Junyang? — Yeah, thanks a lot Alex. Hi, I'm Junyang, I'm a member of the Qwen team. Today I'm very happy to announce the release of Qwen2-VL. I think a lot of people have noticed this work, and I still remember the first time I came to ThursdAI: it was because somebody retweeted something about our Qwen-VL-Max, and then Alex came to me and asked me to come on ThursdAI to introduce our vision-language model. And this time, yeah, we finally open source the model, because you will find that — wow, it's a bit amazing — Qwen-VL-Max performs much better than other open-source models. This time we share our latest models in three sizes: we have Qwen2-VL at 72 billion, and for the smaller ones we have 7 billion and 2 billion. For the two small models, Qwen2-VL 7B and 2B, we open source the models under Apache 2.0, so you can just use them freely for your commercial usage. Temporarily, for the largest one, the 72 billion, we are still providing the APIs, but I think later there may be something new and you may be able to access it in another way, just like open weights. If you check our blog, we have published the results, and in comparison with the closed-source models — this time we just compete with the closed-source models, including GPT-4o and Claude 3.5 — for example, people pay attention to MMMU, and you can find that our Qwen2-VL 72B reaches 64.5, only behind GPT-4o at around 69, which is a bit higher
for mmu because I think it is highly related to the language model maybe our language models is a bit smaller like 72b yeah but for uh other benchmarks you can check for example uh the understanding of mathematics problems like math bua is the soda and for document understanding like Doc vqa and just so folks understand what you mean by this sorry jingang to to yeah to interrupt qu to did you just open waited beats GPT 4 on math Vista this Vision understanding model beats GPT 40 beats CLA 3.5 Sonet beats other best models like Gru on math Vista on this Vision task of looking at these math problems and solving them right like like a better better like this model is now a better score than gbt 40 this is like the top score now state-ofthe-art it's crazy yeah sorry go ahead yeah yeah I think yes if you check our Twitter and check our blog and you can find the picture of the yeah of the results you can find that there are a lot of Benchmark like mathematics understanding like math Vista and for document understanding like Doc vqa and table understanding like chart vqa in comparison with GPD 40 and Claw 3.5 we have some advantages over it and for for The Benchmark data sets and this time we have something new for the new capabilities for this model Quia it actually supports video understanding and for this model you can even input videos of over 20 minutes yeah we have tested yeah 15 minutes and 20 minutes it performs quite well you you can just question answering and chat with it about the video contents we have a lot of examples in the blog so you can check the detailed examples there for the video understanding and all the vision understanding and also some new capabilities it's about agent because there are a lot of people focusing on agent and they would like to use the vision language model for the agentic problems for for example if you are performing some agentic task you would like the agents to see the V visual informations to interact with the environment so this time we have some yeah visual uh agent capabilities for example for example like a code interpreter for example you can input a t a picture of a table just just a scan of a table and then you ask it to yeah write some code to visualize the charts just using some mapop live or or things like that and you can also ask it to interact with the environments you can actually use it to control some robots so you can connect it with your mobiles and robots so I think this is something more interesting yeah in summary I think there are uh three uh mainly four new features uh the first part is the St are performance in a lot of visual understanding benchmarks uh the second one is the video understanding and also think the intered images and video understanding and the third part is the visual agent performing task and the last one I just miss it actually supports multilingual there are a lot of people using Vis Vision language models to yeah to test its understanding of the text inside the images we have inputed the understanding of a lot of languages we have tested it can understand most European languages including Western European languages and Eastern European languages and also Japanese Korean and also Arabic Arabic is a little bit below but for Vietnamese it's okay yeah it's a general intro introduction so if you interested just check out BL yeah that's very incredible I have a few questions here but first of all thank you for coming thank you for breaking this down folks in the audience feel free to ask first of all follow J 
yank is incredible and gives us a lot of information and on Thursday I but also I have a few questions about the video stuff I really am interested in how the jump from just one image understanding goes into video understanding because there's quite a lot of stuff from just one image to a sequence of images to a video of 20 minutes how does the jump in those capabilities happen could you speak a little bit about that work yeah we have also released the details of our method especially the architecture if check the architecture part you can know more details so I can Briefly summarize the main architecture so for for the first part you can view video as a sequence of images it is a problem of understanding multiple multiple images for the video it is some kind of stuff yeah and for for our model architecture we are actually using I don't know if you still remember navit yeah navit is a very important work uh which uses Dynamic resolution for for images of different resolutions so you can put yeah multiple images of different sizes different ratios even different resolutions put it together to a long sequence so you can also put yeah videos just put it into it so you can make it inter Le and the the hard problem for understanding video is that how to understand the temporal relations yeah for for the temporal relations you you you can first uh put video into a sequence of images so there are a lot of frames to put it into it to understand the temporal relations we are actually using some new rope for the positional encoding which we call a multimodel rotary positional in uh embedding it's a little bit complex but if if you check the details you you can find that for each patch inside a video you have three dimensions for the first it is the temporal dimension which means that which time step you are and for the last two which means the height and width so you for each patch inside inside a video you have a position to to represent it but how to make it combine it with the text it is something yeah crucial but uh we have some new ways to encode give the positional encoding for each token of the text so little let ask a simpler question J maybe for for folks in the a who are not following the data science are super qu this is not anymore me breaking down just a video into five frames and asking frame by frame hey Capt model now understands sequencing understands what happened throughout the video in terms of like what actually happened right like the movement and change in the video itself yes yes exactly that's incredible so so you're just putting multiple images together just for one time into a model not just like us one frame a caption another frame a caption it do not like that yeah wow I lot of details to to yeah to process these problems but it's very interesting yeah tell me does it understand audio as well with the video or just the vision part of the video uh for this time it is just the vision part we only focus on the vision part but it is not that hard we are also working on quent to audio to combine uh audio just like uh understanding Vision to make the language model understand audio so for the next step we are going to combine the audio understanding and the vision understanding together so you you can make it a real video understanding including the audio understanding yeah no that's incredible and okay so this was my first question the the the second new feature that you said was around a gentic understanding because obviously once we put the these brains into robotics 
they're going to work around the world and they will want to do stuff I'm I'm assuming some people use agentic in the world of pixels right agentic is like function calling your agent runs and your computer does stuff this is not the agentic stuff that you're talking about you're talking about like real physical world agentic something that a robot sees and does stuff as well so could you talk about that part what kind of tasks does this model can perform like what kind of benchmarks now you're looking at visual agent stuff could you talk about those yeah for for the agent it's a complex problem to explain but for the first part we just do simple function calling to make function calling together with the vision part so you can just just as I said you can just read the image of a table and then you change the table to yeah use map live to to draw and another image yeah to visualize the charts or do things like that this is the maybe the simplest yeah agend up but our model is capable of understanding the positions just like understanding the bounding boxes so we are we are trying to make it combining with the uh environment to make something control control things to put to put the box from here to there and and doing these things like that we we're not really combining with a real robot hand but we are just using it in a yeah in in an environment to to make a test its capabilities we can find that it can really understand not only the visual information but also the position information based on the understanding of the bounty boxes yeah it is really interesting so if you check the Benchmark in performance you can find that there is a part of a visual agent and we have some test on some Benchmark data sets just gym card and Alfred yeah you you can compare with it that's super cool wow Jun first of all always awesome aome to have you here as you guys announce the news but second of all I'm very impressed by these models I'm already seeing at least it's always hard as you guys release something to also know whether or not this is cool but obviously I've been testing some of these models before I'm already seeing some folks that I trust in knowing what Vision models benchmarks look like some folks are saying that this is like better than mini CPM we covered mini CPM before which was very close to state-of-the-art in at least the area of small Vision models it looks like at least your small Vision models are either coming close or beating those probably beating those right so that's quite crazy because Min CPM was really good in the area of small models so it looks like we have another state ofthe art also in the small Model area as well and Apache 2 license in the small ones larger ones you said not Apache 2 but maybe at some point but open weights right just to sum up yeah for now it is not open weighted yet but oh for now no it will finally be open with okay that's super awesome J thank you for showing up and and telling us these models anything else you want to add for the folks who listening in there's a bunch of folks listening in where should they follow where should they go what else are we to expect from the quantum oh yeah yeah this is uh a milestone of of our development because we care much about our vision language model we finally would like to build things together to put the understanding of modalities and multitask together so for for the next step I think we are actually chasing GPD 40 yeah but before that we would like to reach stateoftheart at each aspect and then try to 
combine it together yeah that's very awesome folks definitely give junyang a follow and the shout out to the folks at Alibaba quen team for working really hard and giving us stateof the out models and definitely chasing after after the big releases and I'm very excited to to try and play with the video capability specifically I'm really excited to try and play with the video capabilities specifically J as always feel free to stick around with us and feel free yeah to stick around with us nist any comments before we move on or ldj no this is pretty sick I'm I'm going to try it out with some ultrasound videos and medical stuff because you can actually because of the open license can actually do that research now this is freaking sick yeah I I don't know because maybe some people don't understand what's going on no th this is freaking dope it's and you can just use it it runs at high speed so yeah I I guess it will take a couple of days to digest what happened but yeah great job yeah cool yeah thanks a lot for the appreciation yeah for always yeah awesome all right so lovely breaking news I was unnecessarily worried that the open source chapter of uh Thursday ey is not going to be full of news and I was unnecessarily worried cuz always we have at least an hour to talk about and now there's a bunch of stuff that I wanted to chat about and we don't have time to cover so I think we're moving on because it's been an hour and we're here and now we have to move on into some other folks other sorry conversations before that I wanted to just cover that while this show has been happening week by week and it's probably presented by W and biases uh we have a coiner that covers we and biases related news and so I call this coiner this week's Buzz I don't have a transition for this corner but let's say that this is going to be the transition so in this week's Buzz uh I have an announcement that we have a hackaton coming up it's in San Francisco and if you're in the Bay Area you should really come out honestly if you're anywhere in the United States and you want to convince your boss that it's worth for you for your professional development to just fly out and you're working with llms and you want to learn how to improve them in production there's nowhere better to come than the hakon and San Francisco that we're organizing in in the area of building llm as a judge because if you're building anything in production and you're not using anything for observability for example our toolkit that's called weave then you're basically you don't know what's going on in your application in production and you should and that's what we built we for and so we're building a hackaton to actually build tools to judge whether or not your outputs on production make sense by using another LM that's called llm as a judge there's a whole field of research that's that's happening around this topic and we're building we're Gathering folks together you're more than welcome to come also and learn how to do this with other folks come up with different methods there's a bunch of research there's never been applied to any of the newer lolms for example or never have been applied to let's say llama 3.1 for example we're going to have credits we're going to have experts Etc so you're more than welcome to come and hack with us there's going to be awesome prizes that's going to be in the September 21st and 22nd in our seisco office I'm going to leave a link first of all follow me but definitely a link in the show notes after this we also 
have a advanced rack course with the folks from coher and folks from wave8 and if you that's going to be completely online and for free I'm going to also add a link for that as well if you want to improve your rag systems that's awesome folks from both these companies we also have some breaking news from kohir I was hinted by some folks that there's some new breaking news from cir as well so we're going to mention this later in the show but we have a new upcoming course with the folks from from coh and folks from wet about how to improve your Rec systems as well this is in the this week's busz category and now we're moving into our big companies's llms and apis area basically let's [Music] go and in this chapter of Thursday I I'm very excited to have a an announcement of kind of big a big announcement because it's not every week that a company comes and says hey we've done something that other companies haven't done before we've broken the ceiling of the fastest ever inference that we've ever seen before the company was cerebras and they announced their new inference API and they broke I don't even know I think the latest that I saw was for llama 3.18 billion parameters the highest that I saw was 1,870 tokens per second and so I have the pleasure of Hosting James here and Ian from cerebras to chat about this incredible feat and what does this mean and how the hell they are doing this so welcome James welcome Ian I'm very happy to have you guys here to talk about how this happens doesn't just happen so welcome please feel free to introduce yourselves and let's talk about it sounds great Alex wow this is such a cool show I feel like I'm watching Good Morning America but for AI just just for AI though yeah yeah it is so well I did not expect it to be this I remember Twitter spaces back then used to be just a hangout space but thank you yes we we launched cerebrus inference this week very exciting launch for us because for those who followed us cerebrus were famous for our giant wafer scale chip we built that the company started in 2016 I think we released it maybe 2019 it was very exciting but for a long time was a little bit enigmatic this incredible achievement in semiconductors um but it wasn't anything that anyone can really get their hands on um but today we uh this week we launched cerebrus inference which as as you noted is by far the fastest inference API people can use it's about 10 times faster than a typical GPU solution 2x faster than grock and really the reason we're able to do this is the very very custom and kind of crazy Hardware we built you guys probably know that the typical computer chip is like the size a stamp we build a chip the size of a iPad basically it's the largest that's possible out of tsmc we invented this technology together with tsmc and to this day we're the only ones in the world to do it when I joined cers about a year and a half ago I I thought this is really great but there's no really great killer use case for it that I can see I felt like the technology was was a bit ahead of its time I think what chat jpt has done has really given us like remarkable product Market fit and finally there is a model large enough that a chip this big can really Flex its muscles and so Ian and the whole team has been basically they re wrote an entire inference stack the chip was originally designed for training but we wrote an entire inference stack to make it run super fast and the really simple explanation is we basically store the entire model whether it's 8B or 70b or 
405b coming soon entirely on the chip there's no external memory no hbm we have 44 gigabytes of memory on chip we can chain together lots of chips to aggregate as much memory as we want and since the model weights sit on chip the tokens just flow straight to the CPU was processor we're like the speeds today are actually nowhere near our top speeds people ask like why are you this speed not that speed but we thought this is pretty good enough to launch we wanted to launch cleanly with no way for anyone to usurp us so it happened to to S NOA in their prior launch and reception's been fantastic thousands of developers sign up yeah very excited to share the news yeah that's awesome I want to welcome you in as well talk to me about the the training before because I I remember the lunch of C3 is that what I'm talking about and I remember specifically around the announcement with Mayo Clinic I was very interested in this because to me the the medical the medical area is very interesting Mayo cleaning being one of the the most trusted places where you go online to to check out whether you're about to die of the two symptoms that you have and me cl is the most reputable Source they probably have the most reputable data their announcement with you together building their AI I was like very interested in this I for the longest time I wanted to reach out and talk with you guys about what are their training and they chose you and those kind of the super classet that you're also building which I would love to also talk about maybe let's start there before we go to the inference as a background because before you started with the inference you were training like and you were talking about training big models up to trillion parameters as well right let's talk could you mention this before we move to the inference as by way of background for folks who have never heard about cerebrus yeah sure training was the original kind of kind of design point for the chip we thought if you want to train models and training takes inference we've made instant now but training takes typically days and weeks we thought the best way to make that fast is once again to fit the model um entirely on triip um it turns out the models are so big that even the big biggest chip in the world can't fit the model on chip so we actually designed this entire different memory system for training called MX um that's in the paby scale so gpus are in the terabyte scale if you if you count a djx boox um our training systems memory systems are in the paby scale um and we can store basically a like trillion parameters directly in a single block of memory rather than spread it across all these different gpus so our claim to fame for training is that you can do training for like the largest models without using any distributed training framework like deep deep speed or or Megatron anything like that anyone who's looked at the code base for that stuff it is gnarly we literally counted the code 20,000 lines of code for Distributing training and only little bit of code for the actual ml so we basically get rid of all of that by storing the model in its entirety and doing data parallel only that was a very unique kind of a Innovation Mayo wanted a partner that really knew how to do large scale ML and we were a US company with a very specific technology and we just built a relationship with them and it's been great ever since we like basically we work side by side with their ml team their in-house data which they treat very carefully as you expect and 
we've been just BAS basically developing their next Generation AI systems that can work with their patient records and do rag and things like that so that's how that started I'm very looking forward for that effort and announcements or updates about that effort okay so that's working you guys are offering this assuming very expensive chip because that's why as far as I understand very expensive and like big companies how does the decision how does the conversion into inference happen and maybe Ian you want to take this one because I think James you said Ian and folks are in charge of the inference conversion because you talk about like how does the company that builds ships for training converts into how let's offer inference because inference is mostly for maybe developers like us like folks in the audience here maybe like smaller ticket items different approach different T would you talk about this yeah sure thanks thanks Alex great show I'm I'm happy to be here I'm Ian and I work in the software team and it's really exciting week to be at cerebrus on or launching this product but yeah as we said earlier the the actual chip itself was initially designed for for with training in mind right and and really the idea there is that instead of taking a huge wafer and cutting it a whole bunch of small Dy then putting them on boards and then putting it back together with networking equipment which is a topic from earlier in the show right just don't bother cutting those into chips and keep them all together and drive it across silicon now you're not driving power across iOS and connectors Etc and and that same thing Works uh in in in training as well or sorry not also in training but also in in inference and through we're a hardware company and I think if you've only got a training solution or if you've only got a uh inference solution you're not a a full kind of AI Hardware Pure Play company and so we certainly move into that from that perspective and join in I think training is maybe I'll say harder or at least bigger you can do inference with fewer systems for the the same model because you you've already figured out Which models which weights are effectively zero or sparse right and that's another piece of the architecture that's really quite interesting is that we are able to just do absolutely nothing we don't send zeros we just do nothing when it's a a sparse a sparse weight right so that that's something that that I think is very unique about our architecture and the underlying pees themselves are they them working more on like linked lists of actual values rather than on full populated dents data yeah so I think you you shared earlier that we were at you saw 1870 right I think we're we're quoting like 1850 in our marketing material the speed barries a little bit there but yeah in the 18 pluses for 8B for 70b we're at the in the 450 range and we're able to really produce tokens quite quickly with this uh product the if you've seen online any of the kind of the the material that we have where we talk about streaming weights on during training we stream weights on and then we receive gradients back from the chips then we uh produce the the update and then continue uh we're doing the same thing but in the opposite direction instead hold the weights on the chip and you're streaming tokens on and off the chip and then doing the dmed and then getting back to text characters back that you can send back across the wire right so we're essentially flipped it from what used to be a weight and was 
streamed is now a fixed on the device and now streaming it tokens in and out that's awesome thank thanks for the additional detail here first off we'll get some examples of like why this crazy speed is is needed I would love to to ask but some technical things about context length quantisation and other things could you Ian could you go into this context length definitely is is um very interesting but also I know for a fact in fact that you guys released a comparison analysis I think you are one of the authors on that block post a comparison between different providers and their quantized versions of Hosting specifically llama and how this affects different evaluations could you talk about this could you talk about Precision affecting and how how you Analyze This Could you talk about that effort yeah sure yeah we have a blog post out we also have third parties who have taken a look at the Quality metrics as well we we ran a bunch of standard evaluations on the metrics we are running 16bit accuracy and it we we are bleading in some metrics and and certainly comparable in the metrics across there and this is for uh um llama 3.1 8B and 70b right so we published results for mlu gpq math all those kind of things right uh all the human eval Etc and that that's available to see you can also see it on other people's thirdparty websites like artificial analysis they also have our scores corded there as well and it it matches the performance of the open source model from as intended we aren't cutting quarters and doing something like in four or something to really quantize and and get this performance it's full 16 bit Precision yeah I will say that we also ran an evaluation Thomas from our team and I just added a tweet from Lucas that also measured this performance in weave so there's a weave dashboard that folks can like click in and see our independent analysis as well that got to above 450 I believe for Lama 70b as well so in full speed 16 Precision but also I think you've tested difference in quantization on the different scores as well is that right is that what I'm understanding from this blog post the blog post we don't have different quantization now we're we're just focusing on our offering of this week oh I see so basically you're saying like we we put our model sorry we put the model in like fully full precision and we can stand behind what we put out there but you don't speculate based on what the different metrics in different providers could be attributed to for the different metrics right yeah no we don't have that speculation that's correct so James you mentioned 405b is coming could folks here get a little preview of um what else can we expect in terms of uh what other models maybe you're expecting to put in there are is there an expectation that you guys will put other models so just because you guys are new to the space to Thursday I we've been doing this for a while a lot of open- source fine tuners folks like news research who who release models like Herms for example which is a fine tune of llama and a bunch of other folks who fine tune open source which is one of the best things about open source you can take models like llama and do whatever with them are in this space and listening to this what's your what's going road map with now that you have this influence provide service first of all with llama going forward but also with other models going forward could you speak about like the future plans for the service yeah the future plans I think at this point are very much kind 
driven by feedback and customers. We launched these two models because it was very obvious that if you had to launch two, these are the two you would pick, but beyond those there are a lot of choices: there's llama 405B, there's Mistral Large, there are all kinds of fine-tuned models. We're keen on supporting a lot more and we want to deploy a lot more. Honestly, this was very much a startup-within-a-startup launch — it's literally an MVP launch — and it looks like we picked right and we didn't wait too long; there were some internal debates on whether we should wait for even more models or more performance, but it seems like this was the right amount. 405B is a high priority for us right now, getting longer context out of these models is a high priority, and we also have some engagements — we've heard from practically everyone in the ML world wanting to work with us and build faster API endpoints for their models, so we have to make some selections there as well. The product team has been on overdrive the last couple of days, but at a high level the priority is making sure we can get longer context out of the existing models and also supporting some flavor of a really large model, something like a 405B or a Mistral Large. Absolutely. Just by way of feedback from myself, hosting this show: we're talking about models like the 8B running on my Mac fairly usably, and the benefit of me running the 8B locally is that it runs completely offline — I don't need an online service to run an 8B, though it's incredible that you can run it at insane speed, and there's a bunch of use cases I'd love to go into and hear whether you're already seeing, as feedback, what people are using these incredible speeds for. But the benefit we talk about often on the show is that open-source, open-weight models like this can run completely offline, and then privacy: in medical use cases, for example, folks can run them completely on-prem without sending anything elsewhere. Huge models like 405B, though, none of us can run anywhere — and those who can, I know Exo Labs have something where they split them into chunks and run them across multiple Macs, but that runs at something like one token per second, slow. The huge benefit of places like yours, where you run them fast, is that those very big models now run at very usable speeds for startups, and they're open source, they're fine-tunable, etc. We'd definitely love to see bigger models — probably the upcoming llama 4, which I heard has already started training at some point — those are the models we'd be very excited about. For sure. People debate client versus server, local versus cloud, all the time, and the reality is the modern tech stack was built because you have both: it's great having a fast local device, but you also want this kind of unlimited resource in the cloud. We definitely want to support a three-figure model, and it will be absolutely amazing to watch something like that run basically instantly. We'll chat again for sure when that happens. 100%, and you guys are welcome to come back here as well. I wanted to keep going with a couple of super quick questions before we move on. Now that you've released this and talked with developers a little bit, what are
maybe the early signs of some of the potential applications this incredible speed offers, that weren't possible before, that you guys are already seeing? The most obvious one, and I think one you talk about extensively on the show, is anything with an agentic workflow. People are very impressed with what Cursor has put out, and the problem with agents right now is that you're running on GPUs designed for human reading speed — or rather, not really designed for human reading speed at all; I feel like we've all been tricked into accepting that the current speed is somehow normal, when it's just an indirect output of the GPU's memory system. There's nothing magical about the current 80 tokens per second we get on ChatGPT. The reality is this number can be anything you like if your hardware is built a certain way, and for us that reality should be closer to 800. So first, normal people just using it prefer 10x the speed; no one prefers text streaming by like an 80s RPG game. Agents absolutely need it, because if you don't go faster your agents are just waiting for things instead of actually doing stuff. And my favorite application, or research area, for this is really trading this speed for some quality. To me the current paradigm is basically that LLMs say every word they think, but they don't have time to think before they speak, because they have finite token limits, and the only way LLMs really think right now is by saying it out loud and consuming their own tokens. Under this raw paradigm, the faster you can generate tokens, the more steps you can think. So if you can do a thousand tokens per second, you can maybe hide 800 of those, premeditate on them, and spit out 200 that are very high quality — the answer you want — rather than forcing the user to consume your chain of thought. Imagine you asked me a question and I went, let's see, step one, let me think about this, step two... it would be a very silly thing to do; instead I think about it internally and give you the answer. All the industry speculation about the next version of ChatGPT seems to be that it will do more of this reflective, behind-the-scenes thinking, do multi-step logic, and then give you a higher-quality answer, and that would be great. The key enabling technology for that is really fast inference: you're moving compute from the training side to the inference side, and that gives you a completely different user experience and different kinds of products you can build. We're barely scratching this, because the hardware for fast inference is so new, but I would love for developers to find a way to convert the raw speed into higher-quality output (a minimal sketch of that hide-the-reasoning pattern follows below).
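To make the "hide most of the thinking" idea concrete, here is a minimal sketch of a two-pass pattern against an OpenAI-compatible chat endpoint. The base URL, model id, and the 800/200 token split are placeholders for illustration, not Cerebras's actual product or API:

```python
# Sketch of "think fast, answer short": spend most of the token budget on a hidden
# reasoning pass, then produce a short final answer conditioned on that reasoning.
# base_url and model are assumptions -- use whatever fast OpenAI-compatible endpoint you have.
from openai import OpenAI

client = OpenAI(base_url="https://fast-inference.example/v1", api_key="YOUR_KEY")
MODEL = "llama3.1-70b"  # hypothetical model id

def answer_with_hidden_reasoning(question: str) -> str:
    # Pass 1: private scratch work -- never shown to the user. Cheap when generation
    # runs at ~1,000 tokens/sec, painful at human-reading speed.
    scratch = client.chat.completions.create(
        model=MODEL,
        max_tokens=800,
        messages=[
            {"role": "system", "content": "Reason step by step. This is private scratch work."},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    # Pass 2: emit only a short, polished answer that uses the scratch work.
    final = client.chat.completions.create(
        model=MODEL,
        max_tokens=200,
        messages=[
            {"role": "system", "content": "Using the notes provided, give only the final answer, concisely."},
            {"role": "user", "content": f"Question: {question}\n\nNotes:\n{scratch}"},
        ],
    ).choices[0].message.content
    return final
```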
Yeah, I would love to add one thing before you guys go. One thing that definitely requires a lot of memory is context length. I know that one of the benefits of closed-source systems is that Anthropic gives me 200,000 tokens and Google gives me 2 million tokens, while when I host llama for myself on my Mac — and my 64 GB is not very common, 128 even less so in personal hardware — the context length taps out super quick. One of the benefits of your chips is probably memory as well, if that's something you can offer that other folks cannot, and that's a big problem with Transformers too. I don't know if you can speak to this, but if longer context is what's next, that's going to be killer — speed is one thing, but really long context would be the next thing for me specifically, and probably for other folks as well. Yeah, I think that's either the first or the second item we're being asked for: bigger models and longer context length. As James said, it was a decision we made early on to make this an MVP launch, so we went out with 8K context for the moment. Certainly in the coming weeks we're going to try to launch longer context lengths at ridiculous speeds as well. Incredible. And can folks try this now? Is it open? How can they apply? And then Nisten, I'll let you ask a question as well. Yeah, sure — they can go to the Cerebras inference website, and that will take them to the chat demo. There you can log in, get access to the console, which is the developer playground I guess you'd call it, get API keys and start playing with it. The API is OpenAI "standard," quote unquote, so if you already have applications running on another API provider, all you really need to do is change your API key, change the URL, and you're ready to go (a minimal sketch of that swap follows at the end of this segment). Yeah, and we've been inundated with demand, so there's a bit of a waitlist; if you want immediate access, just DM me, say hi and what you're building, and we can basically send you a key straight away — happy to do that. James, Ian, thank you so much for joining and giving us a glimpse into how the world's fastest inference on super fast wafer-scale chips is happening, really appreciate it. Folks in the audience, if you want to try out this very simple-to-use interface — because all you do is change the URL in your OpenAI-compatible code — definitely reach out; I think Ian or James said in the DM to mention that you heard it on ThursdAI and he'll put you through the waitlist. Thank you for joining, folks; give them a follow and definitely share feedback, because this is what they released it for: they need to know whether and how this API is working, what you're using it for, and what this extreme speed is being used for. Thanks so much, Alex, awesome show. Yeah, thank you so much for coming; feel free to stick around if you want — we're moving on to some other things we need to cover. Thanks Ian, thanks James.
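Since Ian's pitch is "just change the key and the URL," here is a minimal sketch of what that swap looks like with the OpenAI Python client. The endpoint URL and model id below are placeholders, not verified values — use whatever the provider's console actually gives you:

```python
# Pointing an existing OpenAI-client app at a different OpenAI-compatible provider.
# Only the api_key and base_url change; the rest of the application code stays the same.
from openai import OpenAI

client = OpenAI(
    api_key="PROVIDER_API_KEY",                    # key from the provider's console (assumption)
    base_url="https://api.provider.example/v1",    # assumed endpoint, check the provider's docs
)

resp = client.chat.completions.create(
    model="llama3.1-8b",                            # assumed model id
    messages=[{"role": "user", "content": "Hello at 1,700 tokens per second?"}],
)
print(resp.choices[0].message.content)
```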
All right folks, other big-company news we have to cover super quick. From Google we have a new Gemini; they keep updating different experimental versions. If you remember, there's the standard Gemini 1.5 Pro and 1.5 Flash; 1.5 Pro got an experimental upgrade at the beginning of August, and now we have a new experimental version called 0827. The previous one, if you remember, shot up to the top of the LMSYS Arena and for a week became the number-one LLM in the world, but then the latest ChatGPT-4o came along and dethroned it, and now we're getting a new and upgraded 1.5 Pro that is the number-two LLM in the world: number one in math, number one in longer queries, and number two in everything else. 0827 is pretty good — I'm going to use it today for summarizing this show, and I used it yesterday for some coding. I don't know if folks on stage have had a chance to use 0827 yet. The thing with Google is that they're going to deprecate the previous one in a couple of days — I don't know if you used the 0801, I really liked it, so if you've gotten used to it, hopefully the new one works for you; it does do a very good job on coding as well. We look at the LMSYS Arena, but we also look at other things. One thing I look at — I saw Paul Gauthier here before, from Aider chat — Aider has a different benchmark, a leaderboard for code editing, and on that kind of coding this new model beats 405B at code editing specifically. There's also BigCodeBench, from Terry, who we've talked with, and on that benchmark it gets very close to GPT-4o as well. So definitely a good job from the Gemini folks. The additional news from them is that the new Flash experimental shot up from number 23 on LMSYS to number six — definitely some improvements, although not across the board; it looks like on some coding it's not that great. Folks on stage, have you had a chance to play around with the new Geminis, and do you have any feedback on them? Nisten, I know you used the previous one for coding — have you used this new one? And Wolfram? Yeah, I put a good day or two of use, like ten hours of practical use, on the 0827, and it does feel better; it refuses less than the older one — and I only looked at the benchmarks after using it. The thing about Google is that on the engineering side they are first: you can just copy-paste in any book you have, dump in 300,000 tokens, and it just takes it and starts chugging through it. The other thing it's really good at: I do a medical talk with doctors every Monday and then I have to summarize it into a really nice AI summary, and I set up this entire semi-agentic workflow — just a piece of code — that divides it into different sections, finds each person's debate points, and then brings them all together in the kind of super nice summary people expect. Gemini is the first model where I could just grab the raw audio, one prompt, and be done; it's amazing for that kind of workflow, and specifically this 0827. I also tried it with Claude and it was about the same, really, and even llama 405B was a tiny bit worse. So in that regard it's good. What it sucks at is code — there are better models for code, at least in my opinion; I wasn't seeing it do any better than DeepSeek or Mistral Large, and I still have to rely on GPT-4o or Sonnet 3.5 to actually fix the issue. The other thing I tested, because they now say you can set the output context length higher than 8K: I wasn't able to actually get it to output more than 8K in the interface, and after about 5,000 tokens it started really slowing down. It's on another level, a whole other tier, when it comes to the engineering of the product and how it handles daily workflow, because you can just throw stuff at it. But as we go toward the really long context window, two million is not two million, and it's extremely annoying not to know what it knows by heart and what it doesn't — maybe it has some hybrid graph or RAG system in the background — it's extremely frustrating not to know. With Mistral it's 128k: you dump in the whole code base, and when the code base is 90k
it can actually write out 20K of code and it will do the job — it will be a little bit slow, but it will get the job done, and I can rely on it, I can trust it for the job. I cannot trust Gemini for the job, because I don't know if it's trying to look things up from earlier or doing a RAG call, and maybe that RAG call only brought back one paragraph of the code from the past. I can't trust it; I need to re-dump everything. Whereas even Claude, which really slows down on the API once you go over 100 to 150,000 tokens, you know has all of that shoved in its brain and isn't cheating and looking things up. So this is just me — I get very frustrated in daily use, mainly for coding, when that is the case. That's my review after ten hours of use. Yeah, thanks Nisten for the review. The additional thing they released is a new Gemini 1.5 Flash 8B. They've been testing this for a while, and now they have a probably-distilled 8-billion-parameter version of Gemini 1.5 Flash. They released it for testing — you can try it in the AI Studio UI. What can we say about it: it's not a Gemma model, so it's not open weights or anything, they are hosting it, and I don't know why they decided to disclose its size — why would we care if it's only available via API — but fine, folks can test it and tell us what they think. The interesting thing I found is that LMSYS Arena says it's coming close to llama 70B — I think llama 3, not 3.1, or just llama 70B, I'm not sure — but that's basically all I found about it. The other thing from Google is that they're adding Gems. Gems are like GPTs: you're able to preload Gemini with context and different things, and they've been talking about this for a while. This will be only for folks who pay for their premium service — and if you don't pay for OpenAI yet, then at this point, if you don't have access to Advanced Voice Mode, I would say go for Claude
and I will in a second tell you why. They're also adding Imagen 3 to their paid tier. Those are the news from Google. Honestly, the one thing I'll tell you I'm noticing is that they're giving an incredible amount of tokens away for free, especially via AI Studio, so you don't even have to use their paid products — I'm sorry for the folks from Google listening who need people to pay for the products, but if you go to AI Studio they're giving away billions of tokens for free every day. It's less of a chatty interface, but you're literally getting all their best models for free. Yes, they're not connected to the internet, so they can't look things up for you, but you're getting the same best models for free, even the experimental ones, so you can plug them into whatever chat interfaces you want (a minimal sketch of calling them from code is below) and get the same experience — you just don't get Gems.
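For anyone who wants to plug those free AI Studio models into their own tooling, here is a minimal sketch using the google-generativeai Python package. The experimental model id is the one discussed on the show, but ids and availability change often, so treat it as an assumption and check the model list in AI Studio:

```python
# Minimal sketch: call a Gemini model with a free AI Studio API key.
# The model id below is assumed (the 0827 experimental discussed on the show);
# swap in whatever id AI Studio currently lists.
import google.generativeai as genai

genai.configure(api_key="AI_STUDIO_API_KEY")  # free key from AI Studio

model = genai.GenerativeModel("gemini-1.5-pro-exp-0827")  # assumed experimental id
resp = model.generate_content("Summarize this two-hour show transcript into short show notes: ...")
print(resp.text)
```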
Let's move to Anthropic. One thing Anthropic started doing is publishing their system prompts — shout out to them. We know some folks, like Pliny the Liberator, go after system prompts and "liberate" them, and Anthropic is saying: we're actually going to tell you our system prompts from now on, which I think is a great thing. They released the system prompt for Claude 3.5 — they told us this in a tweet, I think — and I applaud it; all companies should do this. And now they'll publish it in the release notes, which is incredible; more companies should do this, including OpenAI. Speaking of OpenAI, I don't have a lot of updates besides the project Strawberry thing, and we now have a tweet from Sam Altman — don't get too excited folks, nothing Strawberry related. He said: "We're happy to have reached an agreement with the US AI Safety Institute for pre-release testing of our future models. For many reasons, we think it's important that this happens at the national level. The US needs to continue to lead." That's a tweet from Sam Altman from ten seconds ago, probably related to the SB 1047 AI safety bill — the US AI Safety Institute is now talking with OpenAI. Good, but nothing too important to dig into there, so moving on. Back to Anthropic: releasing system prompts is great, more openness, and we can learn a bunch about prompt design from their system prompt — it's really incredible. The other thing Anthropic did is make Artifacts available to all users, not only paying users. One of the reasons you would pay for Anthropic was Artifacts, and now you can just use them. If you haven't used Artifacts because you haven't paid Anthropic yet, give it a try. Artifacts is basically this: you ask Claude to make things for you, and Claude opens a side panel and starts writing code — frontend code, like React or whatever — and it not only writes the code, it actually executes it in a secure browser environment and shows you the result. If it's browser code, it builds you an app and actually runs the app, and you can play games — you can say, hey, build a game, and it runs the game alongside you — and you can share the artifact with other people. An artifact can include multiple files as well: you can say, hey, build me this asset, build me that asset, and then use all these assets together in this game, and then you can send that artifact to someone else. If you haven't seen videos of people just building apps this way, it's incredible. Not only that, they've now added this to their mobile apps, so there's now an artifacts panel on the mobile app — we're on a recorded show, but fine — people are now building apps on their toilets, basically. The whole thing is that you ask Claude, hey, build me an app that does this, and it just spits out an app and shows it to you in real time; you can play little mini games and so on. It's really worth just going there, asking for an app, getting whatever you want, and asking it to fix things. If you haven't done this, try it. Adam, have you built an app with Claude Artifacts? What did you build, and how cool is this? Yeah, I've built quite a few little applications with it, and it's really impressive. Even looking back maybe a year ago, we were using MetaGPT to build a snake game and it would always hit errors, it would always have problems, and it's pretty cool that now, in a single prompt — or maybe two or three — you can get an entire snake game built. Recently I built one that was a form for an event we were hosting; it probably would have taken a few hours of engineering time, and I literally did it from my mobile phone. It was crazy to see. I'm really bullish on it. I think there are going to be a lot of applications going forward where you can automate the entire end-to-end building process using Claude plus Cursor. One guy who's really great to follow in this space is McKay Wrigley; he's been showing awesome examples of how to get started very quickly with this type of stuff. But I will say we're still very early — using a baseball analogy, I think we're still in spring training. Right now you're seeing very simple applications being built, but as time goes on you're going to see non-technical folks building full end-to-end applications, and this is really going to accelerate software development to a different level, so we're excited to keep following the progress. Yeah, absolutely, thanks Adam. The thing they showed in their blog release is that designers build visualizations for quick prototyping — people just upload their designs and say, hey Claude, build this for me, which reminds me of v0 from Vercel, which just released a chat interface this week as well — and marketers build campaign dashboards from metrics, just uploading CSVs and saying, hey, build a dashboard for this. I actually did this: there was a tweet from someone this week, or maybe last week, talking about something campaign related, something about pricing and operating loss, citing a bunch of metrics, and I just took the whole tweet and gave it to Claude: hey, build me a dashboard, visualize this. And Claude built an interactive dashboard with all of those things. It's quite incredible, and it's immediate and interactive. I could have built this — I used to be a frontend developer, I have the skills, I have Cursor — I could have spent three hours importing dashboard components or whatever, but Claude just built it, and now designers can just import things too. So if I haven't gotten you excited about Artifacts yet, go play with Artifacts and then tell me if you're not excited. It's really dope. Shout out to the folks at
Anthropic who decided to give us all this power. I think that's about it on Anthropic Artifacts — anybody else played with Artifacts and wants to tell us what they built? No? All right — go ahead. I turned them off, because I couldn't copy-paste the whole code; they would hide the code. Yeah, you can turn them off as well. LDJ, go ahead. Yeah, I actually found it pretty useful. I ended up making — mostly Claude doing it, I helped guide it to fix certain bugs in the code — an artifact where I can pretty much put in the parameter count and dataset size for a training run I might want to do, plus specifically which GPU type I want to use and what training length, and then it tells me the estimated number of GPUs I'd have to rent for that run. LDJ, I think you're highlighting maybe the best use case for this whole thing: mini apps, specific to the thing you're doing right now (for the curious, a rough sketch of that kind of GPU estimate follows below). I think this is the era we're entering — literally building things for "I need this thing right now, for this purpose" — and they're shareable as well. Compare it with GPT-4: if you remember, there was the advanced data analysis mode they used to run code in, but there was no visualization — it would run the code and then show me something — whereas now we're at the point where we can actually build apps and have them run, so that's super cool.
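LDJ's GPU-estimate artifact boils down to a back-of-the-envelope that's easy to reproduce. Here is a rough sketch using the common ~6 × parameters × tokens FLOPs rule for dense transformer training; the peak-throughput figures and the 40% utilization number are assumptions for illustration, not measurements of any particular cluster:

```python
# Back-of-the-envelope GPU count for a dense-transformer training run.
# Uses the common ~6 * params * tokens FLOPs estimate; throughput and MFU are rough assumptions.
PEAK_BF16_FLOPS = {     # advertised dense peak, order-of-magnitude only
    "A100": 312e12,
    "H100": 989e12,
}

def estimate_gpus(params: float, tokens: float, gpu: str = "H100",
                  days: float = 30.0, mfu: float = 0.4) -> float:
    total_flops = 6.0 * params * tokens                   # forward + backward, rough rule
    effective_flops_per_gpu = PEAK_BF16_FLOPS[gpu] * mfu  # assume ~40% utilization
    seconds = days * 86_400
    return total_flops / (effective_flops_per_gpu * seconds)

# Example: 8B params trained on 1T tokens over 30 days on H100s at 40% MFU -> roughly 47 GPUs.
print(round(estimate_gpus(8e9, 1e12)))
```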
I think we've covered the big companies, LLMs and APIs enough, and now we're moving to AI art and diffusion — diffusion models, text-to-video, images, and everything in between, only on ThursdAI. We have, I think, two updates in AI art and diffusion as we come up to the end of the space. I have Jonathan here — Jonathan, you guys have an announcement — and we also have Simo. The one thing I wanted to talk about before we get to your announcement, Jonathan, is GameNGen. Folks from DeepMind released research and some videos where they trained a version of Stable Diffusion 1.4 on a bunch of frames of Doom, and to the old question "can it run Doom?" the answer is now: yes, it can generate Doom. They trained Stable Diffusion 1.4 on a bunch of images from Doom, and now, in almost real time, at around 20 frames per second, they are generating the Doom game. I think just talking about it isn't enough — I need to go find the video, maybe Nisten can help me out. It's called GameNGen and it looks incredible. They say the human raters they tested were only slightly better than random at distinguishing short clips of this from actual game footage. I still can't believe what I'm seeing. At first, when I saw clips of this, I thought, what's the point, they're showing me Doom videos, I've seen Doom videos before — but no, this is a Stable Diffusion fine-tune, and they generate every frame of Doom with Stable Diffusion, as though you were playing the game. I think that's absolutely mind-blowing. I don't think they've released anything, but the monsters it generates when you shoot are absolutely incredible. I just wanted to bring this to people's attention, because I think this is basically where we're going. Jensen had a statement that every pixel you see will be generated and not rendered — I think that's the main statement from Jensen we're waiting on — and this is the first glimpse. And this is specifically Stable Diffusion 1.4, the first open-source version that dropped; since then we've advanced quite a bit in the ability to generate very high-quality imagery, and I'm excited to see how much higher quality we can go. The thing I don't know is whether a person drives this or whether it's purely generation. LDJ, I don't know if you had a chance to look into this and see whether it's a person playing or just a generated video. Yeah, I'm not sure either, I didn't look that deep into it, but I did notice people mentioning — and I saw it in the video — that it has some inconsistencies, for example where the player turns around, does a 360, and a barrel he already shot has spawned back in. Yeah, like it shouldn't be there. So it's not perfect, but it's good enough to fool most people watching for, say, five to fifteen seconds. Yep. I'm assuming we're not going to get an actual playable Doom out of this, where a designer designs the level and a human made sure that after you enter this door there's always a monster that's really hard to fight — I'm assuming we're nowhere near that — but I still think it's remarkable that it just looks so good. It looks so good, folks: this is a generated game engine and it looks like Doom. And the comment I saw was: if it doesn't have the sounds, is it really the game? I have Jonathan and Simo here. Jonathan, you guys also have a release announcement today. You're from FAL — I've talked about FAL multiple times on previous shows for different reasons, but Flux is my latest excitement, and I've been playing with it and turning my face into different memes, and a lot of that is because of some of the stuff you guys are baking. So feel free to introduce yourself, introduce Simo as well, and then tell us what you're releasing today. Okay, cool. Well, Alex, thanks so much for having me, I'm glad to finally make it here. Yeah, we have a big announcement. We've been doing Flux training for a few weeks now, and like you I've been having the best time ever just spreading my face across the web. We recently brought Simo on, and he came in, looked at our trainer, and said we need to accelerate. We were training in about 20 to 40 minutes depending on step count, and now we have a new trainer that trains in around two to five minutes — a huge improvement in speed, but also in quality; the quality is better. So we're super excited to finally release this publicly. Definitely go try it. We also have a coupon for the people who are here, if you want to try it out cheaper — where should I post that, in the comments? Yeah, you can go to the bottom right and comment with the coupon so folks can see it, or just say it out loud so folks who are listening can get it — that's also fine. I was going to hit reply here, but it's: just go to fal.ai
and then add the query parameter — question mark, coupon equals ThursdayAI — and that will apply the coupon, so you'll get a discount. Definitely go check it out. It's so fast it's crazy, and really fantastic quality, so all the props go to Simo — he did a great job there, and he's only been working with us for under a week; we've got to deploy fast. We're super excited about this; we think it might open up some doors in terms of applications now that it can go so much faster. Just so I understand: it's around five minutes or so to put my face into Flux? It's going to be under five minutes — I can't remember what the defaults are, I think they're a little under that — and we'll see where we get to next week. Just for folks — I think you know this because I've been talking about it kind of nonstop — the whole benefit of Flux, and of open source generally with these models, is that the whole ecosystem applies to them, and I've really been enjoying watching everything align itself on top of Flux especially, and seeing you guys give us the tools to use it. So thanks to you — and this is not sponsored; the only thing I get out of this is that all of you listening now get a bit of a coupon for FAL to actually use and maybe train your own LoRA. So thank you Jonathan, thank you Simo, and thanks Batuhan, who was here earlier and had to jump to a meeting, I think. Folks listening: you get a little extra to also use FAL and do the same things I did — train your own LoRA. It doesn't have to be your face, but if it is your face, put it in a meme, it's really fun — it could be your cat, it could be whatever you want — just try training a LoRA, it takes five minutes, upload a few images, and then you can take another LoRA and combine them. It's really fun, and yeah, this was really dope. And shout out to Simo — Simo, if you want to add how you did this, if you want— No, do not add how you did it. Oh yeah, right, you shouldn't, okay. But if at some point you guys want to release anything and you want to come back then, you're more than welcome. I think we have another breaking news item. LDJ, you want to go? Yeah, sure, so right now— Wait, before that, we have the button coming at you, only on ThursdAI. All right, this is crazy, I just saw this — it literally just came out. Yep, so there's a company called Magic — magic.dev — that Nat Friedman and, I think, some other big names invested in; this company, magic.dev —
it was at least a few months ago, maybe up to a year ago, that Nat Friedman was saying somewhat cryptic stuff about how it was blowing certain benchmarks he'd seen out of the water and how he immediately wanted to invest in them. And they just announced some updated research that they're publicizing, about the 100-million-token context models they're working on, and the first official one that I guess they're putting into production or something. It just came out, so it's really fresh, but yeah, I guess we could put it up on the billboard or something. It's crazy. And I'll also pin the tweet of Nat Friedman mentioning how he was amazed by their initial model and the 100-million-token count. We were just talking before about whether or not the 2-million-token context is even effective, and now this. Did they release anything, though, or is it just an announcement alongside the funding again? Because I keep waiting for them to give us anything — they talk about long-term memory models, and long-term memory is something we keep waiting for, but I don't see any releases. Yeah, so they describe some evals — I'm not sure how standardized they are, or if they're completely new evals they had to make; I'm scrolling through it now, and as far as I can see it's a somewhat long blog post, so we're reading through it as we speak — but it looks like they are showing some results. Okay, we will definitely do our research and get back to you on this, but I think it's super exciting. It looks like there's also a partnership with Google Cloud to build a cluster together, and yeah, this is crazy — 100 million tokens of long-term memory with Magic — and it looks like a bunch of very serious folks have invested in them. Eric Steinberger, the CEO, talked with Nat Friedman — oh sorry, with Daniel Gross — and there was a video of that conversation where they talked about what long context means. All right folks, it's been two hours and five minutes, we've had a lot of stuff and we've covered pretty much everything successfully. As always, thank you so much for joining and listening. Thank you, as always, to our co-hosts: Nisten, Yam, LDJ was here, Ryan earlier. Thank you to our guests Jonathan and Simo; we had James and Ian; we had Interstellar Ninja from Nous Research, and Sharon as well; and Wolfram was here earlier too. Thank you to everybody who listens to ThursdAI from week to week, who comes here for the news and the updates, and everybody who comments — we had a lot of comments today, and I'm sorry I didn't have a chance to look through them, because I was trying to listen and wrangle the news at the same time, but I really appreciate you listening in, sharing, and commenting, and we'll see you here next week. Also, if you missed any part of this, the show is available as a podcast and as a newsletter as well. With that, I'm going to go and actually prepare the podcast and the newsletter, and we'll see you here next week. Cheers, everyone. [Music]
