Cerebras Inference: The world's fastest LLM inference

Published: Aug 27, 2024 Duration: 00:05:09 Category: Education

Hello guys, welcome back to another session. Cerebras Inference is the world's fastest LLM inference engine: it can generate tokens about 20x faster than GPUs at a fraction of the cost. Let's take an example with a prompt asking the model to go on a long rant about jokes. As you can see, the response is generated very quickly, at 1,828 tokens per second. Let's also try Llama 3.1 70B: it reaches 446 tokens per second, which is ridiculously fast. They also have an API offering, which is still behind a waitlist.

So let's see what they have built and why it is the fastest. They have built the third-generation Wafer Scale Engine, which runs inference at 1,800 tokens per second on Llama 3.1 8B, 2.4x faster than Groq. They claim it is 20x faster than GPU solutions at 1/5 of the price: as you can see here, Llama 3.1 runs at 1,800 tokens per second on Cerebras compared to 750 tokens per second on Groq, and on the 70-billion-parameter model it is 20x faster than hyperscale clouds.

So why is inference slow? Because it is a sequential process: each word must be generated before the next one can begin. For example, given the string "the quick brown", the model has to generate the word "fox" by going through all the layers; once that is done, we append it to the string, feed it back to the LLM, and it comes up with the next word. The reason for slow generation lies in this sequential nature of LLMs and the vast amount of memory bandwidth they require. Each word that is generated must be processed through the entire model, and all of its parameters must be moved from memory to the compute units. Generating one word takes one pass, so 100 words require 100 passes, and since each word depends on the prior word, this cannot be run in parallel. Here is the memory, and here is the compute that produces the word: the entire model must travel across the wire to generate one token, with all the layers loaded and computed layer by layer to produce the word at the end.

A 70-billion-parameter model is 140 GB of data, so to generate 1,000 tokens per second you need 140 TB/s of memory bandwidth. For example, take the popular model Llama 3.1 70B: it has 70 billion parameters, and in 16-bit precision each one requires 2 bytes of storage, which means 140 GB of memory in total. For the model to output one token, every parameter must be sent from memory to the compute cores to perform the forward-pass inference calculation. Since GPUs have only about 200 MB of on-chip memory, the model cannot be stored on chip and must be streamed in its entirety to generate every output token. Generating one token therefore means moving 140 GB from memory to compute, so generating 10 tokens per second takes 10 × 140 GB, which is 1.4 TB/s of memory bandwidth. An H100 has 3.3 TB/s of memory bandwidth, which is sufficient for slow inference, but instantaneous inference at 1,000 tokens per second would require 140 TB/s.
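To make the sequential nature concrete, here is a minimal sketch of an autoregressive decoding loop. The names `model_forward`, `prompt_tokens`, and `max_new_tokens` are hypothetical stand-ins for a real model's forward pass and tokenized inputs, not anything from the video.

```python
# Toy autoregressive decoding loop: every new token requires a full forward pass
# over all model parameters, and each step depends on the previous step's output.
def generate(model_forward, prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # One full pass through every layer (and every parameter) of the model
        next_token = model_forward(tokens)
        tokens.append(next_token)  # "the quick brown" -> "the quick brown fox" -> ...
    return tokens
```

Because each iteration needs the token produced by the previous one, the passes cannot be run in parallel; throughput is bounded by how fast the weights can be streamed to the compute for each pass.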
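A quick back-of-the-envelope check of the numbers quoted above, assuming 16-bit (2-byte) weights and that the full model is read from memory once per token:

```python
# Memory-bandwidth arithmetic for Llama 3.1 70B in 16-bit precision.
params = 70e9                       # 70 billion parameters
bytes_per_param = 2                 # 16-bit weights = 2 bytes each
model_bytes = params * bytes_per_param
print(f"Model size: {model_bytes / 1e9:.0f} GB")            # ~140 GB

for tokens_per_sec in (10, 1000):
    bandwidth = model_bytes * tokens_per_sec                 # bytes moved per second
    print(f"{tokens_per_sec} tok/s -> {bandwidth / 1e12:.1f} TB/s of memory bandwidth")

# An H100's ~3.3 TB/s covers roughly 10 tok/s for this model, far short of the
# ~140 TB/s needed for 1,000 tok/s.
```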
So how does Cerebras lift the memory bandwidth barrier? By storing the entire model in on-chip SRAM, which increases memory bandwidth by about 7,000x. In the system they have developed, SRAM is combined with the compute cores on one chip so that the whole model fits on that chip. Cerebras solves the memory bandwidth bottleneck by building the largest chip in the world and storing the entire model on chip: with their unique wafer-scale design, they are able to integrate 44 GB of SRAM on a single chip. The Wafer Scale Engine has 21 petabytes per second of aggregate memory bandwidth, which is about 7,000x that of an H100. Large models easily scale across multiple Wafer Scale Engines; with four of them you get 176 GB of memory. They are using 16-bit model weights for the highest accuracy. They also have an OpenAI-compatible API: you just pass in the base URL, an API key, and the model name. Feel free to try out the inference yourself. Thank you, guys, bye.
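Since the API is described as OpenAI-compatible, a minimal sketch with the `openai` Python client would look like the following. The base URL and model identifier shown are illustrative assumptions, not values confirmed in the video; check the Cerebras documentation for the actual endpoint and model names.

```python
# Minimal sketch of calling an OpenAI-compatible endpoint with the openai client.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint; verify in the Cerebras docs
    api_key="YOUR_CEREBRAS_API_KEY",          # key issued once you clear the waitlist
)

response = client.chat.completions.create(
    model="llama3.1-70b",                     # assumed model identifier
    messages=[{"role": "user", "content": "Go on a long rant about jokes."}],
)

print(response.choices[0].message.content)
```

The only changes from a standard OpenAI call are the `base_url` and the model name, which is what makes existing OpenAI-based code easy to point at this endpoint.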
