Transcript of Cerebras Co-Founder Deconstructs Blackwell GPU Delay

Introduction

Hey everyone, I'm James Wang, director of product marketing at Cerebras. Today I'm joined by JP Fricker, who is the chief system architect and co-founder at Cerebras. We read in the news that Nvidia's Blackwell is delayed, and it has something to do with these intricate interposers that they need to stitch together GPUs and memory. I thought I'd talk to JP, since he designed our whole system architecture, to walk us through what might have happened and how we do things differently. JP, this interposer issue: are you surprised at all?

No, I'm not surprised. I was actually quite expecting it. It's a very tough problem to have those large processors on very large interposers. But maybe we should look at it a little bit closer to understand how an interposer is made.

How an interposer is made

Here we have the floor plan of a package of a GPU. In this case you can see there are two logic dies that are the core of the processor, surrounded by multiple memory dies. This is a typical package for such large processors.

But previously it was only one GPU. Blackwell is the first to have multiple logic dies?

Correct. So previously, imagine you have those memory devices much closer to the main die, and the package was much smaller, almost half the size.

Oh wow, okay.

What is happening when you build this is that, in fact, you need to think about it in the third dimension, or a cross-section view of it. At the system level you have printed circuit boards within your servers. On those printed circuit boards you need to put some processors, and what people do is place the various silicon dies that are manufactured on wafers onto package substrates. So here is one example of such a substrate. This is the interposer. When you have multiple dies, the substrate becomes named an interposer, mostly because you have something that's in between a die and a printed circuit board. You start to have a substrate, and
sometimes, to create the connectivity between these dies, you use an interposer, something that is in between. Sometimes they put a silicon interposer on a package and then on the printed circuit board; sometimes it's directly stacked, as it is shown here. So let's simplify it and just say this is the silicon interposer, and you have multiple dies that you need to interconnect with each other.

So, connectivity between the HBM and the die. Now, what is different with the Blackwell generation for this layer versus previously with H100?

This is different: in the middle, you now need to interconnect these two logic dies with a very high number of interconnections between them.

So previously only logic to memory, now both logic and logic.

Yeah. So you need a very high number of interconnections, and because you now need a larger substrate, and those substrates are quite expensive, people decided: why don't we use silicon just for what is super high density? Instead of using the entire interposer as a piece of silicon, why don't we use these areas as silicon devices? Silicon devices are nice because you can instantiate wires on them that are very, very small.

Okay.

Now the substrate becomes a combination of an organic material, layers of fiberglass stacked with various organic materials, and these bridges.

That's the silicon bit. That's the silicon bridge. Okay, so we have two ingredients now. That sounds all fine. What would be the problem?

Well, you need to align these very, very carefully so that they match the pins that are on the bottom of these dies and on the bottom of these memories. So you can imagine that when you want to align these bridges, you need to either start by aligning these bridges on the substrate and then put the dies on top. That's one technique. I believe that's not the technique that's used here. They might be doing it backwards, where you actually put the die somewhere and you assemble the bridges on top of the first die.
Okay, but everything can move as you process this, right?

So just in manufacturing, placing it, maintaining that in the right place, and making sure that nothing moves as you assemble this together is very tricky. And while it's maybe manageable within this dimension, as soon as you get bigger, first you have more pieces to put together, but you also have longer distances over which you need to maintain the same precision.

What kind of scale are we talking about, that we have to align these components?

So typically the pitch here is 10 to 50 microns.

Wow.

That's the pitch for those interconnections; that's a typical scale for those. The HBMs are now getting smaller and smaller at each generation, and that becomes a nightmare of alignment.

Okay, I hear alignment is part of the issue. I also hear thermal expansion differences are part of the issue. What's going on there?

So these components here are made of silicon. This is made of an organic material. These bridges are also made of silicon. This one is made of printed circuit board. They have a coefficient of thermal expansion on the order of 10 parts per million per degree C, or kelvin. Silicon itself has a CTE on the order of 2.6 ppm per degree C or kelvin. So you have a difference in thermal expansion. When initially they made this package structure on a large piece of silicon interposer, the CTEs, the coefficients of thermal expansion, of the logic die, the memory dies, and the silicon interposer were matched.

Nice. So that's on H100?

On H100.

Okay.

Then you go with a more advanced technique where you use bridges. In that case you mix a silicon interposer with another type of material. It's a little harder to mitigate those differences of thermal expansion, either when you assemble or even after assembly, because even after assembly this starts to expand and contract at different rates based on temperature. So sometimes it
starts to bow, and it bows differently.

Bow? It's like, it curves?

It curves.

Oh my God, that doesn't sound good.

So when this starts to curve, you can imagine all the tiny contacts that were supposed to tie these two together start to break.

Wow.

And it's very hard to maintain that. A lot of testing goes into it, and it takes a long time to verify that it works well. Maybe here they didn't do enough testing, or maybe they were surprised by some scalability issues, especially in production. You can make prototypes maybe very easily, but when you scale in volume, the manufacturing process has to be adjusted for volume manufacturing, and it becomes harder.

Design decision

That seems to be the problem: at volume these things start to warp and break, and this technique of assembling multiple different ingredients, different nodes, into a single package seems to be a nightmare.

Yeah. So there are two aspects, right? There is the material difference that contributes to the problem, and there is the size.

Right, because it's bigger. This is the first time it's gotten significantly bigger, since it's logic to logic now. When you architected the Cerebras system, you considered all of this and ended up with a very different design decision. Can you walk us through that?

Yeah. When we started the company, we thought that we would look at the problem holistically, and we realized quickly that we needed more logic and more memory close to logic, and so we wanted a die that was bigger and bigger. We also looked at this technique of reassembling chips, but very quickly we realized that the problems we would incur with one or two would be far worse when we go even bigger.

I see. So even if you solved it for two chips, once you go to 48 you're very limited very quickly. It becomes exponentially...

Actually, it scales with the size that you want to build, right? And we thought that it would probably not be achievable, even in the time of a couple of
years to optimize the process to achieve this.

I see. That was around, what, 2017?

2016. We started the company, and I spent about half a year looking at various techniques, and then about a quarter further studying yield models to understand how we could yield this.

Okay.

And very quickly we realized that the combination of dimensions, various materials, various partners for different components, and very many steps in the process would probably not yield correctly.

Okay.

There were too many steps, too many parts, too many partners to combine together. So we decided to simplify it and see if there was another way to do it.

Okay, walk us through it.

So what we wanted was a large piece of silicon. A wafer is about 300 mm in diameter; this diameter is the standard wafer size. On that wafer people usually build a step-and-repeat pattern of reticles. Nvidia does the same; everyone does pretty much the same. But then what people typically do is test those one by one, and once they've found out which ones are good and which ones are bad, they mark the bad ones. Then they cut the wafer, they dice it, and they keep only the good dies and discard the bad ones. Then they package these onto a substrate.

Which becomes one of those.

Exactly. And when you want more silicon, what Nvidia did is say, "Well, I want two of those on one piece of substrate." We asked instead: rather than cutting that, what if I could afford to have defects? No one can build a wafer with zero defects, so imagine you have many defects everywhere. Everyone designs chips with defects in mind: SRAM, for example, can be designed with a little bit of redundancy so that you can cope with defects. But if your entire logic design is capable of dealing with defects, then you could actually yield an entire wafer. And if you have the entire wafer at your disposal, why not, in one area of a reticle, create a core that has both logic and memory very close to each other? This solves the memory bandwidth wall that has plagued the whole microprocessor industry for decades. These two pieces being so close to each other allows you to have very, very short wires, much shorter than what you have here, with a very small amount of capacitance and therefore very low energy requirements to communicate. You also distribute the memory over a larger surface, which allows you to better serve the various cores within a given reticle or large piece of silicon. Here, by contrast, a core that's in the middle needs to traverse all the way to the edge to get some information, and all the way back.

I see.

Right, so there is a very long latency to get access to the memory, while here you don't have that latency.

You have both bandwidth and super low latency.

Yes.

Effectively, in pretty much every other architecture the memory and the logic are separate. We've combined them into a single piece because we're able to manufacture it.

Correct. We also made this core quite small. The geometry allows us to place about a thousand of those cores in each dimension, the X and Y dimensions of our wafer, which allows us to get a million cores. If among that million cores we have a few that don't work, it doesn't have a big impact. If you have a few cores that are gone in a GPU, it's okay, but if you have hundreds of cores that are gone... And because the cores here are bigger than our core, which is relatively smaller, if you have failures, this core needs to cope with many failures that for us would just be one of the cores being defective. If a core is defective, we can logically remove it from the array and abstract that away from the upper-layer software, so software can see this entire wafer as being a perfect wafer.

That's great. Let's get back to this. How does this architecture avoid some of the complexities and pitfalls with the packaging problems?

So first, connectivity. Because you have these small cores, you can now actually have many of them
interconnected on one reticle very closely, but also across reticles. And the lithographic process that you can use to manufacture the wires across reticles is the same process that's used for wires within a given die.

The same process, compared to the interposer?

So here you don't need a bridge, right? Here you needed a bridge to go from one die to the next. Here, to go from this die to this one, because they're not cut, I can use the exact same material.

A native connection.

It's a native connection, and it is built by optical exposure as opposed to a physical placement of a part that requires realignment. Here I don't need to realign.

I see, I see. This is one piece, and that's like plugging something into the wall.

Correct.

Okay, okay.

Now I have a huge amount of bandwidth between reticles, equivalent to the bandwidth that I have within one reticle, across the various pieces. That's one aspect. Another aspect is packaging: it's much simpler. The number of parts that I need to use to package it is reduced. I don't have that many parts, and I don't have that many assembly steps to put it back together. I only have one wafer, which we call a Wafer Scale Engine, one printed circuit board, and one cold plate. In this case they need more, right? You need the substrate, you need to solder this substrate to a PCB, you need these various dies to be soldered to the substrate, and you need the silicon bridge interposers to actually create the connectivity. So a lot of partners and assembly processes are involved.

Small chip approach

I see. If you call this the small chip approach, it looks like as we progress with AI and it requires more memory and more bandwidth, the small chip approach gets increasingly, almost exponentially, more complicated. The recipe gets more complicated, the tolerances are more demanding, and we stay as a single uniform chip with the same tolerances.

Correct. So we alleviate all these challenges.

Completely alleviated.

Waferscale

That's interesting. You know, I remember when the
first Wafer Scale Engine was announced in 2019, it was very elegant, but the advantages were still a little bit more abstract, because on the GPU side they hadn't run into these issues yet. They were still on something simpler; they didn't have interposers and all that stuff yet.

Correct.

But now, five years later, it really looks like this approach is at a buckling point, whereas all the advantages of wafer scale are more apparent.

So if this seemed difficult, and we have proof that it was difficult for Nvidia to get that done with two, we actually made it happen for 50 quite a few years ago.

Right, up to 50 of those. It's like 50x this size.

And we got it solved with a simpler approach, a more yielding approach. At the package level we have similar components, but fewer of them, and less sensitive to manufacturing processes for alignment purposes and so on. For example, our Wafer Scale Engine still needs to be cooled with a cold plate and still needs to be powered from a printed circuit board, same as a GPU would. However, our Wafer Scale Engine is very flat. It's a single piece of silicon that's made very, very flat, which can easily be mated with a cold plate that's also very flat, and we can use a uniform thermal interface material that allows thermal connectivity between the wafer and the cold plate. On a GPU, or any reconstructed processor like this, you have various dies that might have different heights, but most importantly they might have different power consumptions with different thermal expansion, which causes them to be at slightly different elevations. So you need a thermal interface material here that has vertical compliance. Another problem is with the CTE that you have here on the organic material and the silicon that needs to be bonded on it. The entire structure expands and contracts at different rates than the actual cold plate. The cold plate here, if it is copper, has a CTE of about 17 parts per million per degree C.

Oh wow.
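
As a rough illustration of the mismatch being described, the sketch below applies the standard linear expansion relation (change in length = CTE x length x temperature change) to the three CTE figures quoted in the conversation. The 50 mm span and the 60 degree C temperature swing are hypothetical values chosen only for illustration, as are the function and variable names. The calculation assumes free, unconstrained expansion, which a bonded package does not have, so the result is an order-of-magnitude indication and not a model of any actual package.

```python
# Free (unconstrained) linear thermal expansion: delta_L = CTE * L * delta_T.
# CTE values are the approximate figures quoted in the conversation,
# in parts per million per degree C.
CTE_PPM_PER_C = {
    "silicon": 2.6,
    "organic_substrate": 10.0,
    "copper_cold_plate": 17.0,
}


def expansion_um(cte_ppm_per_c: float, length_mm: float, delta_t_c: float) -> float:
    """Growth in micrometres of a part of the given length over a temperature change."""
    length_um = length_mm * 1000.0
    return cte_ppm_per_c * 1e-6 * length_um * delta_t_c


def mismatch_um(material_a: str, material_b: str, length_mm: float, delta_t_c: float) -> float:
    """Difference in free expansion between two materials over the same span."""
    grow_a = expansion_um(CTE_PPM_PER_C[material_a], length_mm, delta_t_c)
    grow_b = expansion_um(CTE_PPM_PER_C[material_b], length_mm, delta_t_c)
    return abs(grow_a - grow_b)


if __name__ == "__main__":
    SPAN_MM = 50.0    # hypothetical distance across the package
    DELTA_T_C = 60.0  # hypothetical temperature swing
    pairs = [
        ("silicon", "organic_substrate"),
        ("silicon", "copper_cold_plate"),
        ("organic_substrate", "copper_cold_plate"),
    ]
    for a, b in pairs:
        print(f"{a} vs {b}: {mismatch_um(a, b, SPAN_MM, DELTA_T_C):.1f} um")
```

With these assumed inputs, silicon and the organic substrate differ by about 22 micrometres over the span, which is the same order as the 10 to 50 micron interconnect pitch mentioned earlier in the conversation.
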
So we have 17, we have 10, we have 2.6. We have all these materials contracting and expanding at different rates.

And the printed circuit board is mostly the same as copper, with a CTE of 17 ppm per degree C. So they mitigate the CTE mismatch with solder joints, with an organic material that has an intermediate CTE, and then try to mitigate all that with rigid or semi-rigid connections, both on the thermal and the electrical side. We don't have to do that. We use a thermal interface material that can actually slide. Basically, we knew from the get-go that we would have that problem.

I see.

Right, the substrate expanding at a different rate. So we made it capable of sliding. That thermal interface material for us is a sliding one. Same for the electrical material. Here in the center of the wafer you have a direct connection, but on the edges you might have a different thermal expansion of the PCB, which causes these connections that are initially vertical to start to bend with temperature. We invented a connector that can actually cope with that deflection, that CTE expansion.

You invented a flexible connector?

A connector that can actually allow the two pieces to expand at different rates while maintaining connectivity. In this case they can't, and this is part of the problem: if you have such hard-bonded components and you heat up or cool the structure, it will bow, it will start to move, and sometimes it moves so much that it starts to crack. It cracks connections or it cracks solder joints; it delaminates. This might be part of the problem that they're having, where maybe some of the connections are not made from the get-go, but also break over time in the system.

Yes.

And here what we have is something that on both ends allows this wafer to float and be completely independent of these CTE mismatches.

Conclusion

I see. So it looks like five, six, seven years ago you'd considered what happens when AI reaches scale.

Yes.

The
chip is designed at a dramatically larger scale, and I think finally now, with ChatGPT and Llama and things, it all makes sense, because it's at the right scale. It's gigabytes to gigabytes, and we're the only chip with petabyte-scale compute and petabyte-scale memory bandwidth. That's never been done before. And on the system architecture side, pretty much all the issues that the Blackwell architecture is running into were anticipated and solved ahead of time, well ahead of time, and the benefits only now become apparent.

Correct, correct. This is something that takes time for people to realize, right? When you are deep into your own world of traditional chip making and you face a problem, and maybe you have an easy way out of that problem, you keep doing the same thing. Every step is easy; you just continue. But at the end it becomes a very big challenge, and if you don't step out of that mode of looking so close, and look a little further, you miss it.

Yes.

And you then continue to propagate challenges when it could be more easily fixed.

Awesome, great. This has been super helpful, JP. Thank you for the session. We hope this has been informative, and maybe we'll do one again sometime.

Thank you so much, it's been a pleasure.

Okay, bye-bye.
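
The defect-tolerance argument made earlier in the conversation (a wafer of about a million small cores, where a defective core is logically removed from the array) can be put in rough numbers. The sketch below is a minimal illustration that assumes a simple Poisson defect model, which the speakers do not name. The defect density and silicon area are hypothetical values chosen for illustration, and the function names are invented here; only the figure of a million cores comes from the conversation.

```python
import math


def zero_defect_probability(defects_per_cm2: float, area_cm2: float) -> float:
    """Poisson model: probability that a region of silicon contains no defects."""
    return math.exp(-defects_per_cm2 * area_cm2)


def expected_defective_cores(defects_per_cm2: float, total_area_cm2: float, n_cores: int) -> float:
    """Expected number of equal-area cores that are hit by at least one defect."""
    core_area_cm2 = total_area_cm2 / n_cores
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is very small.
    p_core_bad = -math.expm1(-defects_per_cm2 * core_area_cm2)
    return n_cores * p_core_bad


if __name__ == "__main__":
    DEFECTS_PER_CM2 = 0.1     # hypothetical defect density
    SILICON_AREA_CM2 = 450.0  # hypothetical usable area of one wafer-scale part
    N_CORES = 1_000_000       # "a million cores", as stated in the conversation

    p_perfect = zero_defect_probability(DEFECTS_PER_CM2, SILICON_AREA_CM2)
    bad_cores = expected_defective_cores(DEFECTS_PER_CM2, SILICON_AREA_CM2, N_CORES)

    print(f"Chance the whole area is defect-free: {p_perfect:.2e}")
    print(f"Expected defective cores out of {N_CORES}: {bad_cores:.1f}")
```

Under these assumed inputs, the chance of getting that much silicon with no defects at all is effectively zero, while only about 45 of the million cores would be expected to be defective. That is the sense in which a design that can map out individual small cores can still use the whole wafer.
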
