OK, good afternoon and welcome. We're ready to get started. There's no amplification in the room, so this mic is only for the remote people in Zoom and for the recording. So as usual, we'll try to record everything; in the past that has always worked, and we'll see if it also works today. Welcome to the eighth EasyBuild User Meeting. We have two fully virtual editions behind us; hopefully that was a temporary thing and we won't have to go back to that. I'm very happy to be here in London and to see all of you here. We're ready to get started with the first talk, but York has some practical announcements first, some important stuff. Yeah, I also would like to welcome you here at Imperial. I hope you find the venue suitable. Lunch has been served outside. There will be lunch tomorrow and on Wednesday, and we've got two coffee breaks, one in the morning, one in the afternoon. The bathroom is a little bit further down on the left-hand side. We do not expect a fire alarm, so if the siren goes off, this is a real alarm. Please leave the building and assemble opposite the entrance. I would appreciate it if people actually do that, because that makes it easier to do a head count in case somebody has fainted in here and the number is wrong; the fire brigade comes in and tries to rescue that person. If that person decided to go to a pub instead, the fire brigade still comes in, but it doesn't look so good, to put it like that. Apart from that, I mean, there is no planned fire alarm, so... That's basically all from the organization. Do you want to say something about Ian? Yeah, I think we all know Ian by name. And I hope you enjoy the day. So, Ian, the floor is yours. Yeah, so first a quick word about Ian. This is the first EasyBuild User Meeting where we have a YouTube influencer giving a talk. I reached out to Ian out of the blue and I think we're very lucky to have him here as a keynote speaker. Some people may know him from his YouTube channel as TechTechPotato, and he has more stuff like this going on. So thank you very much, Ian, and the floor is yours. Thanks, Ken, and thanks for inviting me. I'll be honest, I've only given talks like this twice. You say influencer; I was a journalist for 10 years, that counts for something. But I'm not an EasyBuild user, so this is not my standard foray. I'm normally a semiconductor specialist: I speak about chips, I speak about architectures. And what was asked of me in this presentation was to describe some of the emerging technologies in silicon for high-performance computing. We're in this weird situation now where we've had a good steady 10, 15, 20 years of x86 and then GPUs and ARM, and everybody's comfortable with that. But we're essentially coming to an era now where there are more chips that can do more things than ever before, and some of it's driven by AI, and I'll get into that. So in this talk, I want to speak about this new HPC era. We're going to go over the types of legacy hardware that we should all be familiar with, and I've got a few samples to pass around for you to look at. I do need them back. And this is one of the benefits of actually being on site: you actually get to see some of the silicon. I'm going to go through some of the new paradigms in computing. Some of them are new, some of them are old, but they're getting a new burst of funding because of all the new technologies we've developed. So analog, neuromorphic, quantum, optical; there's a few there that I've probably missed as well.
And then I'm going to speak about why AI chips matter and this push to low precision, and how low precision is actually being implemented in HPC as well. And then I want to go through an extended discussion about AI hardware, because it is a very vibrant, fast-paced market, not only for the silicon but also the software. We have new AI models coming out every week and everybody tries them or trains them or tests them. And then we have GPT and ChatGPT and OpenAI and all these funky things like LaMDA coming around. But I'm going to focus mostly on the AI hardware and how it pertains to HPC, so you can see some of the architectures that are involved and are being put in play here. I'm going to go through some case studies. There's a few chip companies that I work with quite closely that I can give some insights on, some roadmaps into the hardware and also a little bit about the software stacks: oneAPI, ROCm, some of the specific ones. I did actually reach out to AMD to get an up-to-date view on ROCm, because apparently that was quite heavily requested; unfortunately we never made that happen, but I'll tell you what I know. A little bit about me, because I wouldn't be an influencer if I didn't do some self-promotion. My name's Ian Cutress. Even though I say influencer, I'm actually an independent industry analyst. I work with companies on their technical messaging and try and guide them to describe what they're doing in a way that appeals to a technical audience. Very often in these companies the marketing teams, the marketers, are not technical, and there's some disconnect between engineering and marketing, so I help a lot of companies through that. And then, as I said, there's my YouTube channel called TechTechPotato; on the right here is a video I did with IBM. They took me through their quantum roadmap strategy. That video is approaching half a million views. It just so happens I posted it the same week the Nobel Prize went to quantum computing, so that kind of helped. On the left, I'll talk about it a bit later, but this is a new project that I'm doing with Sally Ward-Foxton from EE Times called the AI Hardware Show. Twelve episodes, and in each episode we cover six AI chips that are exciting to us. But I really like it. I actually published the second episode's after-show podcast this morning, so I'll have to see how that's doing later. And my background is essentially in computational chemistry as well. I was programming GPUs back when it was CUDA 2.0. I haven't really touched them since, but there we go. And yeah, so let's start off with this title of Silicon or Survive. We're in this new HPC era, but let's go through the legacy types of hardware. So CPUs: everybody's familiar with x86. I hope everybody's familiar with Intel, AMD, and not really Centaur, who have just been bought by Intel. And then we also have ARM, with lots of ARM HPC chips coming up as well: Ampere, Fugaku, and Nuvia, now part of Qualcomm, is the exciting one. I'm going to get into a bit of that. And then I don't have a slide on POWER, but POWER still exists, right? So, Intel CPUs. The latest generation Intel CPU here is Sapphire Rapids. We've gone from lakes and we're now on Rapids. You're probably very familiar with Skylake; that was the very popular 2016 architecture that powers a lot of supercomputers today. Sapphire Rapids is new and exciting because it's using chiplets. You see on the top left, we've got four massive chips. That's 1600 square millimeters of silicon.
They can only do that using advanced packaging, because your limit per chip is actually about 800 square millimeters in modern manufacturing. So with that, they can put 56 cores on a chip with a bunch of additional accelerators. On the top right and bottom left, you see these little gray bars beside it. That's HBM. So they're now putting HBM on their CPUs, up to 64 gigs. These are special models specifically for HPC. You can run these CPUs without any DRAM, so we're going to see some installs that are highly dense without any need for DRAM. If you look at the cost of a modern server these days, if you want two terabytes of DRAM, your main cost is actually the memory; the CPU is kind of inconsequential, whether you get 16 or 64 cores or what have you. So this is a way to introduce additional density into a modern server platform. And on the bottom right here is the Aurora configuration, which is two Sapphire Rapids CPUs connected to six of their new GPUs. They're calling it GPU Max, everybody calls it Ponte Vecchio, but we'll get into that, and I've got some numbers as well afterwards. And this is AMD's Genoa. So AMD are now back in the x86 game. If you've been asleep under a rock, they now have anywhere between 20 to 30% of the server market, depending which analyst you believe. And their approach for the last four generations now has been this chiplet design. They're doing a more basic version of advanced packaging than Intel, but it means that they can have up to 96 cores on a chip now. These are high-performance x86 cores running at four gigahertz for about 350 watts. We're seeing a lot of interest now in AMD because they've got four generations of their new high-performance architecture under their belt, pushing the frequency. And obviously, if you're familiar with the top computers of the day, a few of them are now using Genoa. And this chip is absolutely massive. This is where I should pass these around. Right, so one Sapphire Rapids CPU. For those of you on the stream who can't see this, what these companies now like to do with people in the press like me is hand out basically dead CPUs encased in Lucite as a paperweight, basically just to show it off. Because media is moving from written to a more visual medium, which is why I've got into YouTube and video, they now essentially want to show these off to everybody, just so they can show something on stream that isn't behind a glass case at an event once in a while. So now those sit up in my office alongside a few others. I highly encourage companies to do this if anybody is watching; silicon in Lucite is very fun. So x86; let's move on to ARM. The big ARM chip in HPC that everybody knows about now is the one in Fugaku, running so many cores, high performance, the number one computer, or it used to be. And this is a fully custom design, ARM v8.2, but the key thing here is the scalable vector extensions. If you're familiar with Intel and AVX-512, this is essentially that on steroids, with HBM, and Fugaku has been involved in Gordon Bell prizes and research into the COVID-19 pandemic. But this is an ARM-based CPU using standard ARM instructions. And if we actually look at the Top500 list, I decided to go back 15 years to see where the top computers are and what architecture they're using on the CPUs. This is all 500 systems. If we look at 2018, Intel's about three quarters, AMD's at 12%, POWER is at 12%, so IBM's still there, and there's a little bit of SPARC left, but that's the final SPARC.
Going forward to November 2022, and I know there's going to be a new list coming up at International Supercomputing here in a couple of weeks, but Intel's roughly stayed the same. AMD has taken essentially all of POWER's share; there's still a couple of POWER systems left, paired with NVIDIA GPUs as well. But yeah, just showing where we are in the Top500: Intel is still predominant in the systems that are currently installed. If we look at the new systems, it's pivoting more towards AMD, but Intel still has the majority of it. This is because Intel has deep roots in partners like HPE, Lenovo, Inspur, all the good stuff that supplies some of these CPU systems. That was CPUs; let's go on to GPUs. NVIDIA and AMD are obviously the big ones, and I've got Intel here as a question mark because they're coming thick and they're coming fast. NVIDIA: everybody should be familiar with NVIDIA, using the software CUDA and PTX instructions. The latest generation chip is the H100, H for Hopper. 80 billion transistors on a four-nanometer process. This is actually, I think, an AI version of Jensen. They did one of their presentations where it was mostly Jensen trained by AI, so he didn't actually need to do much for that presentation. And if you watch an NVIDIA presentation these days, they will have lots of special AI-driven special effects. So they'll have human segmentation, they'll change the background, they'll increase the number of spatulas behind him because he does some presentations from his kitchen. I think with this system he actually took it out of the oven, because he said it was ready. We're currently seeing these on eBay selling for about 40,000, the Hopper GPUs. But it's a massive chip, lots of HBM memory, and people are clamoring hand over fist for them because they're really good for AI. AMD's latest chip is the MI250X. It's a big chip because it's a bit of a cop-out: it's actually two chips packaged together with lots of HBM. It does mean that this is normally the FP64 leader in compute; I think it's something like 95.7 teraflops theoretical peak, and in practice you're probably getting around 50 to 60 on a good matrix multiply. But the point here is that they're doing this for density, not necessarily for programmability or ease of use. This chip still looks like two GPUs in the system; there's no real benefit to any additional bandwidth between the chips. And the standard Cray Shasta system at the top there is two CPUs and essentially four MI250Xs. On the bottom left is the MI300 generation, which we'll get onto later, which should be coming out later this year. Now Intel. Intel's a bit different, Intel is always a bit different. Having not really been in the GPU space for quite a long time, headed up by Raja Koduri, who has since left, their HPC push in the GPU space is with their chip called Ponte Vecchio, or GPU Max. This is an impressive chip, because imagine what we saw with packaging on Genoa and Sapphire Rapids and then push it further: this is 47 tiles. So where we saw Genoa the CPU with 13 chiplets, this is 47. Some of it is stacked vertically, some of it is horizontal. And if you have a 1% packaging loss per chiplet, that means a package with 47 of them would yield about 63%. They don't do it that way, but this chip is designed to have essentially equal FP32 and FP64 performance, around 52 teraflops, at about 600 watts. And it also does INT8 performance for low precision, which we'll get to in a bit. This is what's going in the Aurora supercomputer, about 20,000 of them.
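As a quick aside, that packaging-yield figure is just compounding probabilities per tile; here is a back-of-envelope sketch, where the 1% loss per chiplet is the hypothetical from the slide, not a real manufacturing number:

```python
# Hypothetical packaging-yield maths: if each chiplet has a 1% chance of being
# lost when it is bonded into the package, the whole package only survives if
# every tile does, so the losses compound multiplicatively.
per_chiplet_yield = 0.99
for tiles in (13, 47):                 # roughly a Genoa-style vs a Ponte Vecchio-style tile count
    package_yield = per_chiplet_yield ** tiles
    print(f"{tiles} tiles -> {package_yield:.0%} of packages survive")
# 13 tiles -> 88%, 47 tiles -> ~62%, which is where the "about 63%" figure above comes from
```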
And we'll see this in a second, but again, of the systems in the Top500 with accelerators today, 163 of them are NVIDIA, over 90%. AMD's got 5%, and there are a couple of others, like the PEZY stuff and what have you. Then if we look at the Rpeak compute in exaflops, just the sum across those systems, NVIDIA and AMD are about equal, because the AMD systems are massive installs. And if we add Aurora to that mix, you actually get roughly thirds between them. So if you were planning your compute strategy based on the compute sum in the market, you'd say 30/30/30, but in fact, by system count it's still 90% NVIDIA. The other benefit that NVIDIA has is that they have 10 times more engineers than AMD, so any feature you want from an AMD system, NVIDIA is going to have 10 times more people working on it. But it's fun. It means that there's competition, it means that there's going to be a lot of hardware to-ing and fro-ing, everybody's going to be trying to get a good deal on their hardware, everybody's going to be pushing for power efficiency. Whether it works or not is up to you guys. Yeah. And then a little thing about FPGAs, because there are some FPGAs in the mix here. Two big companies, historically Altera and Xilinx, now owned by Intel and AMD respectively; they just needed that additional portion of their portfolio. AMD's portfolio is made up of the Virtex FPGAs. These are the big EDA FPGAs, the ones that they do chip simulations on. The VU19P, when it was launched in 2017, I think, was and still is practically one of the biggest FPGAs in the market, and it is very useful for that sort of thing. We now have, at the bottom left, versions with HBM, so they clearly know that they need high-performance memory next to high-performance compute. Field-programmable gate arrays: you can essentially program what you like. But down on the bottom right here is their new generation of products called the ACAP; that's adaptive compute acceleration platform, or something similar. Instead of having a pure programmable-logic solution, you have some programmable logic, then you have some DSPs that are hardened AI accelerators, you have some ARM cores, some real-time ARM cores, and a lot of fixed IO: if you need the memory, if you need PCIe, if you need security, if you need accelerated crypto engines. So it just reduces how much of the field-programmable capability you have and provides something that a lot of Xilinx's customers want. These are still very expensive as well. Intel, on the other hand: their product family is called Agilex, and Agilex had a bit of a rough birth. When Intel purchased Altera, they essentially said scrap what you're doing, build everything for Intel 10 nanometer. If you know the story about Intel 10 nanometer and how it struggled to get out of the door, and it's one of the reasons why Intel is now behind in manufacturing, it meant that essentially the Altera division didn't have any brand-new products for several years. They are now slowly coming out on what is called Intel 7, which is the third generation of Intel 10; a lot of people still call it 10++, if you're familiar with the nomenclature. But they now have these FPGAs for low performance, mid performance, high performance, some with HBM, some with advanced networking. And the idea is that with their advanced packaging technologies, which you can see here in most of the pictures, you can attach 400-gig Ethernet, additional Ethernet tiles or PCIe tiles or transceivers, basically big, wide SerDes links. And in the top left, you've also got CXL, which is a future technology.
I could speak about some of the other FPGA players like Lattice, though they are decidedly more in the mid-range. But what FPGAs actually get used a lot for in modern HPC is something called a SmartNIC. I've listed this here under the ASICs section because there's a range, and this is from 2021 from ServeTheHome, which is another media organization you should absolutely be reading: their network interface card continuum. So what is a SmartNIC? A SmartNIC is your network card that does smart things to the data as it goes through, so it's not simply a quote-unquote passive device. Maybe you need to do some advanced routing on the networking card, or you need to do some encryption and decryption, maybe you need to do some packet analysis of the data coming in and out of your system. If you're talking about a big HPC system, then routing is a main thing, to make sure you have optimal bandwidth. In the cloud, so AWS, Azure, Google, they'll be doing this for additional security and their quality of service, making sure they meet minimal delays for their clients. And it's deployed and used heavily in private clouds and private systems as well. For example, there's a setup using smart networking where, instead of transferring the full data between two systems, it only transfers the difference in the data between time steps. And you get additional compression if, say, you've only got 0.001: in a decimal format, or in a floating-point format, you can compress that a lot better than if you're actually sending over 12.386 or whatever. So SmartNICs can help with that as well, and actually I'll get on to the system that uses that. This is another ASIC. If you've heard of D.E. Shaw Research, they have an ASIC called Anton, and they announced Anton 3 a couple of years ago now. D.E. Shaw Research is headed up by a hedge fund manager slash scientist who made his money on Wall Street and now just has a research department that is his plaything to write research papers on molecular dynamics. And as part of that, they built their own silicon just for molecular dynamics. I'm not talking FPGA; they actually taped out pure ASICs. And in molecular dynamics, obviously you're fighting against all the maths that's involved plus how many atoms you can put in a simulation. And there's a video of this talk at the Hot Chips conference; I absolutely loved it, because my background is in chemistry. But the whole point is they're trying to do more than what you can do with any other hardware that I've mentioned so far in this talk. And the key graph is this. This is molecular dynamics performance. So you have simulation size on the X-axis and the performance in microseconds per day on a log scale on the Y-axis. And we see here that the best GPUs are that gray band, the best conventional supercomputers are the green band, and you can do 100X more, with larger simulations, using custom hardware. This is the benefit of having a custom ASIC. And this is also one of the reasons why I'm going to be speaking about the AI hardware, because if you can take benefit of it, the speed-ups are enormous. And so let me go on to a new paradigm, because those are all classical compute situations, essentially, even with a custom ASIC. We've got new-slash-old-school computing: analog computing. So the idea is you have a digital-to-analog converter, pass the values through a matrix of resistors with essentially known resistive values, and you get your values out; there's a little sketch of that idea below. Lots of benefits: super low power, super low latency.
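To make that concrete, here's a toy model of the analog matrix idea, with made-up conductance and voltage values; a real part would also have to deal with the conversion and nonlinearity issues that come up next.

```python
# Toy model of an analog resistor-matrix multiply: weights are stored as
# conductances, the DAC presents the inputs as voltages, and each output line
# simply sums its currents (Ohm's law per element, Kirchhoff's law per line),
# so the whole matrix-vector multiply happens in one step. Values are made up.
import numpy as np

conductances = np.array([[0.10, 0.50, 0.20],   # siemens; one row per output line
                         [0.30, 0.05, 0.40]])
voltages = np.array([0.8, 0.2, 1.0])           # volts; one per input line

currents = conductances @ voltages             # each output current is a dot product
print(currents)                                # an ADC would then digitise these sums
```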
And arguably you can have any value possible. What happens in practice is that you actually cut off your values depending on your conversion ratio, and that's where some of the issues happen. You also get elements with nonlinear responses: can you predict what your multiplication is actually going to be in the analog domain? And then scaling these things out has always been tricky. Key players, if you're interested in the companies involved in this: Mythic AI, who were dead and are now back; IBM is also working on some analog stuff now; and Aspinity is a fun one, a company that tries to get around doing the digital-to-analog conversion at all, so you actually have an analog input from a sensor. That's fun. Neuromorphic computing, and I'll caveat this with a warning, because neuromorphic in the UK especially has been part of the ecosystem for so many years. If you've heard of SpiNNaker, that's what the thing on the right is. That's using spiking neural networks. So it's trying to act like the brain, with actual spiking neurons and axons, and you do your calculation and then, depending on what happens, you get your output, which is also a spike, and it's also in the time domain. There are companies today who have neuromorphic in their name who aren't doing anything like this. They're just doing pure digital compute, and the reason why they call themselves neuromorphic is because they are, quote-unquote, brain-inspired. Key players here: the one that most people have heard of is the Intel Loihi chip. This is a research-only chip right now; you have to be involved in their Intel Labs program to get hold of it, but they just launched a second generation. It's built on Intel's most advanced process node technology. I think I've actually got a slide in here about it, and then we also have SpiNNaker, which I think is in Manchester. So, Intel Loihi 2. Normally in a neuromorphic setting you'd think of it kind of like analog computing, with spikes coming through and seeing peaks, but Intel's able to do it in a purely digital domain and still have that spiking behavior. The key metrics are here on the bottom left, if you guys in the room can't see it: maximum neurons, we're talking about a million neurons per chip, 100 million synapses per chip. And the point is, if you have 100 of these chips, maybe you can simulate a mouse brain, or at least have as many neurons as a mouse brain, in a pure 2D fashion. I spoke to, we'll see him in a second, Jim Keller, who's a very famous chip designer, and he told me that the brain, if you look at the neurons, rather than being a 2D mesh, is actually about six neurons deep. So there's some 3D, so there's still that to do on the neuromorphic design. Maybe above my pay grade, but quantum computing. I'll use this image again because I like the fact that this video did really well. But quantum computing is the technology where everybody says, well, when is it going to be ready? What can it do? It's the 10-year technology that will be ready in 10 years' time. But the simple answer to what it can do: there are three main areas. One, the physical world, so physics, chemistry, biology. Second is maths and encryption; the big example use case is Shor's algorithm, in order to break public-key encryption. And then we have machine learning, and obviously you've got to have machine learning on a slide about computing because we are in the machine learning era. But there is work going on about this. Again, I partnered with IBM for this video.
So I'm very familiar with IBM's position right now. They have over a thousand research papers currently using quantum computers. There are 20 quantum computers in the cloud that people can sign up to use and pay to use; there are even certain amounts you can use for free. They have an online simulator, so if you want to simulate up to four qubits... thankfully all the situations where four qubits would be useful are already known, so it's just a lookup table. But Intel last year said that the problem with simulating qubits is that if you add another qubit, you double how much compute power you need. They were able to say that they could simulate 44 qubits using most of one of the big cloud systems, some 300,000 systems. But the reality is we need about a billion qubits; you're not going to simulate that very easily. There's a high barrier to entry, and it doesn't really do any other mathematics. The point here is that the way you approach problems changes so significantly that it's very hard for classical computing experts to transfer over without a deep course in quantum mechanics. But there are several types of qubits available. Like I said, it's a very active research area, even though nobody's really making any money on it right now. I think the key thing to take away from this slide is what's on the left. You have key metrics in quantum computing that matter when you're doing your compute. You have coherence time: how long your qubits stay relevant before you essentially have to spin them back up again to make sure that they're coherent. Ion trap is very good; it's over one second. Superconducting, so that's when you're pulling it down to 10 millikelvin; we've actually got that up to milliseconds now, up from my slide from a few years ago. Semiconducting, so this is using the standard manufacturing processes that we see silicon on today, with spin qubits; that's 28 milliseconds. And then NV centres are the ones that are really interesting. But the point is qubits and scalability: the number of qubits that you can do on each process reliably right now. We'll see a slide in a second, but superconducting is now up to about 400. But it's the scalability you need to focus on. If we need a billion qubits, you need something that scales; if you don't have something that scales, there's no point in researching it. And Google did a great slide here, which I've copied, thanks Google, where the problem that you have with these qubits is error correction. Especially for the superconducting ones, the environmental noise matters, because that disturbs what your qubit can do, and then you essentially need to collapse the wave function and bring it back up again. In order to make sure that doesn't happen, you need some form of error correction, as standard memory would have. But you're looking at a ratio of about 100 to 1,000 physical qubits for one logical qubit. So when I say you need a billion qubits, that gives you a million actual compute qubits to work with. And this is why you need scaling. And if we look at Google's numbers here: in 2019 they had 54 qubits, which could theoretically go beyond classical compute; in 2023 you have 100; in 2025 you have 1,000. And then the timeline will get longer and longer, because we have to discover how these things scale. And this actually doesn't come out very well on this screen, but this is where we are in IBM's quantum roadmap, again just because I'm more familiar with them.
At the end of last year, they announced Osprey, their 433-qubit chip that goes in one of those massive dilution refrigerators, down to 10 millikelvin. And they do something called a heavy-hex architecture, where the qubits can only be entangled with the ones next to them. And then each one has to be profiled for its coherence time. If you've ever done any testing of memory and how long a memory cell can keep its charge, it's kind of like that: most can do pretty well, but there are some that are really bad. So when you have an architecture like this, you kind of have to work around that with redundancy, and it's quite difficult. But they have a strong roadmap through '23, '24, '25: 1,000 qubits, on to 4,000. And the key differentiator here is going to be how you take one chip and connect it to another. Right now, all that we've been doing is getting bigger quantum chips, but that scaling ends, so at some point you have to be able to connect it to another system. How do we do that and maintain the expansion of qubits that we can use? The other thing that's not mentioned so much here is the software stack. IBM has an open-source software stack called Qiskit, which is online, free to use. And I didn't put it in here, but they've run, I think it's over 10 trillion circuits in the cloud in the last three years. And there are more and more users; there's lots of free time, or you can pay for time, I think it's something like $2.70 per second. But in that second you can do, say, 10,000 of what they call shots, of just pushing the data through. It's expensive, but fun. And we're waiting for it to actually cross over to where it actually becomes viable. Now I want to move over to optical computing. Optical sounds fun. It's the speed of light, right? If you can compute at the speed of light, then we don't need anything else. Compute happens because we can move light through silicon in something called a waveguide, which is what you have on the left here. This is about 200 by 100; I forget the scale, whether it's microns or nanometers. And the idea is that with the waveguide you can push your light through. And in order for it to compute, you need a switch: transistors and switches, right? And the key switch that most people are using today in optical computing is the Mach-Zehnder interferometer, the MZI; sorry, I'm too used to speaking to Americans. And the whole idea is that if you split your light and you apply a voltage bias on one side, you can bring it in and out of phase, and that's your switch. So here we have a differential phase diagram, and if you get the voltage exactly right, you can either have full power or no power, and then that data can continue further down. You put a lot of these together and you get switches, transistors, a compute platform. And if you actually look at some of the diagrams, so this is a company called Lightmatter on the left and Light Intelligence on the right. Lightelligence, I should say; that's a terrible name. And if you actually look at some of the diagrams of how these work, the Lightmatter front panel of their server is actually quite fun, because that's kind of what it looks like. The idea is that you have this single beam of light and it's either switching or not switching; for me, that means you're limited to essentially one-bit compute, almost. And the benefits: no power, and speed-of-light fast. That's pretty good, well, at least the speed of light in silicon. The problem right now is manufacturing and scaling; there's a quick sketch of the switch and the cascading problem below.
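Roughly, the switch behaviour he's describing looks like this; a toy sketch with illustrative numbers, not any particular vendor's device:

```python
# Toy Mach-Zehnder switch: split the light, phase-shift one arm with a voltage,
# recombine, and the power on the output follows cos^2 of half the phase
# difference. Mis-tune each switch slightly and the loss compounds as you
# cascade them, which is the manufacturing problem discussed next.
import math

def mzi_transmission(phase_difference_rad):
    """Fraction of input power reaching the output of an ideal MZI."""
    return math.cos(phase_difference_rad / 2) ** 2

print(mzi_transmission(0.0))         # perfectly in phase: 1.0, full power ("on")
print(mzi_transmission(math.pi))     # half-wave shift: ~0.0, no power ("off")

per_switch = 0.98                    # 98% of the light survives each mis-tuned switch
for stages in (100, 1000, 10000):
    print(stages, "switches ->", f"{per_switch ** stages:.1e}", "of the light left")
# by 1,000 switches you're down to about 2e-9 of the light
```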
And if you're dealing with this: we're currently dealing with transistors on the nanometer scale, not the five or three nanometers that people talk about, those are just names, but you have gate widths of 26 to 40 nanometers. If we're dealing with waveguides in the 200 range, then you don't get that level of density. And lots of research is being done on how to make adders, subtractors, multipliers with light. The problem is, with those interferometers, if you mis-tune one so that you only get 98% of your light through, cascade that down a thousand, 10,000 switches, and you've now got no light, because they're all out of tune. So that's why manufacturing has a serious issue here. And this is an example, Lightmatter's chip Mars. It's built on a 12-nanometer low-power process. I think the global clock is actually about one gigahertz, but it does a 64 by 64 matrix in 200 picoseconds on 150 square millimeters. Power sounds great, latency sounds great, density doesn't sound great. And then you've also got this laser coming in off-chip, because the light isn't actually generated on-chip, and you have to power the laser, but we don't often count that. The push in the research is that eventually we'll be able to generate light on the chip and then essentially dissipate it on the chip as well. But scaling that down and actually doing it in a powerful way is difficult. I realize I'm 40 minutes in and I'm only partway through my slides. So, the push for low precision. I've gone through a lot of chips here that deal with FP64 and FP32, and I know that's what the HPC community is comfortable with, but I want to talk about quantization. Now, what does FP64, what does FP32, what does FP16 actually mean? Well, it determines what range of numbers you can represent and how accurate they can be. So with FP32 and FP16 we have a sign bit, we have an exponent, and a mantissa. And if I bring up this slide, it might make more sense. So this equation is almost correct; two's complement and other stuff makes it slightly different, but whatever. But it's all defined by the IEEE 754 standard: you know what it's going to do if it errors out, and normal, subnormal. But the important thing here, and I'm actually taking this from a Qualcomm slide: if you have a 32-bit floating-point number, 3452.3104, maybe you can represent that as an 8-bit integer. So you've reduced the number of bits you need to a quarter and it's still roughly the same, assuming your calculation is amenable to losing a little bit of precision. Which is actually what AI is. And one of the problems with quantization is how you actually represent the numbers. So in any number format, you can't represent all numbers, because you're limited by the fact that you don't have a million bits to represent them. So you have to either clip or round. And when you type-cast from an FP32 to an integer eight, you're going to have some loss in that level of accuracy, and it depends, especially with floating point, on what format you're going to and from. Now, these are all the floating-point formats I found, and I can tell you what they're all used for. So FP32 at the top, standard, everybody should know that. TF32 is NVIDIA's TensorFloat-32 representation; what it's done is cut down the mantissa, but you've still got the full exponent. FP16 is one that's being researched in AI right now. And you have BF16, which is Google's brain float 16.
And what they've done there is take bits from the mantissa and put them on the exponent, because you get a different trade-off of range versus accuracy, and that's better for machine learning. FP12 is a weird one that I found. FP8 is what's being researched right now in AI, and you'll see these representations like E3M4: exponent three bits, mantissa four bits. With FP8, if you change how many bits you use for each, you can focus on either precision, or range, or something else. Tesla is using this quite extensively in their chip Dojo. FP24 and FP21 are internal GPU formats; GPUs that deal with visualization or vector compute sometimes use these internally. MSFP12 is Microsoft's FP12 format: what you have is essentially four bits per value, so an FP4, but then a shared 8-bit exponent for all of them, so you're essentially defining the range in which those FP4 numbers can sit. And then you have FP4, which IBM is researching. FP2 technically exists, and technically so does FP1, which I think is just a sign bit. But we have IBM leading research in quantization. They're looking at it for artificial intelligence, the idea being that if you can run at 8-bit, 4-bit, 2-bit rather than at 64-bit, you can amplify your throughput. But in terms of HPC, all HPC runs at FP64 and FP32, right? No. So on Isambard in Bristol, research was done on FP16 for climate and weather, showing the speed-up versus FP64 while also maintaining accuracy, though that's not on the graph. This was done with the A64FX chip that's in the graph. And the reason why I bring it up is because this is now a very innovative space in HPC: can we apply reduced precision to our HPC workloads? If you look at International Supercomputing coming up in a couple of weeks, these are all the talks on reduced precision. And there's a name on the bottom right called Jack Dongarra. He's pretty important and he wants to talk about it, so I think everybody should listen. And the point is, if we've got some fluid dynamics, or some more fluid dynamics, or you're doing some particle or heat-transfer simulations, maybe out here you don't need FP64, maybe you're happy with FP8, so you can speed up that bit of the simulation. Yeah, you write your code as FP64, but if you have a library that does the right amount of quantization at any given point, you can speed up your simulation and be happy with the accuracy of the result. This slide, apologies, has come out pretty badly, but it's taken from ISSCC, an IEEE conference. It's showing your exponent and mantissa for different types in the middle, but on the right it's the number of flops you get per joule: how many flops can you get per unit of energy? And if you can run your simulation in FP8, then you get 32X, or BF16, 12.5X, versus standard FP32 and FP64. So this is why there's a push for quantization. You can go through the spec sheets for NVIDIA GPUs: they show FP64, FP32, you've got BF16, FP16, INT8, INT4. Here, 2,500 TOPS of INT4 performance; that's quite a lot. But yeah, so that was quantization. I hope none of that was too new for you, but the reason why it's being driven is the AI hardware market. And we have established players in AI hardware. You should all know these companies, I hope you do. Some of the chips I've gone through, some of them I haven't. So we've got NVIDIA: Hopper, Ampere, Volta, maybe even Pascal is still used these days for training. AWS has its own training chip, and there are rumors that they're working on something new recently. Google has its Tensor Processing Units, v4, v3.
And Intel has at least five that they've been through, and these are the ones that they actually use. Inference is a different workload. So, training versus inference: I haven't really got a slide here, but training is the compute-heavy one; inference is the one that actually works on your phone, just to put it into context. And we've got different hardware here. NVIDIA does some very specific inference hardware, the A10s, the T4s, and Intel has Greco, but all the CPUs do it, and all the GPUs do it as well. I kind of wanted to limit this to ones that you can buy off a shelf. Obviously you can't buy the Google stuff or the Amazon stuff, but the rest you can. A lot of the chips I'm going to talk about now will be limited to very specific installations, but will be quite important. And the point is that there are over 50 hardware startups doing this. I think at least 70, and if you go down into the nitty-gritty of edge computing, there may be over 200. But the market for investment in this is over 10 billion. And these are not the latest numbers, but from September last year, this is how much money is in all these companies that do AI hardware. At the top is Horizon Robotics, which is a Chinese company doing smart city stuff, at 2.2 billion; SambaNova over a billion; Cerebras; Graphcore, UK-based; Enflame and Ampere Computing; Groq; SiFive; Tenstorrent; and a few others that are playing around with VC funding to try and get there. Well, we'll come back to this slide a few times, but going through some of the chips here: so, Google. This isn't the latest, this is the v2, but the v4 is kind of similar. The way their architecture works is that it's a massive matrix-multiply systolic array. So it's like a heartbeat: on every clock, the result from the output of one multiply-accumulate goes into the next, goes into the next, goes into the next. So you're not doing an operation, going back to the scheduler to sort something out and then coming back through, and you're not going to a cache; you're simply passing the data onto the next compute unit. Version one had one big 256x256 matrix-multiply array. Version two had 128x128; I think v4 has four of them. The point is, because AI is dealing with lots of matrix-multiply, matrix-matrix operations, you don't want a matrix unit that's too big, because that wastes power, but too small and it becomes less efficient. And Google were one of the first to be working with reduced precision as well; that's why they invented the BF16 format. And, you know, a system kind of looks like that, which goes into their custom racks. Or you can rent a pod. This is 64 of their TPUs; I think in their v4 pod you can have 4,096 now. And it is petaops, exaops of AI-based compute. Next, I want to talk about SambaNova, which is a company you may not have heard of. Most of their customers are in the defense sector, but they make a chip as big as your hand, which is quite substantial. That's the SN30; it's the SN10 that they've actually been selling recently. Seven nanometer, and their architecture is what's called a CGRA: a coarse-grained reconfigurable array. Think FPGA, but a bit more restrictive, a bit more ASIC-like. So each one of these elements can either be a floating-point unit or memory or a little bit of both, but there's a defined network architecture between them. And the idea is that when you have an operation, rather than the systolic array just essentially pulsating through like a wave, you can have a calculation that just goes between these four elements.
It can be programmed at compile time to just go round and round. So it looks like lower utilization, but you get compute density as a result. And on the map of what it looks like compared to CPUs, FPGAs and ASICs, the general consensus is that CGRAs are applicable to a lot of aspects, and like I say, their big customers are in defense, so defense kind of likes them; defense also likes FPGAs as well. But they're now starting to get clients in academic situations. So that's another example of what their specific CGRA architecture looks like in their first generation; they should be coming up with a second generation fairly soon. And the idea is, yeah, you can also bring your algorithm through different elements. So the first orange area is compute, and then the blue area is memory, then through more compute, then memory, where it's doing norms and sums and min-maxes and what have you. And yeah, SambaNova just got a system added to Fugaku to help boost AI, to help boost performance of the supercomputer. The idea is that if you have a large search space for your compute, maybe you can condense that search space down by using AI. We're going to see another chip that's been attached to a supercomputer in a different way, and that chip is Cerebras. Now, what's the biggest chip you can think of? Is it that big? That's one chip. Size of a dinner plate; I say it's as big as your head. It looks like there's, I don't know, 64, 70 little chips on that. What they're doing is they're working with TSMC, and they have a custom interconnect between the chips that they actually put down, which they've got patents for, and it means that they can make one big 24-kilowatt chip, which means you don't have to go off-chip if you need more memory or if you need more compute. So they're seeing good examples where, if you use NVIDIA GPUs and you realize, oh, I don't have enough compute, I need to scale out, or I don't have enough memory, I need to scale out: what if you could just put two GPUs on the same piece of silicon? That's what that is. The largest chip ever built; it has a place in TSMC's history museum now. Built on seven nanometer, 46,000 square millimeters of silicon. The Sapphire Rapids were 1,600 and the AMD one was about 1,200. 2.6 trillion transistors: if you hear Intel talk about having a one-trillion-transistor chip by the end of the decade, it's already here. 850,000 cores, 40 gigs of on-chip memory. Yeah, TSMC seven nanometer. And there's two of those at a supercomputing center; it cost them five million, so two and a half million each. The most expensive chip you'll ever buy, but also the single biggest chip you'll ever buy. They won't give me one in Lucite, by the way; I've asked. They gave me one that was printed on like a metal poster that you can stick on your wall. But the point of adding it at PSC is again for the AI that we discussed with the SambaNova system in Fugaku, but they also do stencil compute with it. So this is meant to be sort of a heat transfer between two planes. You're literally doing standard HPC code on one of these massive chips, because it's a massive 2D array of compute elements, if you can funnel your data into the next element. I think I've got a slide here showing that half the chip is actually cache. So if you're thinking about, well, how do I include time steps or halos in my stencil compute? You can just fit it all in memory. Yeah, so every core, every one of the 850,000, has 48 kilobytes of high-density SRAM. Imagine having each kernel in your CUDA code with 48 kilobytes. There's a rough sketch of that kind of stencil below.
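To illustrate what "stencil compute" means here, this is a toy 5-point heat-diffusion stencil in plain NumPy; on the wafer-scale part the grid would be tiled across cores with halos exchanged between neighbours, but that mapping, and the grid size and coefficient below, are just my illustration, not their actual API.

```python
# Toy 5-point stencil (2D heat diffusion): each timestep, every interior point
# is updated from its four neighbours. Grid size and coefficient are made up.
import numpy as np

grid = np.zeros((256, 256))
grid[0, :] = 100.0                  # hot boundary along one edge
alpha = 0.1                         # diffusion coefficient * dt / dx^2

for _ in range(500):                # timesteps
    inner = grid[1:-1, 1:-1]
    grid[1:-1, 1:-1] = inner + alpha * (
        grid[:-2, 1:-1] + grid[2:, 1:-1] +
        grid[1:-1, :-2] + grid[1:-1, 2:] - 4 * inner
    )

print(grid[1:4, 128])               # heat has diffused in from the hot edge
```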
The chip runs at 1.1 gigahertz, and each core runs at about 30 milliwatts. But again, it's 24 kilowatts in total, so try putting that into your Slurm manager and see how many people like it. Tenstorrent: this is Jim Keller, who I mentioned earlier. He's worked at Intel, AMD, Tesla, Apple, and basically built all the big chips that now sell in the billions. He now runs a company called Tenstorrent, which is an AI startup with about 700 million in funding. This is me interviewing him at their Santa Clara office with their second-gen and third-gen parts. I'm happy to say that they're now a client of mine, which is kind of fun. But the whole point of their chip is: compute is fine, but the important thing is going to be networking. Each one of these chips at the bottom here, I think it's this generation, has sixteen 100-gigabit Ethernet ports on it. And the idea is that you connect that to either 16 chips or 64 chips in a big array, depending on your topology, such that you have a system that can look like one massive 2D array of tensor cores to do your AI compute. Obviously, when you have 16 connections per chip, that's a lot of cables. I asked him how much the cables cost relative to the system, and he said about half. But the point here is that if you're not doing wafer scale and you've got to think about big compute, you need lots of networking. I mean, how many compute problems are bandwidth-limited, either memory or networking? This was essentially designed to try and help with that. But they've got a couple of generations further in the works, one of which has a high-performance RISC-V core, which I've got a video on if you're interested. And yeah, there's a lot of money coming in; there's some money coming out. Okay, I took Graphcore out, but anyway. So Graphcore, an AI company in the UK, one of the first to market with AI chips. They're at, what, 700 million in funding. The latest report said that they have a revenue of 5 million and a burn rate of 200, which should be the other way round. But one of the reasons for that is that, although they were the first on the scene, they didn't pre-empt transformer networks, which are big in AI right now; they were more focused on the previous generation of networks, CNNs. So they need to pivot quite quickly to a next-generation chip. But they do some funky things with packaging and some of us are hopeful. So, in the last few minutes, let's go through some roadmaps on hardware. Intel just did their DC AI, data center artificial intelligence, roadmap, showing that today we have fourth gen, which is the Sapphire Rapids I showed around, and we've got a fifth gen coming. Next, I think it's going to be called sixth gen, but it's the second line here that's going to be interesting. Most of the time we're used to just having performance cores: one core design across the whole chip, and it doesn't matter where the workload lands on the chip, you know the performance of that core. Intel, in the latest consumer products, followed what mobile's been doing and they now have efficiency cores. These are lower-performance, but also lower-energy, higher-efficiency cores. They're not going to mix and match on server, thankfully, but they are now going to produce chips which are just the efficiency cores. So these cores are physically smaller, but the idea is that you get more compute per dollar, and that's apparently what cloud providers have been asking for.
I don't know how much relevance they're going to have in HPC yet, but there's been such a demand for it that Intel, and AMD as we'll see in a second, are going down this route of having, instead of the 56 high-performance cores that we're used to, 144 efficiency cores, which obviously has considerations for memory and networking and such. As I said, Intel has a lot of hardware going on right now. That was CPUs; on GPUs, the Ponte Vecchio, the 47-tile one, they actually just canned the next generation of that, called Rialto Bridge, and now we're going on to Falcon Shores, which is meant to be this CPU-GPU-memory mix. There's a lot going on. They have hardware called Gaudi, which is being used by Mobileye right now; that's very much AI-focused, FP8- and FP16-based, and then they have the FPGAs as well. And they're also accelerating manufacturing. If you listen to the news, the five process nodes in four years: Intel's going from its slowest process node ramp to its fastest process node ramp ever. And the person behind it is a very strong-willed Irish woman called Ann Kelleher, who I believe, because she's the person who speaks with authority. I have an interview with her as well, which you should definitely watch. NVIDIA also have a roadmap, though they're not too keen on saying much about it. We've just had the Hopper H100 launch. They're also starting a new CPU program, in the middle here, called Grace. That's going to be ARM Neoverse-based. And then they also acquired Mellanox recently, so now they also have Mellanox networking that's going to be built into the GPUs, or at least onto the same card as the GPUs. And this slide is showing Grace Hopper. So Grace, their new ARM-based CPU, paired with Hopper, their latest launched GPU: you have a 900-gigabyte-per-second interface between the two. So you don't buy a CPU and a GPU now, you just buy a single package and that's it in your system. Or, if you just want the CPU, you can buy two CPUs together; each one of these has 72 ARM cores, so you have a total of 144 cores, and they've done some estimated SPECrate numbers as well. The idea is that these will both be available actually fairly soon. And if you want a diagram, it kind of looks like that. They're doing some interesting things with how each accesses memory, and unified memory tables as well, which is going to be nice and complex if you have people sharing a system. AMD: so I passed around the Genoa chip we saw with all those chiplets. They're going to do a cloud-native version with 128 cores. These are the smaller, more efficient cores. They haven't announced exactly what it's going to be called, but we think it's just going to be a reduced-cache core version, because some workloads don't need that much cache. Genoa-X is their V-Cache version. So imagine having eight cores with an additional 64 megs of L3 cache on top; for your memory-limited workloads, that turns out quite well. They did Milan-X last generation and Genoa-X is the new generation. And they're doing a telco-focused version called Siena. This generation upgrades to DDR5, so you've got that update as well. No HBM version yet; their counter to HBM is essentially the V-Cache, which is a halfway house, but not really, but it's a lot cheaper. And then on the GPU side: I went through MI200, and MI300 is their new APU. So that's CPU plus GPU plus HBM all in the same package, kind of like Grace Hopper, but AMD's variant. And they're building the software stack to do that.
That should be coming out later this year. And then we have ARM. I haven't mentioned ARM much; ARM don't speak to me much these days, but they have their platforms: the V series, that's for high-performance HPC; the N series is for networking, so Amazon's using a variety of those; and E is for efficiency. And then the other ARM users include Nuvia, now Qualcomm; they're going to be doing a chip, but they're custom-designing their own core because of ARM's licensing: you either use one of ARM's cores or you just take the instruction set. Tenstorrent also has a roadmap. I showed you Grayskull and Wormhole in that picture with me and Jim Keller; they're working on Blackhole and Grendel. And their whole thing here is: let's make a 128-core RISC-V chip for servers, and let's also make it a chip which you can attach memory or additional compute to, like GPU or AI. But that's more in the 2024 timeframe. And yeah, I'm going to wrap up now. Software: CUDA. This is my weak spot, I don't keep up with the software, but CUDA is extensive. It's been around for goodness knows how long, and this is really why NVIDIA have the hold on the market: lots of libraries. AMD has ROCm; you must have heard of ROCm before. It took a long while to get ready. It's still kind of, yeah, I'll leave it at that. But their key here is that they want to translate CUDA code to AMD using HIPify techniques. The one thing I liked is the extensive release notes for each version, but I still get comments every time I mention it that it doesn't really work. So, up to you. Intel has this oneAPI concept, where they're using a combination of C++ and SYCL called Data Parallel C++. The whole concept of oneAPI is, it's a bit pie in the sky: it's write one set of code and you can compile that to any hardware underneath that you want, with no additional optimization, as long as, when you call a library, there's a variant of that library for the hardware that you're using. So they're putting lots of people, lots of resources behind that. They're making some of it open source, and yeah, enabling code reuse across architectures and vendors: CPU, GPU, everything, basically everything that Intel makes, and Intel's trying to bring everybody else under their umbrella. So that means oneAPI versions of all the software stacks, so yeah, write once, compile often. How hardware-agnostic can it really be? But it has required a proper ground-up redesign of Intel's software support. And then you have the HPC toolkit, and different licensing versions for all of these make it all nice and complex. Speak to your supplier. And there are things I'm not covering in this talk because of time and knowledge and talent. Interconnect: InfiniBand versus Ethernet versus GPU-to-GPU; there's a lot of discussion going on about this. IBM is working on sharded data parallel concepts where you don't have things like an all-reduce in AI. Integrated optics and photonics: I spoke a little bit about optical compute, but what really excites me is actually optical networking and photonics, getting chip-to-chip communications with light, and Lightmatter is a good company doing that. And we have new paradigms like CXL, where we can attach more memory to a system and do funky things like big 42U memory servers, which is fun. Storage: I didn't cover Optane; Optane is fun, and I still want Intel to send me some even though it's technically a dead product now. And then HBM versus DDR, and we're also seeing a class of compute-in-memory. And then, what the hell is China doing?
There are comments that they actually had the first exaflop computer. Some people saw results, Jack Dongarra maybe one of them, but they didn't actually submit to the Top500. And the numbers they do have in Gordon Bell papers are reduced-precision numbers, which complicates the matter. Again, another plug: this is the AI Hardware Show. It's me and Sally Ward-Foxton of EE Times. In the first episode we covered these products; we're covering 72 products through the whole of the series. The idea is a 15-minute episode every week followed by a 30-minute after show, sort of a free-form talk. The second episode just went out, and that includes Graphcore and Intel GPU Max. And we've got things like Tesla Dojo, Cerebras, all the other Intel products, all the other AMD products also coming up on that. And yeah, time for Q&A, but because Ken asked for it: I wouldn't be an influencer if I didn't have merch. Unfortunately, I can't sell it here because of tax reasons, but there's a 20% off code up here. If you really want a mug, there you go. Thank you. All right, thanks a lot, Ian. We do have time for a couple of questions; I'm happy to give up some minutes of my own talk for that. Do we have any questions in the room? Yeah. I'll give you this silly small mic, just for the remote people. Yeah, another question; use the mic. I think you mentioned two CPUs on the Shasta MI250X node. That's not right: there's one CPU for four GPUs, but there's two nodes on there, so there's two CPUs and eight GPUs on a blade. Yeah, yeah, yeah. Because there's a very special connection between the CPU and the four GPUs. It's the Trento CPU. Yeah, which is really just Milan with a very slightly changed IO die. Yeah, yeah. So on that: well, one of the things that Intel and AMD do is they sell a lot of custom CPU variants. We're used to custom CPUs in the sense of different core counts, different cache sizes, and those get sold to all the hyperscalers, Amazon, Azure, Google, what have you. And that actually accounts for 60% of AMD's volume and Intel's volume. And I asked the appropriate people, is this trend going to continue? And they said, yes, we expect more of our CPUs to be custom variants. And with Trento, it's actually a hardware customization, and I think we're going to see more of that, which just complicates things. But it's, yeah, probably only a very slight modification. The whole point about chiplets is that you can be a lot more custom. And that's the hope, well, from their side; from the outside, it just complicates things too much. Yep. You showed that there's a lot of money there, so if you had to invest your money, which chip would it be on? So, the thing is that all the companies I showed are VC funded, they're not publicly traded, so you'd have to be part of a VC fund. It's tough. The conversation I always have with Sally is about training versus inference. Training is where the money usually is, and it's very hard to compete with NVIDIA; NVIDIA has 90% of the market, so you're fighting over scraps. And it's the companies that have the value-add over the others: what are they doing differently to NVIDIA to help them carve out their niche? Cerebras is doing the whole wafer-scale thing, which is doing really well. Tenstorrent has that scaling solution, which sounds very promising. And I've highlighted the ones that I think have something unique. So SambaNova as well, with their CGRA architecture.
I think the architecture gets a bad rap with people in the community because the utilization looks low - you're cycling data around and computing in the chip rather than taking it off and doing something like a systolic array. So yeah, in terms of startups, those are the ones that I gravitate towards when I want information. There are also companies like Groq, who have this very specialized chip for batch-1 inference; they've got a deterministic chip, basically, and that's really impressive. But the question is always customers. And the key thing to look out for with customers for these companies: are they giving their chips away to academics? If they are, it's because they don't have corporate customers, they don't have B2B customers, and that's, I think, the rabbit hole that Graphcore is going down. When I speak to Cerebras and Tenstorrent, they say, yeah, no, we've got installations at universities, no problem, because that helps us get the chip out and people actually using the chip and the software stack - and that's fine, because they have a bevy of B2B customers to talk about. Whereas there are a lot of companies that don't, and they'll say, our customers want to keep quiet. Really, not even one? Anyway, yeah.

All right, thanks. I think we have another question. Sorry, was it... ah, I understand - about Sapphire Rapids for AI. So the point is: one, it's a host, so you don't need anything else. Two, the new instruction sets in Sapphire Rapids - you've got the INT8 operations, DP4A, VNNI - the whole point is, you know, you go down the reduced-precision route, which, yes, other chips do, but again, it's the single-chip solution. And I think with Falcon Shores, and the MI300 with a CPU plus GPU plus memory on top, the whole thing is: can you do everything in one chip? Because you need CPUs to be the back end anyway.

So one thing I didn't say in the talk about AI is that a lot of AI inference is still done on CPUs, even though there are lots of GPUs out there used for training. If you look at the scale of all the AI workloads, from inference to training, just in terms of complexity: GPUs, yeah, you can use for all of it; CPUs you can use for all of it. But if you actually look at what's used today in the data center, in the cloud, CPUs are used for most of the inference. And it's because you can do things like INT8 operations and they're accelerated; to a certain extent they're also picked up in frameworks like TensorFlow and so on, especially with DP4A and VNNI - there's a small sketch of what that looks like below. So if you look through the Intel marketing materials - and again, I apologize in advance - if you look at most of the metrics they advertise, OK, there's some MLPerf, there's ResNet-50, but it's mostly inference. So think about it in terms of the inference, think about it in terms of the end customer wanting a platform that just does it and not wanting the hassle and additional overhead of supporting GPUs.
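To make that DP4A/VNNI point concrete: on CPUs with AVX-512 VNNI, the INT8 inner loop of a quantized inference kernel boils down to the vpdpbusd instruction. A minimal sketch at the intrinsics level - in practice frameworks and libraries such as oneDNN generate this for you, and the build flags here are just an assumption:

    // Minimal sketch of an AVX-512 VNNI INT8 dot-product step.
    // Assumes a CPU with AVX-512 VNNI; build with something like:
    //   g++ -O2 -mavx512f -mavx512vnni vnni_sketch.cpp -o vnni_sketch
    #include <immintrin.h>
    #include <cstdint>
    #include <cstdio>

    int main() {
        // 64 unsigned 8-bit activations and 64 signed 8-bit weights per register.
        alignas(64) uint8_t act[64];
        alignas(64) int8_t  wgt[64];
        for (int i = 0; i < 64; ++i) { act[i] = 2; wgt[i] = 3; }

        __m512i va  = _mm512_load_si512(act);
        __m512i vw  = _mm512_load_si512(wgt);
        __m512i acc = _mm512_setzero_si512();

        // vpdpbusd: multiply u8*s8 pairs, sum groups of four, and accumulate into
        // sixteen 32-bit lanes -- the core of an INT8 GEMM inner loop.
        acc = _mm512_dpbusd_epi32(acc, va, vw);

        alignas(64) int32_t out[16];
        _mm512_store_si512(out, acc);

        int64_t total = 0;
        for (int i = 0; i < 16; ++i) total += out[i];
        std::printf("dot product = %lld (expect 64*2*3 = 384)\n", (long long)total);
        return 0;
    }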
Intel does a lot of work in optimizing code as well with its key partners, and if your code is optimized for one player, it doesn't give you much opportunity to move off to somebody else either - that's like a 10-year commitment sort of thing. Intel is in a rough position with Sapphire Rapids just in general - it's fewer cores than AMD. It's very focused in some of its performance peaks, whereas AMD is a bit more general, but inference, I think, is something that they've got a hold on, because they also have a very extensive software stack and optimization in there as well. And they will shout at you for ten hours about it if you want.

Moving to the next question - I'll read this one in: you gave a great overview of the hardware, but what does the software look like for all of these new chips, and what is it going to take to unseat CUDA? Yeah, this is the eternal question: will anything ever unseat CUDA? And the answer is probably no. We have companies on that list like Graphcore - first chip in 2016, and if you believed them in 2016, their software was ready; apparently now they're only just getting to version 1.0, so where exactly is it on that scale? A lot of the problem with these companies, especially the venture-funded ones, is that they'll have an idea and they'll start working on software and hardware - maybe they'll have two-thirds or three-quarters software engineers versus hardware engineers, like Tenstorrent does - but then they'll pick up a customer, a very important customer that's giving them their first revenue, or actually any revenue at all, and then they'll allocate time to helping that specific customer with their specific issue rather than creating a more general platform for everybody to use. I think the advantage that a lot of these companies have is that PyTorch, TensorFlow, ONNX - standard AI frameworks - exist. So if you go towards those targets, then all you're dealing with is the underlying compiler and whether that works, and obviously additional tweaking beyond that. And we've got companies now talking about whether there's a PTX-like variant for some of these AI chips - you know, the sort of lower-level language, rather than just dealing with high-level CUDA - so that is still going on. A few of the companies that aren't doing so well are the ones that are compiler driven: they're trying to maximize efficiency through their compiler rather than simply through the architecture, and historically, companies that go compiler-first don't tend to do well. So yeah, NVIDIA - my route into CUDA was that I took the first ever CUDA course in the UK. It was run by a finance professor who had some money from NVIDIA; he did a free course, and I went on the same course the next year just to shore up my knowledge. And that was, what, 2009? So we've got 14, 15 years of dedicated, university-driven training in CUDA. The only other company that can even come close to that is probably Intel, and everybody's still focused on x86 there. So if you want to get x86 people onto Ponte Vecchio or a GPU compiler, you know, it's going to be very difficult.

And the first one - you confused me a little bit with the integers; I thought 255 was the next one, but I was just going off the diagram. Then my colleague Alan O'Cais up here explained a bit about the formats, so you have pretty much all the numbers up to a point, and then, of course, yeah... So, do you remember where I mentioned the Microsoft FP12 variant, where it's just a bunch of FP4-style mantissas and then a fixed, shared exponent? What some companies are doing with those formats is applying an offset, a bias, to the range. So OK, you can only do, say, 0 to 256; but if you apply a bias of 4,000, it's actually 4,000 to 4,256. Again, Tesla's doing a lot of that - I should have shown a slide, because they're doing extensive work on getting the right floating-point format. So when you do your typecast to your reduced precision, if all the data was outside your reduced-precision format, you're just going to clamp at zero, clamp at one, or, you know, at the min/max. But if you apply the bias, then maybe you can capture all the data. And then you have to manage dealing with packets of data that are either in the same format or in different formats, and how they're computed, and that gets difficult very quickly.
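To put some numbers on that offset/bias idea: storing values that live around 4,000 in an 8-bit code only works if you subtract a bias first (and keep a scale), and clamp whatever still falls outside. A minimal sketch of the arithmetic - essentially asymmetric quantization with clamping; the struct and function names here are illustrative, not any vendor's API:

    // Minimal sketch of the "offset/bias" idea for reduced precision:
    // instead of storing values near 4000 directly (impossible in 8 bits),
    // store (value - bias) / scale as an 8-bit code and remember bias and scale.
    #include <cstdint>
    #include <algorithm>
    #include <cstdio>

    struct Quantized8 {
        uint8_t q;      // stored 8-bit code, 0..255
        float   bias;   // offset applied to the representable range
        float   scale;  // step size between adjacent codes
    };

    Quantized8 quantize(float x, float bias, float scale) {
        // Shift by the bias, scale into 0..255, and clamp anything outside.
        float shifted = (x - bias) / scale;
        float clamped = std::clamp(shifted, 0.0f, 255.0f);
        return { static_cast<uint8_t>(clamped + 0.5f), bias, scale };
    }

    float dequantize(const Quantized8& v) {
        return v.q * v.scale + v.bias;
    }

    int main() {
        // With bias 4000 and scale 1, the 8-bit code covers roughly 4000..4255
        // instead of 0..255, so data clustered around 4100 survives the cast.
        float data[] = { 4000.0f, 4100.5f, 4255.0f, 3990.0f /* below range: clamps */ };
        for (float x : data) {
            Quantized8 q = quantize(x, /*bias=*/4000.0f, /*scale=*/1.0f);
            std::printf("%8.1f -> code %3u -> %8.1f\n", x, (unsigned)q.q, dequantize(q));
        }
        return 0;
    }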
Yes. Yes - Xeon Phi also went for lots of efficient cores, but that wasn't exactly a success, so why is this different? Yes, it's different because the cores don't suck, simply put. So with the Xeon Phi stuff you're talking about: initially they used the P54C core, you know, the Pentium variant, and then they used the Silvermont Atom cores, which were terrible - the Atom cores, you know, have a reputation for being terrible. With the E-cores this generation, I think we're six generations on from those Silvermont cores, and that has given Intel an opportunity to do a blank-page design. If you go into the microarchitecture of the core, it actually looks like one of the modern P-cores, just reduced in certain aspects. So, for example, there may not be a micro-op cache, but now we've got dual three-wide decoders which can act as a six-wide decoder, versus the six-wide decoder that already exists in the P-cores. So there are different optimization points for efficiency, but the cores look relatively similar now, and AMD's difference is, we believe, just the cache levels for their E-core variants. But yeah, I've got chips, consumer chips, now that are just E-cores, and a modern E-core performs the same as a Skylake core from 2016. So it is performant - it's not today's performance, it's six-years-ago performance, but it's better than being 20-years-ago performance.

Sorry, yes. Yep, you've been reading my Twitter feed. So the question was about posits, the number formats where you have more than just sign, exponent, mantissa - lots of other stuff as well. A company did reach out to me called VividSparks, and apparently John Gustafson, the inventor of posits, is on the board. So stuff exists, things are happening. I haven't looked into it much; I think VividSparks, for example, is fully self-funded at this point, or very early seed-backer funded, not a proper series A funding round. And part of that is also the utility: understanding where they can be performant and for what sort of workloads they can be useful. Obviously posits, because of their additional bits to describe different things, can be more versatile, but then you're just dealing with more bits anyway, so how do you get around that as well? We've had 50 years of dealing with floating-point and integer numbers in the formats we have; we have zero years of posits. So it's very difficult for them to come in. It's the same with different lithography techniques and stuff. But yes, did you say you had a second question?

OpenCL exists. I haven't kept up to date with it as much. Yeah, so the question is: where has OpenCL gone? Yeah, it exists. It hasn't disappeared like, let's say, C++ AMP did. Whenever there is a question about Data Parallel C++ or SYCL, OpenCL is always mentioned in the same breath, or Vulkan, or so on. So it is still mentioned. And all the diagrams that I showed - the NVIDIA one, the AMD one, and even the Intel one - they all mention it, so they all still have things that support it. And obviously the libraries try to be agnostic, and the compilers just try to deal with it.