It's more like a presentation on the TPU, the recent tensor processing unit from Google. If you don't know me, my name is Brahim Ahmadishaf; you can Google me. I'm working for ASTAR, and I mainly do FPGA work. So this presentation is basically about a paper from Google researchers that is going to be presented at the 44th International Symposium on Computer Architecture in Canada. I will go through the motivation, some photos, the internals, the performance, then give you a summary, and also point you to the patents referenced in the paper, for technical people who want to know how Google is publishing more details about the technology behind it. This paper really has a lot of authors, as you can see from the list. It's online; the link is here. Later on I will put the PDF of this presentation online, so all of you can download it and try to read the paper.

The main motivation behind developing this TPU is that Google's data centers have a lot of demand and workload. About 95% of their neural network inference workload consists of multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), and long short-term memory (LSTM) units. Google wanted to develop an ASIC, a real alternative to CPU and GPU hardware, to provide peak performance. The paper explains the details of this TPU and compares it in terms of peak performance, watts, and cost.

So from a research point of view, the whole story is really a comparison of CPU and GPU versus ASIC. They compare a server-class Intel Haswell processor and an NVIDIA K80 against their latest development, which is this tensor processing unit. Overall, the paper presents the TPU but mainly focuses on the performance comparison. You've probably seen it in the news.
So the TPU is a very small unit. It has a processor; if I can point at the TPU here, it's behind this heat sink, and then you have the memory chips around it. From an electronic design point of view, it's really a big processor in the middle, some RAM, a lot of circuitry here dealing with power, and then on the right, a PCIe interface. From the paper, they say it basically slots in like a SATA disk. So you transfer data in, you get results back, and that way you accelerate your neural network processing.

Now the internals: on the left you have the PCIe Gen3 interface, which talks to a host. After that you have a unified buffer, which is more or less the memory, and a systolic data setup, which drives the heart of this TPU: a matrix multiply unit containing 65,536 (64K) multipliers. One of the main limitations of this TPU is that the multiplication is actually 8-bit by 8-bit, so this sort of hardware cannot really be used for training; it's more for production settings. But recent papers have shown that even by reducing or quantizing the coefficients of those neural nets, you can still achieve good performance. Ultimately, this is a very custom-designed chip built to achieve those numbers. Then you have 32-bit accumulators, activation units, a whole lot of control around them, and access to some DDR3 memory.

In terms of area, they mention that the memory here is 29% of the chip and the matrix unit 24%, and the rest is small, 6% and so on. That's the usual design exercise of looking at the area of your chip when you design these babies. Ultimately the control feeds data in, the accumulators collect results here, and you get 256 accumulations every cycle. It's really a chip that's been designed to do this tensor operation. Later on, they compare that to a CPU and a GPU.
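The arithmetic described above, 8-bit by 8-bit multiplies summed into 32-bit accumulators, can be sketched in NumPy. This is only a software analogy of what the matrix unit computes; the function and variable names are illustrative, not the hardware interface:

```python
import numpy as np

def int8_matmul(weights, activations):
    """TPU-style quantized multiply: 8-bit operands, 32-bit accumulation.
    A software analogy only; names and shapes here are illustrative."""
    # Widen to 32 bits before multiplying, so the sums of 8-bit x 8-bit
    # products cannot overflow -- the role of the TPU's 32-bit accumulators.
    return weights.astype(np.int32) @ activations.astype(np.int32)

# Toy example: a 256x256 weight tile (the matrix unit's dimensions)
# times a small batch of 8-bit activation vectors.
rng = np.random.default_rng(0)
W = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)
A = rng.integers(-128, 128, size=(256, 8), dtype=np.int8)
acc = int8_matmul(W, A)
print(acc.dtype, acc.shape)  # int32 accumulators, one column per input vector
```

The real chip does this systolically, pumping operands through the grid of multiply-accumulate cells each cycle, but the arithmetic result is the same as this plain matrix product.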
It's really a customized, very hyper-customized chip. They do a lot of one-to-one comparisons. For example, you can see that it runs at 700 MHz; a CPU usually runs at a much higher frequency, and the NVIDIA card at a lower one. But in terms of power, of course, it consumes less, and in terms of memory it has quite a large on-chip memory to hold some of the data. It's really a properly customized chip to achieve what they want.

They look at performance mainly using watts per die versus the target workload. The comparison shows that for the TPU and GPU this figure stays quite constant, while the watts per die for the CPU go up and up and up as more load is put on it. They also look at the performance of the GPU versus the CPU, the TPU versus the CPU, and so on, and here we can see that the ratio of performance relative to watts reaches 196. So it's really shown that these chips consume far less power per unit of performance compared to a normal processor.

From a research point of view, this is a really interesting graph: it's called a roofline. The performance curves have this sort of roof, and it maxes out at, I think it's tera-operations per second, on a log scale. This is the limit of the TPU, and it's much, much higher than the limit of the CPU or the GPU. And since those are log scales, the difference is actually amplified. For people who want to know a bit more about the roofline model, there's the paper from Samuel Williams; I can provide links or PDFs if people are interested in more detail. They also look at the specific neural networks they use in production and plot the performance for the different cases, and we see that the TPU really achieves the maximum.
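The roofline model mentioned above is simple to compute: attainable performance is the minimum of the compute peak and memory bandwidth times operational intensity. A minimal sketch, using the paper's headline TPU figures (~92 TeraOps/s peak at 8 bits, ~34 GB/s DDR3 bandwidth) as approximate inputs:

```python
import numpy as np

def roofline(peak_ops, mem_bw, intensity):
    """Attainable ops/s under the roofline model (Williams et al.):
    min(peak compute, memory bandwidth x operational intensity)."""
    return np.minimum(peak_ops, mem_bw * np.asarray(intensity, dtype=float))

# The paper's headline TPU numbers, treated here as approximate:
PEAK = 92e12   # ~92 TeraOps/s (8-bit operations)
BW = 34e9      # ~34 GB/s from the DDR3 channels

for oi in (10, 100, 1000, 10000):   # operations per byte fetched from memory
    # Low intensity rides the slanted memory roof; high intensity
    # flattens out at the 92 TOPS compute peak.
    print(f"OI={oi:>5}: {roofline(PEAK, BW, oi) / 1e12:6.2f} TOPS")
```

This is why the production workloads plotted under the roof matter: many of them sit on the slanted, memory-bound part of the TPU's roofline rather than at the compute peak.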
So, summary of this: the TPU is really an ASIC, a very specifically designed chip for tensor operations. It's basically a big matrix unit, 256 by 256. What we have to remember is that this is 8-bit multiplication; it's not like a GPU or CPU, which have floating-point multiplication. Overall, on average, it's 15 to 30 times faster than the GPU or CPU, and its performance per watt is about 30 to 80 times higher. So it's really a much higher-performance card to have in your production setup. Looking at the future, they also mention that they are only using DDR3, while GPUs these days use GDDR5, and they estimate that switching could roughly triple the tera-operations-per-second performance. So ultimately they'll get even more performance out of this chip.

These are the patents they refer to in the paper, I think for people who are interested in the details. A lot of them are really methods related to TensorFlow: the batch processing, the vector computation, and really a neural network processor; parallelization of the convolution aspect here; and the last one is computing a convolution. Prefetching weights is more of a hardware architecture one: how do you actually do that quickly. And this one is rotating data for neural network computation, so I guess this is more on the training side, or maybe it's related to the batching. I haven't read it; the PDF is not available, only the text, from Google Patents.

So just as a summary: we may not even have access to it yet, but ultimately they may develop some TPU that they will commercialize, and the GPU may become a bit obsolete in the future for these specific applications, because in terms of power, and in terms of compiling a model onto a GPU versus onto one of these, they are really customized for the tensor. So that's my short lightning talk. Thank you very much.
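The claimed tripling from faster memory falls out of the roofline arithmetic: for memory-bound workloads, achieved performance scales with bandwidth until the compute peak clips it. A rough sketch; the GDDR5 bandwidth and the operational intensity used here are illustrative assumptions, not numbers from the paper:

```python
def roofline(peak_ops, mem_bw, intensity):
    """min(compute roof, memory roof) -- the same model as in the paper."""
    return min(peak_ops, mem_bw * intensity)

PEAK = 92e12        # ~92 TeraOps/s (8-bit), from the paper
DDR3_BW = 34e9      # the TPU's actual DDR3 bandwidth, per the paper
GDDR5_BW = 180e9    # assumed GDDR5-class bandwidth (illustrative figure)

OI = 1000           # an illustrative, memory-bound operational intensity
speedup = roofline(PEAK, GDDR5_BW, OI) / roofline(PEAK, DDR3_BW, OI)
print(f"{speedup:.1f}x")   # roughly the "tripling" the authors describe
```

With these assumed numbers the DDR3 roofline caps out at 34 TOPS while the GDDR5 one reaches the 92 TOPS compute peak, about a 2.7x gain, which is the flavor of the estimate in the paper.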
So in the future there might be some other talks where I try to build a small FPGA equivalent of this, and see whether, ultimately, I can run a TensorFlow network, later dump it onto an FPGA, and actually do it in a real-time version, really on the hardware, rather than on a TPU. That's it.