Okay, so we have another lightning talk now, on getting started with FPGAs for packet processing.

Okay, thank you for joining this lightning talk. I want to talk with you about some of the mysteries of hardware acceleration for the data plane. Let me introduce myself. My name is Mirak Balukiewicz. I work as a solution architect in the Programmable Solutions Group at Intel. My daily job is to work with customers on optimizing their workloads, especially in the networking space. That is a short introduction about me.

What is a data plane? Some time ago I visited the Computer History Museum in Mountain View, California, and I found a very interesting thing: the first network card ever designed. It looks like this, with the cables connected to it. Today we at Intel are making a very similar thing, maybe a bit more complex: an FPGA card that we call the PAC, the Programmable Acceleration Card N3000, which is designed especially for data plane acceleration. I want to give you a very, very short introduction, a 101 session on how to use this card and how to build a good solution using this kind of hardware acceleration.

Let's start from your problem. I worked for many, many years as a software designer. My work was optimizing software and finding good solutions. Problems in software can have many, many sources. When we think about hardware acceleration, you might think: okay, encryption is the first candidate. But it's not only encryption. You can have different problems: problems with jumps, problems with loops, problems with PCI access.
And first of all, you can have problems with how the cache is used on the CPU. So what is the first step to optimize the software, or to use the hardware? The first step is to find the hotspot and understand how your system works. Finding the hotspot is very important for identifying the problem you want to solve with the FPGA. The FPGA has a limited size, and not every problem can be solved with it.

This is an example of your problem. You are the engineer or the designer, and you want to find some issues, or make sure that your very, very complex system works correctly. Don't try to put everything inside the hardware. Don't try to put the whole problem, with all its complexity, into the hardware. Today it is impossible. Take, for example, the Linux kernel: some people may tell you that it will be possible in the future, because we have new languages for programming FPGAs. Don't believe them. It is impossible to put something that complex into hardware. What is possible is to take some part of the problem, put it into the hardware, and keep everything around it in software.

What are the tricks to optimize the solution? First of all, look carefully at how the data moves between the hardware and your software application. When you work with hardware acceleration, you really have two hardware components. One is your CPU, where your program runs, where the main program is executed. The other is the FPGA, where you try to accelerate some part of the problem. Let's think about what could be the bottleneck. Let me give you an example. Some time ago we tried to accelerate a machine learning function.
You know that machine learning, or deep learning, works like this: you have a small amount of data, and this data passes through a set of tables which use a very, very large amount of memory, and you make a large number of lookups. The data goes through a long chain of tables, and with every step in machine learning the number of tables explodes. In some applications you can have even a hundred tables; your data goes through hundreds of tables.

When designing the FPGA for this model, we took a simple approach. We know that most of the operations on the tables are multiplications. So: multiplication in hardware must be faster than multiplication in software, and machine learning is multiplication. We implemented very, very good multipliers in the hardware. And what we saw was that software alone still worked faster than software plus hardware.

Why does this happen? The killing factor was moving the data between the software and the hardware over PCI. Our hardware is connected through PCI. Imagine that you have, for example, TensorFlow, and you have a packet. First the software has the packet. The software has data to be processed, and it must be moved to the hardware accelerator. The data changes location. So look very carefully at how the data travels. For example, transporting the data between the host and the accelerator can be slower than moving the same data inside the CPU using the cache.

There are additional kinds of problems, for example in data plane processing. I mostly work with telco people.
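To make the transfer-cost point from the machine learning example concrete, here is a rough timing model in Python. It is only a sketch: every number in it (operation rates, link bandwidth, link latency, payload size) is a made-up assumption rather than a measurement from the talk, and the function names are hypothetical.

```python
def software_time(n_tables: int, ops_per_table: int, sw_ops_per_s: float) -> float:
    """Time to run every table step on the CPU, with the data staying in cache."""
    return n_tables * ops_per_table / sw_ops_per_s


def offload_time(n_tables: int, ops_per_table: int, hw_ops_per_s: float,
                 payload_bytes: int, link_bytes_per_s: float,
                 link_latency_s: float, round_trips: int) -> float:
    """Time when the table steps run on the accelerator.

    Each round trip moves the payload to the device and back,
    paying the link latency once in each direction.
    """
    compute = n_tables * ops_per_table / hw_ops_per_s
    per_trip = 2 * (link_latency_s + payload_bytes / link_bytes_per_s)
    return compute + round_trips * per_trip


# All values below are illustrative assumptions, not vendor figures.
N_TABLES = 100            # tables the data passes through
OPS_PER_TABLE = 10_000    # multiplications per table
SW_OPS_PER_S = 1e9        # assumed CPU rate
HW_OPS_PER_S = 1e10       # assumed accelerator rate (10x faster multipliers)
PAYLOAD_BYTES = 2048      # small amount of data per request
LINK_BYTES_PER_S = 8e9    # assumed host-to-device bandwidth
LINK_LATENCY_S = 10e-6    # assumed one-way latency over the bus

sw = software_time(N_TABLES, OPS_PER_TABLE, SW_OPS_PER_S)

# Case 1: the data crosses the bus for every single table.
hw_per_table = offload_time(N_TABLES, OPS_PER_TABLE, HW_OPS_PER_S,
                            PAYLOAD_BYTES, LINK_BYTES_PER_S,
                            LINK_LATENCY_S, round_trips=N_TABLES)

# Case 2: the data crosses the bus once and all tables run on the device.
hw_single_trip = offload_time(N_TABLES, OPS_PER_TABLE, HW_OPS_PER_S,
                              PAYLOAD_BYTES, LINK_BYTES_PER_S,
                              LINK_LATENCY_S, round_trips=1)

print(f"software only:          {sw * 1e6:.1f} us")
print(f"offload, trip per table: {hw_per_table * 1e6:.1f} us")
print(f"offload, single trip:    {hw_single_trip * 1e6:.1f} us")
```

With these assumed numbers, the faster multipliers lose to plain software when the data crosses the bus for every table, and win only when the number of round trips is small.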
Telco people are crazy about counters. Everyone knows the mobile phone, and the most important part of the telco's job is charging you. Counting you, counting you, counting you, and counting your data. And it is a very, very big problem for them. Implementing counting, or statistics, in software is a terrible job because of the software architecture. If you want counters, for example for your mobile phone, you must create counters which can be accessed from many cores of the CPU. That creates many critical sections. I hope that everyone here is a software person, and maybe I'm not the one who is making the hardware. Implementing these critical sections means implementing tons of atomic operations, which are terrible for cache access. You can lose all your parallelism on the CPU with that kind of massive use of critical sections.

The solution is to move some of the counters to the hardware: not all counters, but the ones which are most painful. We sometimes call them aggregate counters, the ones which must be handled by many cores. The hardware processes the data serially, so the software no longer has this problem and can do most of its processing in parallel, using the cache very, very efficiently. That is one kind of problem.

An additional trick: do not put only a part of the problem into the hardware, into the FPGA. Try to solve as much as possible inside the hardware.
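Returning to the counter problem described above, the following Python sketch shows the structure of the issue. The first function uses one shared counter, so every increment is a critical section. The second keeps a private counter per worker and aggregates only at the end, which is a common software-side pattern and not the hardware offload described in the talk. The names and sizes are hypothetical, and because of Python's global interpreter lock this illustrates the structure only; it is not a benchmark.

```python
import threading

N_WORKERS = 4                # stand-in for CPU cores (assumed value)
PACKETS_PER_WORKER = 10_000  # packets each worker counts (assumed value)


def count_shared() -> int:
    """All workers update one aggregate counter under a lock."""
    total = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal total
        for _ in range(PACKETS_PER_WORKER):
            with lock:  # critical section on every packet
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def count_sharded() -> int:
    """Each worker counts privately; the aggregate is computed once at the end."""
    per_worker = [0] * N_WORKERS

    def worker(idx: int) -> None:
        local = 0
        for _ in range(PACKETS_PER_WORKER):
            local += 1  # no lock: nothing is shared in the hot loop
        per_worker[idx] = local  # each worker writes only its own slot

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(per_worker)


shared_total = count_shared()
sharded_total = count_sharded()
print(shared_total, sharded_total)
```

Both functions produce the same total; the difference is that the shared version serializes the workers on every packet, which is the cost the talk proposes to push into the hardware for the most painful aggregate counters.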
For example, going back to our machine learning example, we got very, very good results when we put the operations for multiple tables into the FPGA, so that most of the processing happened inside the FPGA. But of course not every machine learning algorithm can be put inside the FPGA. You must look for specific algorithms which, let me say, match the hardware capacity.

Normally, at university, when you learn programming, people think that resources such as memory or CPU cycles are infinite. Whatever you make, a table of one gigabyte or a table of 10 gigabytes, no problem. In hardware that is not the case. In hardware you may have maybe six mega of very fast memory, and maybe a few gigabytes of slower memory such as DDR. That is what is available to you. It changes your way of thinking.

What else can you do? An additional trick is to make sure that your FPGA, your hardware, is invisible to your application. Here is an example: Segment Routing v6 acceleration. I talked yesterday in more detail about what the problem is. I don't want to go into the details of how it works, but imagine the situation. On the left side you have the non-accelerated case: everything happens in software. The blue part is the part which forms a router or a switch, a software switch, and this blue part is visible to the user. In our accelerated solution we put the acceleration into the hardware, but we keep it invisible to the user, or to the application.
You see, in the model, the most important part, the application, that VM at the top, does not see the FPGA and is not aware that the FPGA is used at all. This is the most important point. I can make the perfect solution with the FPGA, but the big problem is how to deploy it, how to get users to use it. Imagine that you go to a data center with millions of servers and say: okay, I want to change millions of running applications, because I have created my one smart FPGA application and my application can use the FPGA. It won't work. You must hide your FPGA somewhere inside the system, so that it is invisible and everything apart from the FPGA is unchanged.

Okay, let me briefly show what we can achieve using the FPGA. We use the FPGA for the data plane to get better performance. What does better performance mean? A software solution for the data plane uses cores for packet processing. In some cases the software solution can use 14 cores, which is, for example, half of the cores in the server. With our accelerator we can do the same using four cores. That is three or four times fewer cores, leaving more cores for your own processing. This is just an example of what you can achieve, and without any change to the application. It is a solution made by our partner HCL with a customer who was doing real deployments, and they achieved these results with the customer. Everyone is happy: we have better performance, and the customer can run the application without any changes.

Thank you very much. We have a few seconds for one question. Any questions? Okay. Thank you very much.