Okay, our next talk is going to be about cooperative perception in future cars with GNU Radio, and Augusto Vega is going to tell us about this. Okay, thank you. So good afternoon, everyone. My name is Augusto Vega. I am a researcher at IBM's T. J. Watson Research Center in New York. It's my first time at FOSDEM, and I am finding it really exciting so far. The talks are quite technical, so I don't know if I will be that technical, but hopefully you will still find something of interest today. This is about a cool project that we are doing, connected in general to what the previous speaker was referring to: heterogeneous chips, heterogeneous architectures. But I will talk about the application that, in our case, is driving that, which is cooperative perception in connected autonomous cars using GNU Radio. So, very quickly, thank you very much to all the people involved in this project, my IBM colleagues as well as our university partners, and in particular the students, who are doing amazing work. Also, thank you to Dr. Tom Rondeau, the program manager of the DARPA program under which we are doing this work. The talk will be organized in this way: I will start by very briefly describing the EPOCHS project that we at IBM are leading with our university partners, and a little bit about this new era of heterogeneous chips. Then I will go from that broad concept to something more specific, one piece of that project, which is the EPOCHS Reference Application, or ERA, which is about cooperative perception in vehicles. Then I will focus on something even more specific, one element in that application, which is the 802.11p DSRC transceiver within ERA, and some optimization and acceleration opportunities that we have identified so far.
So let's start by saying that in 2018 DARPA started a program called Domain-Specific System on Chip, or DSSoC, whose ultimate goal is to develop a methodology that allows us to build heterogeneous chips very fast for an application domain of interest. When we say a domain, it's not just a single application; it's more than that. Actually, in our specific EPOCHS project, we talk about the super-domain of embedded processing for connected autonomous cars. More specifically, we say that the application is cooperative perception, including elements that come from the domain of computer vision as well as from software-defined radio. This is why we talk about a domain and not one single application. So what is cooperative perception? First of all, this is not a term that we have invented; many car makers are talking about this today. But the idea is simple. The way automakers have been advancing connected autonomous cars so far is by making each car as intelligent and powerful as possible, by putting more sophisticated sensors and more capable computation engines on board. We try to make the car, let's say, self-sufficient. But still, there are some limitations to that approach. For example, I like this simple example: there is a car trying to do object recognition, and it correctly detects two other cars in front of it, but it is also detecting some bicycles and also a person. If you look carefully, though, those are just bicycles strapped to the back of the car in front. This is a misclassification problem, pretty common in vehicles today. Clearly, even if we keep making the car very powerful, there may still be some of these problems that we need to think a little harder about how to fix. So what we would like to study, or let's say to propose, is a complementary approach of multi-vehicle cooperative perception, where cars can very closely interact with each other, right?
In addition to doing its own stuff, a car can interact with other cars to resolve, for example, this kind of ambiguity in real time while driving in an environment. Specifically, what we have in mind as part of this project is initially relatively simple. We want a car to create a representation of the world, in this context in the form of two-dimensional occupancy maps. A car will create these 2D occupancy maps with information about the presence or absence of obstacles in its surroundings, and the cars will exchange these maps in real time. More interestingly, we will fuse all these maps together: a car will have its own representation of the surroundings and will receive other maps from other cars, and we will try to fuse all these maps together to get a more precise, more accurate vision of the world. This is what we are doing right now in the current version of the open-source application that I will introduce. In the longer term, what we want to explore is something that we call adaptive swarm intelligence. This is something more complex than just exchanging plain raw data: we want cars to eventually learn from each other, exchange knowledge, things of that sort. But in general, what we believe is that the number of false predictions while a car is driving can be significantly lowered by benefitting from this swarm-based cooperative approach, compared to the car-centric-only approach that car manufacturers are basically following today. So this is the motivation for why we started building this application. Now let me very quickly say two words about the EPOCHS project, which is our IBM solution to the design challenge presented by DARPA's DSSoC program. As I said, it's not just about the application; it's about the methodology that takes an application domain of interest, in our case from the domain of connected autonomous vehicles.
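To make the fusion idea concrete, here is a minimal sketch. The cell encoding (-1 unknown, 0 free, 1 occupied) and the fusion rule are my assumptions for illustration, not ERA's actual code:

```python
UNKNOWN, FREE, OCCUPIED = -1, 0, 1

def fuse_maps(local_map, remote_maps):
    # Start from the car's own occupancy map and fold in each map
    # received from nearby cars: any observation of an obstacle wins,
    # while a "free" observation only fills in cells the local car
    # knew nothing about.
    fused = [row[:] for row in local_map]
    for remote in remote_maps:
        for i, row in enumerate(remote):
            for j, cell in enumerate(row):
                if cell == OCCUPIED:
                    fused[i][j] = OCCUPIED
                elif cell == FREE and fused[i][j] == UNKNOWN:
                    fused[i][j] = FREE
    return fused
```

With this rule, a car that never saw a region at all can still end up with an informed map of it, which is exactly the point of the cooperative approach.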
And it involves a series of steps to generate the underlying heterogeneous SoC that we need to execute that application while meeting some metrics of interest: performance, throughput, power efficiency. So it's about the methodology, and there are many steps involved. We are developing an advanced compiler as well as a scheduler. We are also studying some mathematically grounded ontology-generation mechanisms to more or less automatically identify which pieces of your application are worth accelerating in hardware. Instead of doing that by brute force, as we usually do, we need something that is more mathematically grounded; this is the ontology-generation part. Then we have the more hardware-related steps, for example the design of the accelerators, the NoC, and the memory architecture for the hypotheses generated by the ontology. The ontology tells us which parts of the software we should accelerate; then we need to determine what accelerators we need, how we connect them, and what kind of memory system we have to put in. Then we implement: initially, implementation in our case is just FPGA prototyping, but we also need to tape out this chip at least a couple of times to generate the final SoC. So EPOCHS is about the methodology, but today I will just focus on the application. Also, to give you a little bit of context, this is something that the previous speaker actually presented before: this is the full stack that is being addressed as part of the DSSoC program. This presentation is just about the application layer. Later today I will be presenting something about the scheduler as well, and there I will be touching this other layer here, the operating system. So let's go into the interesting part: the EPOCHS Reference Application, or ERA.
So ERA is an open-source application, available on GitHub, that basically implements this idea of cooperative perception in connected autonomous cars. There are two important parts in ERA: one is the communication fabric and the other is the sensing fabric. The communication fabric, as you can imagine, is all about communication, in this case vehicle-to-vehicle (V2V) communication using DSRC, a GNU Radio implementation of the 802.11p protocol. That is the top box in this diagram. Then we have the sensing fabric, which is about collecting data from the sensors in the car and generating this representation of the surroundings, which in the current version of ERA is a two-dimensional occupancy grid map. These two parts work very closely together, because this is how we allow a car to generate its own map, receive other maps from nearby cars, and fuse all these maps together in real time to get a better vision of the surroundings. That's the idea in very general terms. So ERA involves multi-modal sensing, although today it is mostly camera sensors. It involves the generation of local occupancy grid maps. It involves DSRC-based vehicle-to-vehicle communication, and finally real-time fusion of those maps. In the current version of ERA, version 2, available on GitHub, we are using a simulator called Gazebo. I don't know how many of you are familiar with Gazebo; raise your hands. So Gazebo is a pretty detailed physical simulator for robotics environments. This is what we are using in the current version to simulate an automotive scenario, although our cars in this case are robots, and our streets in this current version are just a simple 3D world where these robots move around, basically. We will change this in a future version of ERA. So we have Gazebo, where we have these robots with depth cameras attached, collecting information from the world. This is our world simulator.
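The core of the sensing step is turning sensor readings into grid cells. Costmap2D does this properly with ray casting, inflation, and so on; as a toy illustration of just the rasterization step, here is a sketch where obstacle points from a depth camera are dropped into a 2D grid. The function name, the metre units, and the car-at-centre convention are my assumptions:

```python
def points_to_grid(points, size, resolution):
    """Rasterize obstacle points (x, y), in metres, into a size-by-size
    occupancy grid. The car sits at the centre cell; a cell becomes 1
    when at least one obstacle point falls inside it."""
    grid = [[0] * size for _ in range(size)]
    half = size // 2
    for x, y in points:
        # Convert metric coordinates to grid indices around the centre.
        i = half + int(round(y / resolution))
        j = half + int(round(x / resolution))
        if 0 <= i < size and 0 <= j < size:
            grid[i][j] = 1
    return grid
```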
Then we have a block called Costmap2D, which is part of the Robot Operating System, ROS. How many of you are aware of ROS? A little bit more, okay, we are getting closer. So ROS is not an operating system by itself, actually; it's a robotics software infrastructure that provides many libraries to build robotics applications very quickly. We are using one of its available modules, the Costmap2D block, which allows us to generate these two-dimensional occupancy maps from the data collected by the depth camera on the robot. So we generate these 2D maps in real time, many times per second. And by the way, it's not just about the 2D map, not just about the presence or absence of obstacles; it's also about what those obstacles are. So we also do object recognition: we want to label those obstacles too. It's not just "okay, there is something there"; we want to know if it is another car or a pedestrian or a tree or whatever. Then we take that map, we pack it, we serialize it, and we inject it into our GNU Radio transceiver, which, by the way, is an open-source implementation by Bastian Bloessl. I think 99% of you know Bastian very well. So we took Bastian's implementation of the 802.11p transceiver and integrated it into ERA. This is where we are using GNU Radio. And it is interesting to mention that we have two disparate worlds coexisting here, GNU Radio on one side and ROS/Gazebo on the other, so we had to build a ROS to GNU Radio interface to allow these two worlds to coexist. I think that was something pretty interesting. And finally, we also receive other maps from other cars through the GNU Radio receiver. So what we want to do is unpack those maps and merge them, fusing the locally generated one with the remotely received ones, and produce the final version. This happens several times per second, by the way; it's very CPU-intensive.
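The pack-and-serialize step before injecting a map into the transceiver can be sketched like this. ERA's actual wire format is not shown in the talk, so the header layout, field names, and the use of zlib here are purely my assumptions:

```python
import struct
import zlib

def pack_map(car_id, grid):
    """Serialize an occupancy grid for V2V broadcast: a fixed 8-byte
    header (car id, rows, cols, network byte order) followed by the
    zlib-compressed cell bytes."""
    rows, cols = len(grid), len(grid[0])
    cells = bytes(cell for row in grid for cell in row)
    return struct.pack("!IHH", car_id, rows, cols) + zlib.compress(cells)

def unpack_map(payload):
    """Inverse of pack_map: recover the sender id and the grid."""
    car_id, rows, cols = struct.unpack("!IHH", payload[:8])
    cells = zlib.decompress(payload[8:])
    return car_id, [[cells[r * cols + c] for c in range(cols)]
                    for r in range(rows)]
```

Compression matters here because occupancy grids are mostly long runs of identical cells, which deflate very well, and the over-the-air budget of 802.11p frames is limited.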
There are two ways to execute ERA today. One is a two-computer setup, where you deploy ERA on two different computers and you have over-the-air communication between them. For that you need, of course, USRP devices. The other, easier way to execute ERA is in standalone mode, where the two instances of ERA run on the same physical computer. The communication between them then goes over a regular network socket, but functionality-wise this is the same as having over-the-air communication, so this is probably the easiest setup to start with. Now let me go into some characterization that we conducted on the transceiver of ERA and some optimization opportunities that we have identified that may be of interest. As I said before, ERA has several components; I will now be focusing on that box there, the vehicle-to-vehicle communication part, which is actually the 802.11p GNU Radio transceiver that we took from Bastian. So, when you take a piece of software that you don't know and you want to identify acceleration opportunities, what is the first thing you usually do? Well, you characterize the application on a well-known system. So we took this transceiver, we executed it on some machine, and we started measuring the number of computation cycles taken by different parts of, in this case, this flow graph. What we observed in this very first-pass performance analysis is that there are two GNU Radio functions consuming most of the CPU cycles. These are the complex exponential floating-point function, which takes more than 30% of the execution time in the current implementation, and the computation of the Viterbi butterfly, which is a little more than 10% of the overall execution time. So we said, okay, we have two candidates to optimize; let's start with these two.
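The characterization step above can be sketched in a few lines. This is not ERA's actual tooling (on the real target they measure CPU cycles and profile the whole flow graph); it is just a hypothetical illustration of averaging the cost of one hot function over many calls, using wall-clock time:

```python
import cmath
import time

def avg_ns_per_call(fn, arg, calls=100_000):
    """First-pass characterization: average wall-clock nanoseconds per
    call of fn(arg). On a real target you would read the CPU cycle
    counter instead, and profile every block in the flow graph."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        fn(arg)
    return (time.perf_counter_ns() - start) / calls

# The kind of hot function found in the transceiver: the complex
# exponential, called once per sample in several receiver blocks.
print(avg_ns_per_call(cmath.exp, 0.5j))
```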
And by the way, what are the parts in this flow graph that make use of those two functions? We identified these four blocks: packet decoding, the OFDM equalizer, and synchronization, long and short. These are the ones highlighted in red, the blocks that make use of these two functions most of the time. And actually, all four of them belong to the receiver, which means that the receiver is clearly more critical than the transmitter. Okay, so we started with these two functions: let's see how we can accelerate them and how much benefit we can get in return. The first thing we did was define a baseline CPU, which in this case is a general-purpose ARM Cortex-A53 core, very, very basic stuff. And we measured, just for one of these two functions, the complex exponential, how many cycles it takes on average to call that function. In this case, it was around 37 CPU cycles on average per execution of the complex exponential function. Okay, so we moved forward. We said, let's design our own preliminary acceleration engine for this function. The way this function works is actually very simple: the computation of the complex exponential is the computation of a real exponential multiplied by these cosine and sine elements there. So we said, let's design a dual-path pipeline, where one part of the pipeline, the bottom one, computes this piece, and the top one computes the cosine and sine and puts them together; at the end, we multiply to generate the final result. This was the very first version of our accelerator, a very simple one. We had some issues to deal with, for example an imbalance between these two paths, which we managed to fix.
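The identity behind the dual-path split is Euler's formula, e^(a+jb) = e^a (cos b + j sin b). A small sketch showing the same decomposition the pipeline implements (the function name and structure are mine, not the accelerator's RTL):

```python
import cmath
import math

def cexp(z):
    """Complex exponential split the way the dual-path pipeline splits
    it: one path computes the real exponential e^Re(z), the other
    computes cos(Im(z)) and sin(Im(z)); a final multiply combines them."""
    magnitude = math.exp(z.real)                         # one path
    phase = complex(math.cos(z.imag), math.sin(z.imag))  # other path
    return magnitude * phase                             # final multiply
```

Since the two paths have different latencies in hardware, this is also where the path-imbalance issue mentioned above shows up: the shorter path has to be delayed so both operands reach the final multiplier in the same cycle.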
So we started measuring the execution time of this function using that accelerator, and as expected, we managed to go down from 37 to around seven CPU cycles per call, which was a huge improvement. Of course, since this accelerator is, let's say, outside of the CPU, we start having some memory-copy overhead: we have to move the input data from the CPU into the accelerator and then move the result back into the CPU. So this was our first, not very optimized, version of the accelerator. Then we decided to try something else relatively straightforward: a vectorized version of this function. And voila, the vectorized version took even fewer cycles than the accelerator, around five or six cycles. That was actually very interesting. We said, okay, we don't have to move data, because this happens on the same CPU and the same core, and it still performs better than our accelerator. But our accelerator was not very well optimized at this point, so we started looking into some more aggressive optimizations. What we realized is that, in this case, the accelerator was running at 100 MHz. So, being conservative, how much can we increase that frequency? Well, we can go up to 300 MHz without violating timing and things of that sort, which gives us a three-times speedup. Plus, on the Xilinx board that we are using, we can have up to four copies of this accelerator running in parallel, so that's another four-times speedup. And most importantly, we believe we can eliminate the memory-copy overhead, because that overhead is mostly due to the way the virtual memory of the GNU Radio buffers is mapped to physical memory.
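The back-of-envelope arithmetic behind the claimed improvement, using the numbers from the talk, can be written down explicitly. The optimistic assumption, stated in the talk, is that the memory-copy overhead baked into the measured figure can be removed entirely:

```python
def projected_cycles_per_call(measured_cycles, freq_speedup, copies):
    """Projection from the talk's numbers: ~7 cycles/call measured at
    100 MHz, a 3x clock increase (300 MHz), and 4 parallel accelerator
    copies, assuming the memory-copy overhead goes away completely."""
    return measured_cycles / (freq_speedup * copies)

print(projected_cycles_per_call(7, 3, 4))  # about 0.58, i.e. on the order of one cycle per call
```

That is where the "around one cycle per call" figure for the fully optimized accelerator comes from.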
That mapping is changing all the time in GNU Radio, and therefore it forces us to keep copying data from one physical location to the new physical location the buffer is currently mapped to, because the buffer is actually rotating. If we could get into GNU Radio and fix that, then we could, ideally, eliminate that extra memory-copy overhead, and the result would be that each call of this function, using the fully optimized version of the accelerator, takes around one cycle. A significant performance improvement, right? Now let me very quickly, because I am arriving at the end of my time, tell you about our plans for ERA, because we need help, right? And we think this is very exciting: connected autonomous cars. Who doesn't want to work on connected autonomous cars? Don't answer. So the current version of ERA is version 2, available on GitHub as I said, the one that supports the two-computer setup. We are going toward version 3 of ERA, where we will replace Gazebo with a more realistic automotive simulator or emulator, for example CarSim or CARLA; this is something we are still trying to decide. But more importantly, this part of the diagram here, layer one and layer two, is what we have today in terms of software platforms for automotive. We have very good world simulators: CarSim, and well, Gazebo can be used for automotive too; CARLA and LG SVL are some examples. We have very good automotive platforms to implement perception, planning, and control in simulated environments or in real cars. But we don't have that piece there that enables vehicle-to-vehicle communication in this existing software ecosystem. This is what we want to provide with ERA version 3: we want to create the missing part here, which can interact with layer two and layer one.
And a user would get, in a very similar manner, support for vehicle-to-vehicle communication regardless of whether she or he is using, let's say, CarSim plus Apollo or CARLA plus Autoware. So this is what we have in mind for ERA version 3. ERA will be solely intended to enable cooperative automotive support, with DSRC currently supported and eventually, in the future, 5G. This is what will make ERA unique, and we need help with that. So, to wrap up: I think we all agree that the era of domain-specific heterogeneous SoCs is here, based on some of the talks we have already seen. This is because we need to significantly improve performance and throughput as well as power efficiency. DARPA understands that very well, and this is why one of its programs, DSSoC, is about heterogeneous SoCs. Under DSSoC, we are developing this open-source application called ERA for multi-vehicle cooperative perception, which includes, as I said, local sensing plus vehicle-to-vehicle communication in the same application. In other words, ROS, the Robot Operating System, and GNU Radio can coexist, at least based on our experience. DSRC plays a critical role for vehicle-to-vehicle communication. If we want a world where cars can interact with each other and swarm to exchange not only raw data but also, let's say, knowledge and experience, then we have to focus on how those cars will communicate in real time, with high throughput and low latency, and we have to put the focus on how we accelerate that. And finally, as I said, we want to turn ERA into a benchmark for cooperative mobility that can be easily plugged into existing automotive platforms. So if you want to collaborate, please reach out to us, to me, and check out ERA on GitHub too. That's it. I don't know if there are questions.
So, the first part of the question was whether we are considering Autopilot when we talk about ERA version 3; I think the question is, in this figure, could you also have a box that is Autopilot, in addition to Apollo or Autoware, right? Well, the way we want to build ERA version 3 is independent of anything else. We want to define APIs that can eventually work with any other infrastructure, let's say Apollo, Autoware, Autopilot, or whatever, so it should be extensible in that regard. So yes, the answer is yes, in the future we can also include Autopilot in this diagram. And the other part was about the communication protocol between cars. That is actually a very key question, and something we are investigating as part of this project: how do cars communicate with each other? For now, we just take these maps, we compress them, we serialize them, and we exchange them in a very simple manner. But we need to define a protocol, and maybe one option, yeah. I've got one really quick question: what are the buffers you're copying in and out of your accelerator? You mean the buffers in between GNU Radio blocks? How big is the GNU Radio buffer feeding and taking data from your accelerator? I don't remember the number, let me check it. Yeah, I'm curious. Yeah, that has changed many times already, and it depends on how it's configured. Yes, let me check that number. I think there is one more. Kilobytes? It is on the order of kilobytes. Okay, kilobytes. Yes, security is key here. Actually, something I didn't mention: we are also moving from ROS 1 to ROS 2, because ROS 2 provides some security features that we will leverage in ERA version 3. But we believe that, in general, swarming can help us identify, let's say, adversarial cars in your swarm, right?
For example, by running some consensus mechanism in real time, the cars can say, well, that guy there is rogue, so we should probably take him out of our swarm right now. But it's a very big topic. DARPA is very interested in that, and IBM also, so we are looking into it. I don't have an answer yet, but swarming may well be one of the ways to do it.