Hello everybody, I'm Andrea Gallo, Vice President for Segments and Strategic Initiatives at Linaro, and I will be talking to you about accelerating deep learning neural networks at the edge. I'm not going to give a training on neural networks or machine learning, I hope I'm not disappointing anybody, and I'm not going to give a training on edge computing; rather, I will talk about the implications of connecting the two worlds. Let me start with the disclaimer: all the information in this talk is public. I'm not breaking any confidentiality agreement with any of the members or customers of Linaro. I have tried to add a URL to the original source of the information in almost every slide, and feel free to contact me at the end if you need any further pointer.

We are focusing on deep learning. Deep learning is that subset of machine learning, and that subset of artificial intelligence, that focuses on recognizing objects or patterns: recognizing a dog or a hat in a picture, recognizing words or sounds in an audio file, recognizing text, road signs, or objects, or, if you think of an industrial use case, recognizing a screw or a bolt. This is really complex. It's all about a lot of complex algorithms that can detect how many edges, how many objects, how many blocks there are in a picture or an audio file, and then applying a lot of algorithms in a sequence, one after the other, doing a lot of really complex mathematical operations: matrix operations, convolutions. And then you come up with some probability that that may be a bird, that may be the beak of a bird, or maybe it's a cat (a toy sketch of this idea follows below). You can go to the link at the bottom of the page; that's a really great training provided by the Caffe team.

All this machine learning and deep learning has been booming over recent years. 2015 is really the year that we should look at. 2015 is the year when TensorFlow, PyTorch, Caffe and MXNet all had their first commits on their GitHub repositories. This is really thanks to the increased compute capabilities of the CPUs on the server side. This is thanks to the increased compute capabilities of the GPUs. And this is thanks to the cloud, because every time we upload a picture to our favorite social network and tag our friends, we are triggering machine learning training algorithms that use all our pictures to learn how to better detect the objects we are tagging. So the cloud is one of the enablers. And all these frameworks came up in 2015 thanks to these compute capabilities.

With this come constraints that we should be aware of, that you are already aware of. You have to be always online, because you are sending information to the cloud where all this machine learning magic happens, and then you get the results back. So you have to be online. There may be constraints on uplink bandwidth, speed, and the amount of data that you can upload to the cloud. There can be concerns and constraints about latency and real time: you may need to detect a bolt, or detect liquid spilling on a production line in an industrial use case, in real time, and the cloud may not be fast enough. There may be concerns about privacy, about uploading sensitive information, sensitive images, to a cloud somewhere rather than keeping it in your offices. So this is where the edge computing concept comes into the picture. This is where you deploy processing algorithms, payloads, workloads, and machine learning algorithms at the edge, right there next to where the data is captured from the sensors.
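To make the idea of "a sequence of matrix operations ending in probabilities" concrete, here is a minimal, purely illustrative sketch in NumPy. The layer sizes, weights and class names are made up for illustration; a real network has many more layers, convolutions, and learned weights rather than random ones.

```python
import numpy as np

def relu(x):
    # Simple non-linearity applied between layers
    return np.maximum(0, x)

def softmax(x):
    # Turn raw scores into probabilities that sum to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Pretend these are features extracted by earlier convolution layers (made up)
features = np.random.rand(128)

# Two toy fully connected layers with made-up random weights
w1, b1 = np.random.randn(64, 128), np.zeros(64)
w2, b2 = np.random.randn(3, 64), np.zeros(3)

hidden = relu(w1 @ features + b1)
scores = w2 @ hidden + b2
probabilities = softmax(scores)

# e.g. probabilities for the hypothetical classes ["bird", "cat", "dog"]
print(dict(zip(["bird", "cat", "dog"], probabilities.round(3))))
```

Inference at the edge is essentially this kind of arithmetic repeated millions of times, which is why the accelerators discussed later focus on multiply-accumulate throughput.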
So I'm trying to set the ground. I'm not going any deeper into the machine learning algorithms themselves, or into the concept of edge computing itself; there are plenty of interesting talks here at the summit. When it comes to the edge, this is happening already, nowadays, thanks to the latest application processors in all modern smartphones. All these application processors have some sort of neural network processing capability. And the next step is when these capabilities get down to the tiny microcontrollers. That is the second huge wave of the revolution of bringing machine learning to the edge: not just in your phones, but in edge devices built on low-cost application processors or tiny microcontrollers. That enables a real disruption of machine learning. And it's not just the microcontrollers and the application processors; there is a huge ecosystem of companies that are designing IPs and acceleration blocks to perform neural network processing. So on one side you have a lot of machine learning frameworks, and we will look into a couple. And there are a lot of neural network solutions that accelerate the processing, that make the processing possible in a tiny system on chip. And so the challenge is how to connect the two worlds. In the rest of the talk I will try and dive into the complexities and the challenges of doing machine learning at the edge.

Let's start from the frameworks. The first one, I'm sure the best known, is TensorFlow. This is developed by Google, in-house, by the Google Brain team. It actually started in 2011 as project DistBelief, and then in 2015 it evolved into TensorFlow. And it took two years to get to the first stable release; well, actually November 2015 to February 2017, so really about a year and a half. TensorFlow is actually three projects in one. TensorFlow can be built directly, natively, for the cloud and the data center. TensorFlow can be built as TensorFlow Lite for mobile devices. And TensorFlow can be built as TensorFlow.js, to have machine learning in your browser. And across all these variants, the TensorFlow framework has a sort of app store. If you visit the TensorFlow GitHub, you will find a lot of examples, documentation, training material, and a lot of models. The models are those algorithms that do object detection, speech recognition; there are tons of models. And on the GitHub you can find a lot of available models that you can reuse. So it's kind of an app store for the machine learning side.

TensorFlow supports all sorts of acceleration: the GPU in your data center or on your desktop, the hardware acceleration in Android using the NNAPI, the GPU underneath your browser with WebGL, and the TPU. The TPU is Google's own hardware accelerator for TensorFlow; I'm sure you all know it.

I like using Open Hub. Open Hub is a website supported by Black Duck; I use it to compare open source projects, and I really like it. Looking at TensorFlow, we can see that over time, since the first commit in November 2015, it has had almost 32,000 commits and 1.6 million lines of code, which is equivalent to an estimated accumulated effort of 456 years of development. That's pretty impressive.

Now pay attention to one key point in this chart. When you move from the TensorFlow that you built for the data center and you want to reuse those models in TensorFlow Lite on your smartphone, you need to use a converter. This is because TensorFlow uses protocol buffers and TensorFlow Lite uses FlatBuffers. These are different formats for packaging the buffers and all the information that is part of the graph.
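As a concrete illustration of that conversion step, here is a minimal sketch using the TensorFlow Lite converter from the TensorFlow Python API. The model path is a placeholder, and the exact module names have moved around between TensorFlow releases, so treat this as indicative rather than exact.

```python
import tensorflow as tf

# "path/to/saved_model" is a placeholder for a model trained with full TensorFlow
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")

# Convert the protocol-buffer-based graph into the FlatBuffer format used by TensorFlow Lite
tflite_model = converter.convert()

# The resulting .tflite file is what gets deployed to the phone or edge device
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```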
FlatBuffers make it easier to browse and jump from one point to another in the buffer, rather than parsing the entire buffer, so they are faster and leaner for mobile devices. But you need a converter.

I went into GitHub and looked at the first commit from 2015. This was by Manjunath Kudlur, half past midnight on the 7th of November. This guy was working really hard: his first commit was almost 2,000 files, 344,000 lines of code. So that is when he started moving from project DistBelief into TensorFlow. He was working really hard. And I found this picture; this is Manjunath from the Google AI web page, specialist in artificial intelligence. So, well done, Manjunath. By the way, one year ago he left Google. According to LinkedIn, he is now working for the startup Cerebras, still in stealth mode, but from the name and from Manjunath's background, I bet whatever they come up with will be related to machine learning.

Caffe is another well-known machine learning framework. Caffe started as a PhD project by Yangqing Jia at the Berkeley AI Research lab. Again, it supports CPU and GPU acceleration. It works well for detection of sequences, speech, text, vision. And it has its own app store, the Caffe Model Zoo, where you can find tools, reference models, demos, documentation, recipes. So it's another ecosystem. It's a much smaller project: 76,000 lines of code, 19 years of effort. It's much, much smaller. The interesting thing is that in 2015, so, again, 2015, Caffe became Caffe2, again driven by the same author. And the claim is that Caffe2 is even better at distributing across large-scale data centers and even better at supporting mobile devices. Yet there's a different app store for the models, the Caffe2 Model Zoo, and there's a different format, and when you move from Caffe to Caffe2, you need a converter.

So we're looking at TensorFlow from Google and Caffe2, which is now endorsed by Facebook, and each of them has its own app store, and they even need tools to move from one variant of the project to the other. Caffe2 is much bigger than Caffe; you can see it from the stats, 275,000 lines of code, which is still about eight times smaller than TensorFlow anyway. The first commit to Caffe2 was by Yangqing, the same as for Caffe, in June 2015 at 11.26 pm. So Yangqing was working really hard as well: 51,000 lines of code added. So again, you see the evolution from Caffe to Caffe2, and again, really hard work. Guess what? Yangqing is director of the Facebook AI infrastructure. Pretty impressive, and very well done.

Recently, last March, a change started. The Caffe2 project and the PyTorch project, both machine learning frameworks endorsed by Facebook, started a merge. You have the link both in the title and at the bottom of the chart, and you can find more information. But you can see it from this chart: this graph is showing the commits to the projects, and you see that from March 2018, and you can look into GitHub, all the commits to Caffe2 have been redirected. So you see variants, you see changes, and you see that things are dynamic and very fluid. It's really complex.

Instead of digging into PyTorch, I'd like to look at the last one in this talk, and that is MXNet. MXNet is endorsed by Amazon Web Services, AWS. MXNet is another machine learning framework. Again, it can scale out to the data center to a large extent, of course, and it works well on mobile. It started as a university effort.
It comes from Carnegie Mellon, supported also by MIT, the University of Hong Kong, the University of Washington, and backed by many impressive companies: Intel, Baidu, Microsoft. MXNet has its own app store, again, or model store, I should say: this is the Gluon model zoo. The first commit was in April 2015. So you see, 2015 is a recurring theme here: TensorFlow, Caffe2, MXNet, all in 2015. Then we can debate which one was first, but they are all in 2015; it's an important year. The first commit was done by Mu Li at half past four in the afternoon, so normal working hours, not as hard-working that day as his peers. Zero lines of code added. Three files. Three files! And you can find the files on GitHub: README, LICENSE, .gitignore. Well, you need to start the right way, right? Well organized. I will not disappoint you: Mu Li did a great job. He is the top contributor to MXNet, you can see it from the stats. I was just teasing, but I am impressed by all these guys. And this is Mu Li. Guess what? Principal scientist at Amazon. I'm impressed. These guys are all bright minds.

If I just do a quick comparison, let's look only at the estimated cost line. Again, this is coming from Open Hub. The estimated cost is huge. The lowest is 4.6 million dollars; 4.6 million dollars is the smallest cost. This is the estimated cost of the accumulated development, looking at the Git history and the complexity of the code. And TensorFlow is almost 30 million. This is huge. I will not spend much time on these charts; you can look at them from the links on the page. But this is showing the growth of the source code base, and the green step for PyTorch is what we were referring to for March, when Caffe2 started being merged into PyTorch. Then we can see it from the GitHub history. GitHub is the truth. This is showing the number of commits per month, and so it represents how lively the activity of an open source project is.

So let's try and make some observations. Every cloud player has its own deep learning framework, every big cloud player. Google has TensorFlow. Facebook has Caffe2 and PyTorch. Amazon has MXNet. If we look at the Asian side, Baidu has PaddlePaddle, and the other big Chinese players have their own frameworks. So every big cloud vendor has their own preferred and endorsed machine learning framework. And every framework is an entire ecosystem in itself. Every framework implies a model store, implies different training, different makefiles, different tools, different formats. So when you join and you choose one framework, there's a significant investment. It's not a lock-in; everything here is open source. If you look at the stats that are provided, every framework is actually available in source code under an open source license. But the fact is that the complexity makes it really hard to move from one to the other. And last but not least, the most important point for the rest of this talk is that every framework is focusing a lot on offloading the CPU and on acceleration. If you really want a cool job and a cool job title, do invent a great new machine learning framework; you will be director at one of the cloud vendors, maybe. Yes, and Microsoft, with Azure, has CNTK.

Let's now move to the other side, the acceleration side. On the cloud side I mentioned TensorFlow first, and I mentioned the TPU, their acceleration for the cloud. So the first thing I will mention to you is that, at the edge, Google a few months ago announced their Edge TPU.
You can go to the link at the bottom of the page and register, and hopefully you will get some information in the future, because this is all still in preview mode; there is not much available as public information. But from what we can see from the information available, this is a purpose-built ASIC designed by Google to run TensorFlow Lite machine learning models and networks at the edge. So it's optimized for efficiency, for low power, for TensorFlow Lite, actually. It's available as a dev board and as a standalone accelerator. I'm eager to see more public information becoming available.

Moving into the ARM ecosystem, the very first way of accelerating, or offloading the CPU, for machine learning computation is by using the ARM GPU. The latest is the Mali G72. These are screenshots from their web page, so I will not go into the details; I will leave it for you, you have the link. But this is the same GPU that is used to render the graphics of your best games on your mobile, and it is also very well suited for machine learning algorithms.

The next step from the ARM ecosystem is ARM's own machine learning processor. This was announced last February; you may recall the name Project Trillium, announced at the Mobile World Congress. Here, again, public information, you have the link: ARM reused their best knowledge from the microcontroller world, the processor world and the GPU world, and merged it into a custom-built IP. You see that this IP has some blocks that are fixed-function MAC engines and others that are programmable logic. Some neural network processing can be directly hardware-accelerated with these MACs, and other functions will have a software fallback, so there must be some microcontroller in it, running some software. There is local memory, and then you see that the interface to the rest of the world is an external memory system with DMA. It means that this processor can cope with an entire machine learning graph: it can load the weights and the input samples and produce the results all by itself, self-contained. It's a complete machine learning processor. There is a variant that is an object detection processor; this can handle full HD video at 60 frames per second and recognize objects, so I assume one of the use cases can be surveillance cameras for security.

All this is supported by the ARM NN SDK. It's available as source code on GitHub. It can consume models from the major machine learning frameworks, it has a runtime inference engine, and it uses the Compute Library, which is optimized for the ARM cores. It can offload onto the CPU and onto the GPU, and if you look at this diagram, not only onto the Cortex-A application processor and the GPU, but also onto the ARM machine learning processor and onto third-party IPs; this is really important. And if you look at the far left, there is a variant of the Compute Library that is optimized for tiny microcontrollers, and that runs on a Cortex-M CPU.

Moving on, another important player: Qualcomm. This is the diagram of the Qualcomm Neural Processing Engine solution. Here you see in the middle CPU, GPU, DSP, so this is really a heterogeneous computing solution. You see that there is a runtime, and the neural network processing can be distributed as best as possible over the available resources. It can read models from TensorFlow and Caffe, and it uses tools to convert them to an internal optimized format.

Another very important player is HiSilicon. HiSilicon is the silicon division of Huawei.
This is what comes with the latest Huawei P20 phones, where the AI improves the quality of the pictures that you take. This is using the HiAI APIs. There is not much information on the internals, but one of the product managers from Huawei was our guest at our Linaro Connect conference in March 2018 in Hong Kong. You have the link to the video, and he explained some of the internals of the solution. The NPU that you see on the right can accelerate up to 99 operators, and they interface with Caffe, TensorFlow, TensorFlow Lite and Android, and there are converter tools.

I could go on with more application processors, but really the purpose here is to give you an understanding of how things work. And here again are some logos from all the companies and start-ups that are coming up with neural network solutions. You have some IP vendors, GPU vendors; you see VeriSilicon, Imagination, ARM, Synopsys. There are companies providing NPUs, neural processing units, or maybe deep learning accelerators, DLAs: Cambricon, Bitmain. Cerebras is the company, in startup mode, in stealth mode, that Manjunath from Google joined one year ago. There is a lot. So again, how do we do the connection? You see that everyone is using tools to link to the frameworks.

If we make some observations here, there are very different options. Some are complete offload systems all by themselves: a complete machine learning processor, or a neural processing unit, whichever term we use, with its own memory, and you move the entire graph onto it. On the other side, you have distributed accelerators that can offload portions of the processing. I'm not here to judge which solution is best. In one case you offload everything. In the other case, maybe you have more pressure on the memory bandwidth because you are using the same memory, but when you have a software fallback for some of the operators that you are not accelerating in hardware, you use the main host CPU rather than an embedded microcontroller. There are pros and cons to every solution. But each one needs converter tools. And even more, as of today, everyone has their own runtime. I hope you have appreciated, from all the different accelerators that I showed you, that everyone has their own software libraries, their own SDK, their own converter tools. If you look at the machine learning frameworks that we have analyzed and you use the accelerators that we have just gone through, how many forks, maybe fork has a negative connotation, but how many variants, how many runtimes, how many tools for each framework? The total cost of ownership is huge; it's hard to scale. This delays security fixes, this delays rebasing and updates.

So what can we do to improve all this? This is really a call to action. I'm from Linaro, and this is where we believe that at Linaro we can help. Linaro is a collaborative engineering company. We are funded by all the members that you see on this chart. All these companies fund our work with a membership fee and with engineers who become part of our virtual teams. We were founded in 2010 by six founding members; now we have more than 25, and overall we have 300 open source engineers. We lead the collaboration in the ARM open source ecosystem. We work on open source projects. We run the company in profit-neutral mode: we use the funding from all the members to hire the best maintainers and the best tech leads. At the end of the year we shall not make any profit.
Otherwise we would just pay taxes on the money that we receive from our members, and we would be wasting their money. So we run the company profit-neutral: at the end of the year we shall get to zero, and we shall not lose money either. We are judged by our impact on open source projects. You can see some of the projects that we contribute to. The key one is the Linux kernel: over the years we have consistently been in the top three contributors to the kernel. This is the way we are measured: by our contributions to open source projects.

So, when it comes to machine learning, when it comes to accelerating deep learning at the edge: a few weeks ago, at our Linaro Connect in Vancouver, we announced our Linaro Machine Intelligence Initiative. We want to achieve a common model description format and common APIs to the runtime, to the inference engine. And we want to provide an open source runtime inference engine that is optimized for the ARM platforms. This framework shall have a system of plugins so that we can support multiple NPUs, CPUs, GPUs and DSPs from third parties in the ARM ecosystem. And this solution shall run in our LAVA lab. There was an interesting talk by Dave Pigott this morning about LAVA; hopefully it was recorded, so you can find it. It's really important that the boards from our members and the software we are going to build run in a CI loop, and that we consistently verify the performance and accuracy of any optimization that we apply against the expected accuracy of the reference tests and tensors from the frameworks. Imagine a given machine learning framework: in the nominal case you have a cat and it recognizes the cat; then you run it on an embedded platform with some badly designed acceleration, and the cat is not recognized as a cat anymore but becomes a dog. That's a major failure. So it's really important that all the changes go through a CI loop, that we consistently measure the accuracy of the output, and that we can detect in real time if there is any regression.

When we announced this machine learning initiative we had a very supportive statement from Google endorsing the initiative, from Pete Warden of the Google TensorFlow team. And at the same time ARM, as a founding member of this initiative, donated their ARM NN, the one we have just seen that runs on the ARM processors. ARM donated it to this initiative. Up to now the ARM NN framework has been developed inside ARM and released as source code every quarter, as a big code dump to GitHub. So it is source code, but not really an open source project developed in the open. The donation of ARM NN to this machine learning initiative means that, as we speak, we are setting up the Git infrastructure, the Gerrit setup, the CI loops, so that all the work will happen in the open and everybody will be able to access the patches and to provide patches. We are setting up the mailing lists and the IRC channels right now.

For the API and the format, ONNX is the best candidate we are looking at. On one side there is Google's TensorFlow and TensorFlow Lite, which is an API that is really hard not to comply with, one of the de facto standards; and then all the other frameworks, all the other machine learning vendors, are all collaborating together on ONNX.
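To give a feel for what a common exchange format like ONNX means in practice, here is a minimal, hedged sketch of exporting a PyTorch model to ONNX and validating the resulting graph. The model, file names and input shape are arbitrary examples, not anything prescribed by the initiative.

```python
import torch
import torchvision
import onnx

# Any PyTorch model will do; a pretrained ResNet-18 is used here purely as an example
model = torchvision.models.resnet18(pretrained=True)
model.eval()

# A dummy input tells the exporter the shape of the graph's input
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "resnet18.onnx")

# The exported graph can then be loaded and validated by any ONNX-aware tool or runtime
onnx_model = onnx.load("resnet18.onnx")
onnx.checker.check_model(onnx_model)
```

The point is not this particular model, but that the .onnx file is a framework-neutral description that a vendor runtime or converter can consume.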
ONNX stands for Open Neural Network Exchange format, and there is also ONNXIFI as the API. It's supported by Facebook, by the Apache Foundation, by Amazon, Microsoft, Apple; all are supporting this, and there are also some offline converter tools for TensorFlow anyway. So we are seriously looking at ONNX as possibly the format and the API into the frameworks, to reduce the fragmentation, and the ARM NN SDK with its plugin framework is what we see as the common inference engine and the plugin framework for all the accelerators.

All this started last March, so everything is really new for us. We started last March at our Linaro Connect; you see here the links to some of the talks that happened that week. A few weeks ago in Vancouver we had our Linaro Connect again, that was the fall Connect, and here you have the link. This is not an eye test; this is the back of the room. These slides will be available from the conference page as soon as I upload them, so you can look at all the links, and you will see that all the links point to the videos for each of the sessions. And we had a great keynote by Jem Davies that day. Jem Davies is an ARM Fellow and general manager of the machine learning group at ARM, and here, by clicking on the YouTube icon, you will be able to watch the keynote where he announced the machine learning framework. So please stay in touch with us, thank you. We have a couple of minutes if there are any questions, before we all run and have a well-deserved beer or glass of wine.

I just wanted to mention, because I didn't see it on any slide, that Intel has also put some effort into different chips that support machine learning, so this picture is probably even more complicated, to be honest, with the different chips. Thank you.

One last question, or beer? OK, thank you very much.