Thank you very much. Thanks, Genghis. OK, thanks everyone for being here today. Can you all hear me OK in the back? Cool. All right.

So my name is Neil Tan. I'm here to answer five questions about uTensor: who, what, how, why, and where.

Let's start with who. We are software engineers, data scientists, physicists, and researchers, just a small group of people who share two common interests: machine learning and embedded systems.

So what are we trying to do here? Essentially, we are trying to make TensorFlow inference run on the smallest devices possible, in this case, microcontrollers. I'm showing this particular microcontroller here as an example, because it's the board we'll be using to show you a small demo after this. This microcontroller runs at 100 MHz and has much less than one megabyte of RAM, 320 KB in this case, which is actually quite large for a microcontroller. But it drives a touch screen and everything, so cut it some slack. It has flash memory too, but the most important thing here is power. Some of these microcontrollers are meant to run on a coin cell battery for months. So compared to running things on a PC, GPU, or the cloud, I personally think power consumption is probably one of the most important differences.

Now, let's look at a brief demo. What you see here, beside the beer and the banana, is the microcontroller I was showing, and that's its touch screen. We draw something on it, press a button, and see if it comes back with the correct result. That's me writing digits, together with another developer on this project. You can just barely see that I drew a three and it returned a three as well. We only had so much time to make this demo, so we didn't even have time to figure out how to print the text bigger. But if you squint, you can see it's a three, not an eight. There's a link on the slides if you want to check out the demo.

This is the neural network we've been running. It's really basic, just a multi-layer perceptron with fully connected layers, trained on the MNIST data set. The lowest memory footprint at which we could fit it on the microcontroller is about 256 KB, but if you do the math, the network itself takes less than 100 KB; the rest goes to the touch screen and the operating system, which also take RAM. So there's a lot of room for optimization here.

Now let's talk about how we do this. There are a few techniques in use. The first one is quite important: quantization. As you know, most TensorFlow graphs are in floating point, where each value takes 32 bits. But Pete Warden, if you Google him, has a really interesting blog post on how to convert floating point into fixed point, which takes you from 32 bits down to 8 bits. It turns out that for inference, this is pretty much all you need, and as a result we save about 75% of the memory. Another major difference is that most machine learning frameworks running on the cloud or a PC have the luxury of memory, so a lot of unused tensors, buffers, and data are kept in memory even when they're not needed for computation.
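To make the quantization idea concrete, here is a minimal C++ sketch of the min/max affine scheme described in Pete Warden's blog. The names and structure are illustrative only; this is not uTensor's actual code.

```cpp
#include <cstdint>
#include <cstdio>
#include <algorithm>

// Map a float in [min, max] onto the full uint8 range [0, 255].
// This is the min/max affine scheme; a sketch, not uTensor's code.
uint8_t quantize(float x, float min, float max) {
    float scale = (max - min) / 255.0f;
    int q = static_cast<int>((x - min) / scale + 0.5f);  // round to nearest
    return static_cast<uint8_t>(std::min(255, std::max(0, q)));
}

// Recover an approximate float from the 8-bit code.
float dequantize(uint8_t q, float min, float max) {
    float scale = (max - min) / 255.0f;
    return min + q * scale;
}

int main() {
    float min = -1.0f, max = 1.0f;
    float w = 0.4f;                     // a hypothetical weight value
    uint8_t q = quantize(w, min, max);  // 8 bits instead of 32
    printf("%f -> %u -> %f\n", w, static_cast<unsigned>(q),
           dequantize(q, min, max));
}
```

Each stored value now takes one byte instead of four, which is where the roughly 75% memory saving comes from.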
But what we have done in uTensor is use dynamic tensor allocation: we work out exactly which tensors are needed, load them from storage, and evict them from memory as soon as they are no longer used, to guarantee maximum room for computation. And because the target devices are so tiny, you really want to use the minimum amount of code and memory possible. That includes avoiding dedicated functions to parse the graph and do other things at runtime. So what we did is take the TensorFlow graph and convert it directly into C++ code; we'll talk about that a little more on the next slide. There's no dependency on math libraries such as Eigen; we pretty much rewrote many of the mathematical routines to guarantee the smallest memory footprint possible, or at least that's what we're aiming for right now. We use Mbed for rapid prototyping on the MCU. If you've never heard of Mbed, I encourage you to Google it; it's pretty awesome for developing on microcontrollers.

So this is the workflow. You first collect the data, feed it into TensorFlow, build the network of your choice, and train your model. That gives you a protocol buffer file, the graph. uTensor takes the graph and generates C++ code, and it also has a runtime library that you include and import into Mbed. Then you compile, flash, and run on the microcontroller. That's basically the flow.

Just to reiterate, this is the actual command I ran while making the README for the project. You give a graph to the uTensor CLI, the command line interface. This little program extracts the parameters from the graph, which are your weights, and saves them somewhere for you to import onto your device. It also generates the C++ files you need to do inference on the microcontroller.

And this is roughly the resulting binary in terms of size. The whole pie chart here is about 300 KB for the binary. But this is not the most up-to-date result; as a matter of fact, just a few days ago one of the other developers, Michael, based in Austin, managed to shave about 66% off the original 300 KB, so the whole thing is now a bit over 100 KB, with most of it coming from this portion. But the idea is that you can see exactly which part is uTensor: you have the uTensor library and the generated graph. For the MLP network I presented earlier, that part is about 26 to 30 KB, so that's roughly the size you can expect. There's further optimization to be done here, but this is the current state of the work.

As for the design, uTensor has three classes; let's talk about two of them first. The Operator class is pretty much what you'd think of as convolution, matrix multiplication, add, subtract, and so on. It takes multiple tensor objects and outputs to multiple tensor objects, basically its inputs and outputs. Right now we implement C reference implementations, but in the future, hopefully as you or other developers get interested in this, there could be SIMD implementations, or SPI if you have an external accelerator for your computation. We have no idea what kind of hardware will be out there, so we wanted to make this future proof by defining a good interface to whatever implementation we're going to face in the future.
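As a rough illustration of that design, here is a minimal sketch of what an operator interface with swappable backends could look like. The class and method names are hypothetical, not uTensor's actual API.

```cpp
#include <vector>
#include <utility>
#include <cstddef>

// Hypothetical tensor handle: a shape plus a raw data pointer.
// uTensor's real classes differ; this only illustrates the idea.
struct Tensor {
    std::vector<size_t> shape;
    void* data;
};

// An operator consumes and produces multiple tensors. Concrete
// subclasses supply the kernel: a portable C reference version today,
// possibly SIMD or an SPI-attached accelerator later, all behind
// the same interface so the generated graph code never changes.
class Operator {
public:
    virtual ~Operator() = default;
    void setInputs(std::vector<Tensor*> in)   { inputs_  = std::move(in); }
    void setOutputs(std::vector<Tensor*> out) { outputs_ = std::move(out); }
    virtual void compute() = 0;  // run the kernel: inputs -> outputs
protected:
    std::vector<Tensor*> inputs_;
    std::vector<Tensor*> outputs_;
};

// Plain C-style reference matrix multiply, the baseline backend.
class MatMulRef : public Operator {
public:
    void compute() override {
        const float* A = static_cast<const float*>(inputs_[0]->data);
        const float* B = static_cast<const float*>(inputs_[1]->data);
        float* C = static_cast<float*>(outputs_[0]->data);
        size_t M = inputs_[0]->shape[0], K = inputs_[0]->shape[1];
        size_t N = inputs_[1]->shape[1];
        for (size_t i = 0; i < M; ++i)
            for (size_t j = 0; j < N; ++j) {
                float acc = 0.0f;
                for (size_t k = 0; k < K; ++k)
                    acc += A[i * K + k] * B[k * N + j];
                C[i * N + j] = acc;
            }
    }
};
```

The point of the split is that the generated graph code only ever talks to the operator interface, so a faster kernel can be dropped in without regenerating anything.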
You could also implement a network interface here, which could potentially enable distributed computation. For the tensor class, right now we only implement the RAM tensor. And Kazami, sitting right there, another developer we have, is implementing, or has already implemented, a paging system, which allows you to page tensors in and out of, say, the flash memory on the microcontroller. The idea is to let you run a deep learning model of any size, with the trade-off being computation time.

Let's talk about why we're doing this in the first place. So this is a graph; it's not to scale, it's just how I have things organized in my head. On the horizontal axis we have cost. Looking at the different platforms, you have cloud, GPU, CPU, and application processors; the further you go toward the cloud, the more costly it gets, but at the same time, on the vertical axis, you get more performance. What we've seen is that most machine learning, training is a bit different, but certainly inference, has been relying on the right-hand side of this slide: cloud, GPU, CPU, and application processors. There has been a real barrier to moving inference from all that good stuff onto the MCU.

That's probably due to tooling. Back in the day, we didn't really know how to move machine learning onto these devices because there were no dedicated tools in place for us to do so. In fact, when we were writing uTensor, we had to build a few tools ourselves to help with this. The algorithms, quantization and so on, weren't really in place at the time either; most machine learning frameworks are optimized for accuracy, not for size. And then there's speed: back in the day, MCUs were not as fast as they are today. MCUs are faster now, so it makes sense to do inference on the MCU.

Some of the applications of on-device AI include sensor fusion, where you fuse multiple sensors together. And for IoT devices, because the bandwidth is typically low, you cannot send the full image or all your data back to the server, so it makes sense to do inference on the device and just send the result back to the cloud. MCUs also typically have long standby times because of their low power consumption, and lower cost. We believe all these points will enable new use cases in the future.

OK, this is the last slide, which is pretty much the where. As you can see, the GitHub address is here. If you're interested, please check it out. We are currently just a bunch of embedded developers and data scientists, but we welcome all disciplines to join this project or to use it. PRs are welcome. Thanks, guys.

Thank you very much, Neil. We have plenty of time for questions. Any questions? Anyone?

OK, simple question. Do you train your model in floating point and then transpose it to 8 bits, or do you need to start your training from the assumption that you will use integer numbers all the way down? Is the training done in plain TensorFlow, or do you need to adapt TensorFlow for training?

OK, that's a very good question. We do all the training in TensorFlow. If you train any graph with TensorFlow in the regular way, TensorFlow actually provides scripts for you to do the graph freezing and quantization. When you do the quantization using TensorFlow's tools, a bunch of quantization operators get inserted into your original graph, and what we did is also port those quantization operators to uTensor; that's why they're compatible. But yes, training is done in floating point, and there are tools to convert everything from floating point to fixed point to run on the MCU.
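To give a feel for what those inserted operators end up computing, here is a hedged C++ sketch of the common 8-bit inference pattern: uint8 operands, int32 accumulation, with a rescale afterwards. It mirrors the general technique, not uTensor's exact kernels.

```cpp
#include <cstdint>
#include <cstddef>

// One output element of a quantized matrix multiply: uint8 inputs,
// int32 accumulation. zero_a / zero_b are the quantized codes that
// represent the real value 0.0f. A sketch of the general technique,
// not uTensor's actual kernel code.
int32_t quantized_dot(const uint8_t* a, const uint8_t* b, size_t k,
                      int32_t zero_a, int32_t zero_b) {
    int32_t acc = 0;
    for (size_t i = 0; i < k; ++i) {
        acc += (static_cast<int32_t>(a[i]) - zero_a) *
               (static_cast<int32_t>(b[i]) - zero_b);
    }
    return acc;  // later rescaled by scale_a * scale_b into the output range
}
```

The float scales only come back in when the 32-bit accumulator is rescaled into the 8-bit output range, so the inner loop stays in pure integer arithmetic, which is what makes this affordable on an MCU.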
Next question? Do you plan to support more complex networks, like convolutional neural networks, that kind of thing?

Yes, but we are all doing this in our spare time right now. We have defined interfaces which will hopefully make it easy enough for people who want to implement their own operators; you can just take the code and start implementing. As time goes by, we expect to see more and more operators being implemented.

And last question: have you seen TF Lite by Google? It was released recently.

Yes. To be honest, I think it's a really, really cool idea, but I haven't spent personal time exploring it. It's definitely one area we should explore. TensorFlow Lite is probably not currently supporting MCUs, though, so that could be one of the major differences at the moment. OK, thank you.

What is your most constrained resource for your operations? Is it CPU, RAM, flash?

OK, that's a very interesting question. When we were designing this, we were assuming the use cases might not be so time-sensitive: it's going to be near real time, but it's not going to give you 20 frames per second or something, right? Keeping that in mind, I would say it's RAM. Fully connected and convolution layers take a lot of memory, and basically the more RAM you have, the less power it consumes and the faster it runs. So personally, I think that's one of the key things.

And when growing your networks, will RAM again be your major scaling limit? You mean if you want to make a bigger network? Yeah. Yes, I would say so, especially in the fully connected layers; the matrix multiplication is just so large. But as I was mentioning, Kazami has implemented this virtual memory, or paging, feature, hopefully to make things a bit easier. There's always a trade-off in design, but that's what we're trying to do. Thank you.

Any more questions? OK, thank you very much, Neil. All right. Thanks, guys.