Hi, without much delay, I'll just get into the talk. What I'm going to talk about today is device-aware computing. My pitch is that when you program a device, and an embedded device in particular, you need to be aware of what your device has in terms of capability and what computations it can do. So why do you need to be aware of your device? You can extract maximum performance, you can avoid unnecessary hardware upgrades, and you can save costs: if you upgrade your hardware, there's going to be a cost of porting. And you can do more with the same computer.

Today's demo will be on the Intel Edison. The reason I chose the Intel Edison is that it's a pretty popular single-board computer. It's used a lot in IoT and image processing, it's used in drone controllers, and it's capable of performing a lot of compute-intensive operations. I've also used it for deep learning. So yeah, it's pretty powerful.

The optimizations I'll teach you today: the first is OpenMP. This will be the only optimization that requires you to add something to the code; the other two are simply compiler flags. I'll show you them one by one.

OpenMP is a multi-threading library, and the device awareness you need here is that the Intel Edison is dual-core and dual-threaded, so it makes sense to use threads to run your code in parallel. The code here is a simple FFT program; I'm just showing you a snippet. The only change you make is this one line. The second you add this pragma, the compiler knows that the for loop here can be run in parallel, so it will try to parallelize it. And since the loop is doing nothing but random number generation, the threads can run asynchronously. This is one way you can extract good performance from your device. At compile time, the only flag you need to add is -fopenmp. And this is not specific to Intel.
You can use this on practically any board that is multi-core or multi-threaded.

The next optimization is SSE, Intel's Streaming SIMD Extensions. Just like ARM has its own SIMD unit called NEON, which does parallel processing, Intel has its own SIMD unit, which you use through SSE. SSE is basically an instruction set, and the compiler tries to auto-vectorize your code for it: it looks at your code at compile time, checks whether there are any loops without branches, and if there are, it vectorizes them on the SIMD unit. So this can be done entirely with compiler flags. I've used SSE, SSE2, SSE3, SSE4.1 and SSE4.2. How do you find out whether your processor has this? On a Linux system you can look at /proc/cpuinfo, which will tell you whether you have an SSE unit or not. The trade-off here is that if a loop contains a branch instruction, it won't be vectorized, so you should make sure your loops don't have branches.

The third one is, I think, pretty well known: GCC compiler flags. Even with compiler flags you can be a little smart and tune the performance to your device. One of the flags I used here is -mtune=atom, which tells the compiler to tune the code for the Atom CPU. You can even specify the architecture, with -march=atom. You should also look at your processor specs, and trial and error plays a big role in which compiler optimizations you end up using. With the Intel Edison, through trial and error I found it's much better at fixed-point operations than floating point, so you can use aggressive loop optimizations here. The FFT I'm going to show you later is a fixed-point FFT.
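To make the branch trade-off above concrete, here is a sketch of the two loop shapes. The functions are illustrative, not from the talk; the GCC flags in the comments are real options:

```c
#include <stddef.h>

/* A straight-line loop body with no branches: with SSE enabled, GCC's
 * auto-vectorizer can turn this into SIMD instructions that multiply
 * several floats per instruction. */
void scale(float *dst, const float *src, float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* An early-exit branch inside the body is the kind of thing that stops
 * the auto-vectorizer, per the trade-off discussed above. */
int first_negative(const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (src[i] < 0.0f) return (int)i;   /* branch: not vectorized */
    return -1;
}

/* Check which SSE levels the CPU reports:
 *   grep -o 'sse[^ ]*' /proc/cpuinfo | sort -u
 * Build with a vectorization report so you can see which loops GCC
 * actually vectorized:
 *   gcc -O2 -msse4.2 -ftree-vectorize -fopt-info-vec -c scale.c      */
```
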
If you use floating point together with this particular optimization, unsafe math optimizations, you're going to get rounding errors. So you need to be aware of what your device is good at and what it's bad at. And the trade-off here is that these kinds of optimizations can impact binary size, RAM usage, and a lot of other things: if you unroll your loops, your binary is obviously going to get bigger.

Before we come to this slide, I'll just show you a demo. Okay, I'm connecting to my PC. I haven't written the FFT code myself; I'm using a library called KISS FFT, which is a fixed-point FFT library, and I'm optimizing it using the three optimizations I showed you. First it's compiling an unoptimized version, with practically no optimizations. The next time, you can see it's compiling with all the optimizations I showed you. Now let's quickly run both. I've put in a software timer, and I'm running the code in a loop that iterates 10,000 times. You can see it takes roughly 6.4 milliseconds without optimizations. Let's run it with optimizations. You can see it has come down to 4.6 milliseconds. That's the performance improvement you get just from the three optimizations I showed you.

What more can you do? You can rewrite the code to be SIMD-friendly: if you have loops, remove branches from them as much as possible, and remove printf statements. You can use smarter compilers, like vendor-built compilers; ICC is Intel's compiler. I tried using it on the Edison, but I didn't have enough memory. You can overclock or underclock your device; that's not possible on the Intel Edison, but on boards like the Raspberry Pi it's very popular. And you can run your code with minimal OS modules: if you're running Linux, you can cut down your modules, or you can run bare metal. That's also going to make your code faster. That's it. Any questions?

You have many projects, sir.
Actually, I'm currently building a project with the Intel Edison and Raspberry Pi: a machine learning algorithm to classify mosquitoes based on their sound. Is there anywhere we can see that? We're planning to; we're still writing a paper, so once we publish the paper, I think it will be available. All right, cool.