Hello, everyone, and welcome to my talk. I hope you had a good session in the keynotes this morning. Now I'm happy to take you on a journey from the world of cloud native computing to the world of edge computing. On this journey, we will learn how Kubernetes can be used to control and manage AI and machine learning pipelines on edge devices. A little bit about me: I work as a solutions architect in one of the best teams within Cisco, called Emerging Technologies and Incubation.

My talk today has two sections. I will first talk about why AIoT and what is important about it: when we bring AI and IoT together, what are the emergent behaviors, and what are the patterns that help us address those behaviors? I'll also show you a comprehensive reference architecture for building an AIoT ML solution, along with a reference implementation, and then a live demo. The demo is the most interesting part. I had a lot of fun building all this hardware. I spent many weeks doing it, on three hours of sleep, but I will share the whole journey with you. And this is one of the reasons I was able to come up with a reference architecture: the lessons learned in the journey of building this demo are what show up in the reference architecture.

So what is AIoT? We all know and have heard that the number of internet-connected devices is growing each year. Most industry analysts say that within the next three years we will have close to 55 billion devices. But that's where we reach a point of diminishing returns. What we find is that the more devices we connect to the internet, the fewer insights we're getting. And at the same time, we're also seeing that as these devices connect to the internet, the cybersecurity attack vectors keep increasing.

So what is the solution? The solution lies in looking closely at what the devices do when they connect to the internet. The primary reason to connect to the internet is to push the data to the cloud tier so that we can generate insights and power our business applications. And that leads to the solution: what if the power of making insights and inferences is brought down closer to these embedded devices? But that's easier said than done. Once we do that, there are a lot of problems that arise in putting them together, and that's what I'm going to talk about in the emergent behaviors.

The first thing that we see, as soon as we start to think about bringing machine learning to edge or embedded devices, is the computational complexity of running machine learning on embedded or resource-constrained devices. It is a big challenge. Most machine learning frameworks are just too bulky to run on embedded devices. We also see that there are insufficient metrics to measure the performance of a machine learning algorithm. The traditional FLOPs and MACs are very generic. When you go to, let's say, a TPU accelerator or a GPU device from a particular vendor, you see those metrics are just too coarse-grained; they lack the fidelity to measure performance. And measured performance is key, because these devices have limited power, bandwidth, and computational resources. We also find that as machine learning is brought closer to these devices, the optimization strategies that we frequently use on the cloud tier erode the training and the model accuracy, and that leads overall to degradation of the model. And that's when things like drift detection become important when you bring machine learning to the edge.
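Drift detection comes up again later as a dedicated pipeline step, so here is a minimal sketch of what an edge-side drift check could look like. The talk doesn't show the actual drift-detection code; the two-sample Kolmogorov-Smirnov test, the window sizes, and the threshold below are illustrative assumptions.

```python
# Hypothetical drift check: compare a window of fresh sensor readings against
# the distribution the current model was trained on.
import numpy as np
from scipy.stats import ks_2samp

def has_drifted(reference: np.ndarray, recent: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if the recent window looks statistically different from the reference."""
    _statistic, p_value = ks_2samp(reference, recent)
    return p_value < alpha

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5_000)   # readings the deployed model was trained on
live = rng.normal(0.4, 1.0, 1_000)       # a shifted window coming off the sensor
if has_drifted(baseline, live):
    print("Drift detected: trigger retraining on the edge training tier")
```

If a check like this fires, the training pipeline described later can be kicked off on the edge rather than shipping raw data back to the cloud.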
What we'll also see is that there are multiple incompatible computational architectures on the edge tier. The biggest challenge I had doing this was that a lot of the stacks don't work on ARM64. They work fine on AMD or x86-based machines, but on ARM they just don't. You also see that when you run containerization on ARM-based devices, a lot of things that work in a normal container runtime do not seem to work on ARM devices. And I'll talk about some of those in my talk and in the demo.

My presentation froze. Is it going to the next slide? I'll give you a second. I think I'm having some problems controlling this. We need some help here. I'm good to go now. Sorry about that. Thanks for being patient. OK.

So one of the key problems that I encountered while building this demo and looking at how we can bring machine learning to the edge tier was, as I said, the complexity of these algorithms. So let me bring up this complexity chart. I put together a little table showing some of the most commonly used machine learning algorithms, and the one I'm using for this demo is the logistic regression algorithm. If you look at it closely, the first column, size, shows space complexity, and the training and inference columns show the time complexity of the algorithm. Notice here that the training complexity of logistic regression is polynomial time, but the inference is linear. And therein lies my primary hypothesis, and the demo will show that.

My hypothesis is that the computational power on the edge devices is sufficient to run inferences, but we also have to bring the training down to the edge tier. And to do that, we have to think about the entire machine learning ops pipeline running on the edge tier. It is not sufficient to run the pipelines on the cloud and then wait for the downloads to happen on the edge tier; the entire pipeline should be running on the edge tier. To do that, I uncovered some solution patterns that I'm going to talk about now.

The first and most important pattern is to express the machine learning pipelines as DAGs. There are a lot of complex dependencies among the steps of the machine learning pipelines, and you express those as DAGs. There are a few tools for this, and I'm going to show one of them, called Argo, to express the dependencies as DAGs. You guys familiar with DAGs? Show of hands? Great, you're experts on that, I know. And then use the pipelines for, yes, question. And use the workflow pipelines for continuous evaluation. What I mean by evaluation is measuring drift and running validations continuously on the edge tier, and then training the models and producing the actual models on the edge tier.

And then, and this is one of the very interesting things that I found, and where Kubernetes is really helpful, use hardware-accelerator-aware pod placement strategies. What I mean by that is using the rich expressions and syntax in Kubernetes, taints, affinities, anti-affinities, and labels, you can control where a particular workload lands. A workload that, let's say, requires GPU-based acceleration versus a workload that requires TPU tensor acceleration: you can place them and give the right hints to the scheduler on where to place them. Because in the edge world, not every device is uniform, unlike the cloud world where each VM is the same. On the edge, we have to be very careful about where things run.
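To make the accelerator-aware placement idea concrete, here is a small sketch using the official Kubernetes Python client. The node label, taint, and image name are assumptions for illustration, not the values used in the demo; the same intent can be expressed directly in a pod manifest.

```python
# Hypothetical sketch: schedule an inference pod only onto nodes labeled as
# carrying an Edge TPU, and tolerate the taint that keeps other workloads off them.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="tpu-inference"),
    spec=client.V1PodSpec(
        node_selector={"accelerator": "edgetpu"},          # assumed node label
        tolerations=[client.V1Toleration(
            key="accelerator", operator="Equal",
            value="edgetpu", effect="NoSchedule")],        # assumed taint
        containers=[client.V1Container(
            name="inference",
            image="registry.local:5000/pycoral-inference:latest")],  # assumed image
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

A GPU-bound training step would use a different label (and typically an affinity or taint of its own) so the scheduler never lands it on a TPU or plain ARM node.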
And some of these libraries are so tied to the hardware that if Kubernetes places them on a different machine, they just won't work. Some other architectural patterns are about how to containerize the workloads, and this is easier said than done. On the edge tier, running containers is pretty challenging, and you will see in some of the demos and the code I'll show you that the iptables, the basic networking, just doesn't work correctly the way it works on the cloud tier.

Another pattern I think is very important is using a streaming API sidecar, and I'll talk about it more in the reference architecture. The sidecar allows decoupling the inference that runs on the edge tier from the communication. The sidecar talks to Kafka and MQTT and all the communication pieces, but you can keep the inference module focused on just doing the inference: it only has the machine learning libraries in it and nothing else. There's a reason for that. As you put more stuff into these containers, they become very bulky, and you won't be able to deploy them on these machines. You need to decouple them. And then automate the orchestration of containers. There are some other embedded ML patterns that I'll leave on the slide deck, and you can read them at your leisure later.

So now let me bring all of this home. There are a lot of moving parts here, so I'm going to talk about a reference architecture. I spent a lot of time painstakingly putting all the pieces together. Because it is comprehensive and has a lot of parts, I'll take some time to go over it.

This architecture for running AIoT MLOps on the edge tier has four hardware tiers. The first tier is the training tier, which runs all the ML pipelines that extract the data, validate, train, transform, quantize, and all the good stuff that happens on the training tier. The second tier is the platform tier. This tier hosts various services such as a private container registry, which is important for the edge tier because you just can't be banking on Docker Hub or Google Container Registry for downloads. There is no access to those on the edge tier, or even if there is, the bandwidth is limited. Then there is an OTA ML repo. I'll talk more about OTA in a moment: your phones get an OTA, over-the-air, update when new firmware arrives. On these edge devices, especially the ones that are microcontrollers, there are no file systems, so you can't just download a model; you have to send the entire firmware and reflash the device. So the OTA repo and its services, and I wrote a custom repo in Golang to do this, allow you to download the firmware to the MCU-based devices and reflash them. And there are a few pieces that allow us to do protocol bridging: the MCU-based, microcontroller-based devices use MQTT, but the rest of the architecture on the platform tier is all Kafka-based, so there's a protocol bridge, and various other services.

The third tier is the inference and IoT tier. For the folks connected virtually, you won't be able to see it, but I'll show you a picture of what's set up here in front of the stage; you'll also see these tiers on this demo board here. The inference and IoT tier has two classes of devices. The first class is the Linux ARM-based devices. The dev kits commonly used here are NVIDIA Jetson Nanos, Raspberry Pis, and Google Coral TPUs. They run Linux, or some distro of Linux, and the CPU architecture is primarily ARM-based.
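As a rough illustration of the streaming API sidecar pattern mentioned above, here is a minimal sketch in which the inference container writes JSON results to a local socket and the sidecar owns all broker communication. The broker host, port, and topic names are assumptions, and the calls follow the common paho-mqtt 1.x style; a Kafka producer would slot in the same way.

```python
# Hypothetical sidecar: receives inference results from the co-located inference
# container over localhost and republishes them to the MQTT broker, so the
# inference image never needs any messaging libraries.
import json
import socket
import paho.mqtt.client as mqtt

broker = mqtt.Client()                       # paho-mqtt 1.x style constructor
broker.connect("platform-tier.local", 1883)  # assumed Mosquitto broker address
broker.loop_start()

server = socket.create_server(("127.0.0.1", 9000))  # local-only endpoint within the pod
while True:
    conn, _ = server.accept()
    with conn, conn.makefile() as stream:
        for line in stream:                           # one JSON inference result per line
            result = json.loads(line)
            broker.publish("aiot/inference/level2", json.dumps(result), qos=1)
```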
So you can run most of the stuff that we are familiar with from the world of cloud computing here; with the right containerization and the right stripping down of some platforms, they do work. The second class of devices is where things get interesting. These are the RTOS superloop-based devices. Do folks know about RTOS? RTOS stands for real-time operating system; that's what a lot of embedded devices run. Superloop: if you are familiar with Arduino programming, there's a big setup and a loop that runs forever. That's a superloop device. These devices are extremely small. They don't have file systems and gigabytes of memory. They need special attention, and the architecture and the reference solution I'll show you address those concerns. These devices run embedded inference modules, and those inference modules then communicate over MQTT with the rest of the system.

The analog tier is what you see on the extreme left-hand side. That's where I have an induction motor connected to a bunch of sensors that take the analog world, digitize it, and run some digital signal processing to filter out inconsequential data, or noise. I'll also show you how you can run inferences, machine learning, to filter out noise in addition to DSP.

So those are the four tiers. Now there are layers of services that span these tiers. The first layer is the message broker; this is an MQTT-based broker. Then there are the RTOS and superloop OTAs; this allows us to download new firmware as a new model shows up. And then there are three other layers. The bottom-most layer is the event streaming broker. The next layer is the container management that spans all these tiers. And what brings, controls, and configures all the pipelines together is an MLOps workflow pipeline management layer. So that, folks, is the reference architecture. I know there's a lot going on here.

What I'm going to show now is an actual implementation based on this architecture. I will not implement all the pieces of the reference architecture in the reference implementation, but some of the key ones are there. And in the demo, you will also see how it actually happens in the real world, so it's not just in the PowerPoint. Can I get a quick time check? Thank you.

Okay, so here's the reference implementation. The key tool that I'm using here is K3s. It is a scaled-down version of Kubernetes. A shout-out to the Rancher folks, great product; it really made it easy for me to do this on edge devices. Argo, great tool again. TF Lite and Strimzi. Strimzi is a Kafka operator that runs Kafka on Kubernetes, and it's pretty lightweight, so I was able to run it on most of the edge devices, except a few.

So let's take a look at this architecture. Okay, I'll have to switch again to my main laptop because this is an outdated version of the presentation. I was making some changes this morning, so let me quickly switch. Hold on, folks. Okay, good. Okay, so back to the reference implementation. As I said, in the reference architecture there are four tiers. The training tier that you see on the board in front of you is running on the NVIDIA Jetson Nano device. This device turned out to be quite powerful; it has a GPU coprocessor and a good amount of computational power. And this device hosts the entire training tier: the ML jobs to extract training data, normalize it, detect drift, train, and then quantize. Are you guys familiar with quantization?
Well folks, quantization is basically taking the big, full frozen graph that comes out of a machine learning training job and converting it into a format that can run on smaller devices. So quantization changes floating points to integers, and the whole process allows us to run the machine learning models on small, resource-constrained devices.

The platform tier is running on a Raspberry Pi and a VM. I'm using a VM because I ran out of Raspberry Pis; it's hard to get hardware these days, all supply chain constraints, so I just couldn't get the right hardware. The VM here is an Ubuntu VM, and its OS is very close to the Raspberry Pi's, so it was a good drop-in for another Raspberry Pi. The platform tier runs the services for the message broker, the protocol bridge, the firmware-over-the-air (FOTA) services for the MCUs, and a private Docker registry.

The inference and IoT tier, as I said, has two classes of devices. The first is an MCU-based device. For this demo, I'm using an ESP32-S device. It runs a real-time operating system, which is hosting a TensorFlow Lite C++ module that talks to the rest of the system over MQTT. It also has a pre-processing stage, again using TensorFlow and a fast Fourier transform, to filter out noise and remove things that just should not be sent to the rest of the tier. They just add so much computation, and these devices do not have the capacity to process messages that really have no consequence; so it's better to filter them out, close to the analog tier. The second class of devices, the ones that run Linux on ARM-based hardware, is a cluster of three Coral dev kits. These three Coral dev kits together form a cluster, and you will see how load balancing and HA and everything happen on this cluster, again using Kubernetes.

And then the interesting part: do you see these lines coming towards the TFLM module? The way this works, as I said earlier, is that when a new model is created in the pipeline, the FOTA service creates a new firmware binary, and when the MCU gets an MQTT message, a subscriber message, saying there is a new model or the current model is stale, the MCU reflashes itself with the new model. On the Linux-based devices, the model is compressed as a TF Lite file that you can easily download over an HTTPS link, so it's much easier there.

The analog tier, again, has the induction motor along with a bunch of sensors that monitor the motor for various conditions. And for the platform services, the layers that span these tiers, the concrete implementations are: for the messaging broker, a Mosquitto broker that runs on the Raspberry Pi; a custom Golang OTA service; and for the Kafka streams, I'm using Strimzi. As I said, for Kubernetes it's K3s, and for the ML workflow pipelines it's Argo Workflows.

Okay, now it's time to demo this. What you see here, primarily for the folks connected virtually, is what is displayed here on the stage. On the stage, you see the three tiers, training, platform, and inference, which have the NVIDIA device, the Raspberry Pi, a bunch of Coral devices, and the MCU module. And on the far left side is the analog tier, which has the motor, and I've labeled all the key components there. So there is an ESP32 MCU; the primary job of that MCU is to collect all the sensor data and transmit it over MQTT messages. And then there are a bunch of sensors around this device.
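Here is a small sketch of the kind of FFT-based pre-processing described for the analog tier: deciding whether a window of raw sensor samples carries enough signal to be worth transmitting upstream. The sampling rate, frequency band, and energy threshold are illustrative assumptions, not the demo's actual values.

```python
# Hypothetical level-1 filter: drop sample windows whose spectrum is basically noise.
import numpy as np

SAMPLE_RATE_HZ = 1_000      # assumed sensor sampling rate
ENERGY_THRESHOLD = 5.0      # assumed magnitude below which a window is ignored

def worth_transmitting(window: np.ndarray) -> bool:
    """Return True if the window contains a dominant frequency worth reporting."""
    spectrum = np.abs(np.fft.rfft(window - window.mean()))
    freqs = np.fft.rfftfreq(window.size, d=1.0 / SAMPLE_RATE_HZ)
    band = (freqs > 10) & (freqs < 400)   # ignore DC drift and very high-frequency noise
    return float(spectrum[band].max()) > ENERGY_THRESHOLD

samples = np.random.default_rng(1).normal(0.0, 0.1, 512)  # mostly noise
print(worth_transmitting(samples))                         # likely False: nothing to send
```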
So I think at this point, before I turn on this demo, I would like to take you under the hood to see what is actually running on these devices. First I will show you what is running on the Coral TPU device, and then I'll show you what is running inside the MCU device.

What you see here is a code snippet of the PyCoral module that is running in a container managed by Kubernetes on the Coral device. On the left-hand side are the actual code snippets. You see these libraries, the PyCoral adapters. Now, these libraries will work only on a device that has a TPU accelerator. And this is an interesting thing: when I was trying to run these libraries on a test device, a lot of them silently fail and you don't realize what's going on. It turns out that there are certain C++ modules that load the TPU accelerator drivers, and they work only when they detect a TPU accelerator. The rest of the code loads the TensorFlow Lite libraries. Has anybody here programmed with TensorFlow Lite? Yeah, so this gets really bare-bones. It doesn't have all the nice APIs and support that you see in Keras or the other TensorFlow libraries. You have to go down to setting up the input tensors and output tensors, understanding the data formats, running a prediction, and then getting the output tensors. And that's what you see in a lot of the code here. On the right-hand side is the TensorFlow Lite C++ module source code. This is the counterpart of the Python TensorFlow Lite, purpose-built to run on microcontroller-based devices. All those libraries are there, and then again you set up the input tensors; generally, the flow of the code is the same. These modules are running on the ESP32 device.

Now, the next thing before I start the demo is to understand the Argo workflow pipeline. That's key to the entire machine learning AIoT pipeline. As I said, the steps of the pipeline, extract, detect drift, train, and quantize, are all running on the NVIDIA device. So let me show you how this is expressed in an Argo DAG. The extract step connects to a Google Storage bucket and downloads training data. This is raw training data, and the step then transforms and normalizes it. If you look at these little rectangles that are showing up here, you are seeing how this step is expressed in an Argo workflow DAG. The step here is shown as an extract task. But what you see on the far right side, in the yellow highlighted box, is the corresponding Kubernetes container specification that Argo will use to orchestrate and run this container on an edge device. And you can also notice the node selector here, which is what I'm using to control and force a particular workflow step to run on a particular node.

The next step is drift detection. It has a dependency, and that's how you use DAGs: it has a dependency on the extract step. That's what you see here, expressed in the dependencies section of the DAG. Following that is the training step. Training has two dependencies: it requires the extract step to complete and the drift detection to be completed. And finally there is the quantize step, which takes the frozen file and transforms it into a TF Lite module. Here are the four devices that I have used. So now it's time to set up the demo. The way I'm planning to do it is I'll run the demo and show you a few things about how they work.
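To make the DAG structure concrete, here is a sketch of how the extract, drift-detect, train, and quantize dependencies could be expressed as an Argo Workflow, generated from Python for readability. The task names mirror the steps just described; the image names, registry address, and node selector are assumptions rather than the demo's actual manifest.

```python
# Hypothetical Argo Workflow skeleton for the edge MLOps pipeline.
import yaml

STEPS = ["extract", "drift-detect", "train", "quantize"]

workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "aiot-mlops-"},
    "spec": {
        "entrypoint": "pipeline",
        # Pin the whole training pipeline to the Jetson node (assumed hostname).
        "nodeSelector": {"kubernetes.io/hostname": "jetson-nano"},
        "templates": [
            {
                "name": "pipeline",
                "dag": {
                    "tasks": [
                        {"name": "extract", "template": "extract"},
                        {"name": "drift-detect", "template": "drift-detect",
                         "dependencies": ["extract"]},
                        {"name": "train", "template": "train",
                         "dependencies": ["extract", "drift-detect"]},
                        {"name": "quantize", "template": "quantize",
                         "dependencies": ["train"]},
                    ]
                },
            },
            # One container template per step; images come from the private edge registry.
            *[{"name": step,
               "container": {"image": f"registry.local:5000/{step}:latest"}}
              for step in STEPS],
        ],
    },
}
print(yaml.safe_dump(workflow, sort_keys=False))
```

The printed YAML is the kind of manifest you would hand to the Argo CLI with `argo submit`, which is how the pipeline gets kicked off in the demo that follows.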
And then, if we have time, because I'm sitting between you and lunch, I wanted to show you how to tear this whole thing down and build it from scratch, but I doubt I'll have time for that. So let me turn on the demo, and we'll see things in action. To do that, I will open up this view of what is running right now on this cluster. Let me expand this, and this gets really interesting. Let me show you the cluster first. What you see here is the control node, running on a VM, and then you see a bunch of agent nodes: there's a TPU1, TPU2, and TPU3, and those are running on the Corals. So let me, I think, I'll have to share my screen, right? Or can I move it to this? You see it now? So this is the view of the cluster. It's too big. No, it's good. Good? OK.

So what you see here is the view of the edge cluster. I have the three TPU nodes, and they're running here; you can see their IP addresses and their OS images. There is an NVIDIA Jetson and a Raspberry Pi. And this is the view of what pods are running on the cluster. If I had the time, I would have shown you how to build this thing. So you see the Strimzi cluster; this is the Kafka operator running on this cluster. Then you have the Argo server running. And this is the model registry that stores the downloaded files: the frozen files, the training data, and the OTA firmware images for the MCU devices. And the rest of the jobs are basically the workflow DAGs that run this pipeline here.

Let's see if I can... let me delete this; we'll run this again. I'm using the Argo CLI to submit this DAG. As soon as I submit it, you'll see a pipeline start. Now, what is happening here, is it visible? These things are in a small font; I can't make this bigger. The extract step just ran. What it did was connect to the Google Cloud Storage bucket and download a training data file, and you can see the container logs here. This is a very good tool; it shows that the model download is complete. Next, it's running a drift detection job. And if we go back here, you can see the progress of these various workflow jobs on Kubernetes with kubectl, and notice the nodes they're running on. You see this is being scheduled on the NVIDIA Jetson, and the job is complete. Now the training job is running, so let's look at the training logs. You can see it's dumping some of the internals of what is running on the NVIDIA box. This takes a few seconds to complete. And this is where I found things interesting: the NVIDIA device has the power to run this training. I'm not running too many epochs, and it's not too complex, but with the right set of hardware and a cluster, you should be able to run these trainings on the edge tier.

And the model, you see, is uploaded to this model registry. We can open this registry. As I said, this is a custom registry I wrote in Golang, so it hosts training data. As it's downloaded, you'll see the model. The model is a zip file, and if I open this zip file, people who are familiar with TensorFlow will see the .pb files here. Then, as the training runs, the next stage is to quantize it, and as the quantize step completes, the quantized model will be uploaded here. Let's take a look at what's running right now. The pipeline is completed, and now the actual inference modules have been deployed to the cluster. So let's take a peek into this particular pod and look at its files. And that's where you'll see things in action.
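For reference, here is a minimal sketch of what the quantize step could look like using standard TensorFlow Lite post-training integer quantization. The model path, input shape, and calibration data are illustrative assumptions; the demo's actual quantize container isn't shown.

```python
# Hypothetical quantize step: SavedModel in, int8 .tflite out.
import numpy as np
import tensorflow as tf

def representative_data():
    # Real sensor windows from the extract step would be used for calibration here.
    for _ in range(100):
        yield [np.random.rand(1, 64).astype(np.float32)]   # assumed input shape

converter = tf.lite.TFLiteConverter.from_saved_model("/models/motor_fault")  # assumed path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
# Full integer quantization so the model can run on the Edge TPU and MCU targets.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("/models/motor_fault_int8.tflite", "wb") as f:
    f.write(converter.convert())
```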
As I turn on the hardware and the analog tier, you'll see messages flow in, and you'll see them processed here, in the inference tier. I'm tailing a log file using an SSH command into this container; hopefully it works. Let me turn on the demo now, and we can see this all in action. So I'm going to power on the MCU, the induction motor. Oops. And at some point, we will see the messages flow in. Those MQTT messages are coming in from the device. Now let me introduce a fault in this motor and see what happens. If the fault is detected, you'll see this MCU's blue LED start blinking 15 times, and then it will send a message to the cluster here. That green light there is actually showing that an inference job is running on that particular node of the cluster. So let me just cut off the power to show that something is wrong. What is happening is that it detects a particular failure on the system; there is an inference model running there, and that LED blinked 15 times. It then sends a message over a bridge to the modules there to process the information and actually make inferences about what conditions led to this particular failure. So folks, with that, my demo is complete. I wish I had the time to show you how I built it. But at this point, let me...

All right, so this message here is important. You see an inference level 2 message showed up here. What that means is, the way I internally classify inferences is that if the inference is running on the analog tier, close to the MCU, on the sensors, that is inference level 1; that is mostly pre-processing. Inference level 2 runs on another MCU, the one running the TensorFlow Lite module, which looks at that message and sees if the logistic regression algorithm is detecting a particular condition; only when a condition is above a certain threshold does it transmit the message to the rest of the tier. So that is inference level 2, which means the conditions have been pre-processed, and only the signals that have consequence are being sent further up the cluster. That's the staged inference I'm trying to show here: the first level of filtering happens on the analog tier using DSP, the second level happens on an MCU, and the third level happens on the TPU cluster. So that's it. That's my demo. Open to questions now.

Awesome, thank you. Yes, we are open to questions. While we're getting started here, we do have one from the internet: great presentation, where can we get more information about your reference architecture? That's a good question. I'm actually in the process of writing a blog, and I will put all of the source code on GitHub, so you will have a chance to enjoy this demo and the source code. Here, I'll go to the closer one first, sorry.

Hey, thank you for the demo, it was really interesting. My question is, can you go over a bit the problems that machine learning will help solve for IoT devices? Yeah. So the question is, Jim, are you repeating the questions? If you can, repeat them for everyone. So the question is, what class of problems can machine learning running on the edge solve? It's a good question. For an IoT engineer like myself, the first problem that I see is closed-loop decisions, wherein, when the inference or the actual machine learning logic tells you something is wrong, the device can actually trigger what's called a closed-loop actuation.
That means it can tell another set of hardware to intervene or do something. It doesn't have to wait for this information to go up to some back end in the cloud and come back; there's too much latency there. That's number one. The second set of possibilities that opens up is being able to keep the data itself on the device. There are a lot of concerns around privacy and attack vectors when these devices connect to the internet, and we can resolve that. Any more questions? Yeah.

Why did you separate the training layer from the inference layer? Rather, why did you put the training on one device and inference on another? More specifically, why don't you use the NVIDIA Jetson to do both training and inference? Are you running those two concurrently? OK, the question is, why did I separate training from inference? If you go back to the complexity chart I showed you earlier, the amount of computation required to do training is an order of magnitude higher than inference. So the separation ensures that you have the right hardware, the right class of hardware, and the right class of accelerators to power the training computation. But you just said the NVIDIA device can do not only the training but obviously the inference as well. It can. OK, so I see your question. The amount of power and network bandwidth required to run training would be overkill on the inference tier, because the inference devices are running close to the analog devices and the sensors, and a lot of the time those devices are battery-powered. So this makes sure that the training jobs do not consume all the power and computation, leaving nothing to process the sensor data. But if that is indeed the case, why not just train on a really large GPU in the cloud? Why place the GPUs, why use an edge GPU, if you're not going to put it right at the edge? This is not a criticism; I'm just curious why you chose this architecture. As I said, the reason the training is brought to the edge is so that the edge can stay, in a way, decoupled, disconnected from the cloud. So all the concerns around privacy, bandwidth, and latency can be resolved by running these things completely on the edge.