So, hello and welcome to my talk about stream processing at the edge with Apache Kafka. My name is Jakub Scholz, I work at Red Hat as an engineer, and I'm also a maintainer of a project called Strimzi, which is a Cloud Native Computing Foundation project about running Apache Kafka on Kubernetes. That will obviously be one of the things I will use later in the demo.

How many of you know what stream processing is? Right, that's quite a lot of hands. For those who don't know, you can imagine it like this: the world around us is full of different streams of events and streams of data. For example, in a room like this one, when someone opens the door, that's an event; someone opened the door, someone closed the door. Someone entering or leaving the room can be another event. And you can take these events and process them as a stream. So if you take the events that someone entered or left the room and process the whole stream, you can count: one person entered, a second person entered, a third person entered, then one person left, then three more people entered, so in total we have five people in the room. That's information you can build from the stream of events, and you can then use it for more things.

I'm quite sure there is some fire code which says there is a maximum number of people allowed in this room. So you can use this information for some alerting which says: if there are more than 50 people in this room, send an alert message to the fire marshal, and he will come and say, "You two, you are over the limit, out you go." And then of course you can also do things like installing sensors which monitor the temperature, the air quality or the noise level, and combine that information: how does the noise level depend on the number of people in the room? How does the air quality depend on the number of people in the room? And you can do things such as adjusting the air conditioning or the audio system depending on the number of people, for example to conserve energy. So that's roughly how you can imagine stream processing.
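To make the counting example a bit more concrete, here is a minimal sketch in plain Python (not part of the talk's demo; the hard-coded event list and the threshold of 50 are just illustrative) that folds a stream of door events into a running occupancy count and raises an alert over the limit:

```python
# Fold a stream of door events into a running occupancy count and
# alert when a fire-code limit is exceeded. In a real pipeline the
# events would come from a Kafka topic rather than a hard-coded list.
MAX_OCCUPANCY = 50  # illustrative limit, matching the "more than 50 people" example

def occupancy_stream(events):
    occupancy = 0
    for event in events:                 # each event is "enter" or "leave"
        occupancy += 1 if event == "enter" else -1
        if occupancy > MAX_OCCUPANCY:
            print(f"ALERT: {occupancy} people in the room, limit is {MAX_OCCUPANCY}")
        yield occupancy                  # the derived stream: occupancy over time

if __name__ == "__main__":
    events = ["enter", "enter", "enter", "leave", "enter", "enter", "enter"]
    print(list(occupancy_stream(events)))  # -> [1, 2, 3, 2, 3, 4, 5]
```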
And now, why does it make sense to do this at the edge? There are a lot of different industries where this pattern fits in. It's connected to how the world is changing, how we are digitalizing everything and, for example, changing the way we produce things such as energy.

In agriculture you can use sensors to monitor the fields: whether they have enough water or enough sun, whether there are animals eating the crops, and so on. In the pharmaceutical or chemical industry you can use it to monitor processes and the environment, because a lot of pharmaceuticals, the COVID vaccines for example, have very strict limits on the temperature at which they have to be stored, or requirements that there must be no sunlight where they are stored. All these things can be monitored with stream processing as well, and when one of the rules is breached, a batch can automatically be marked somewhere as damaged, for example. Or you can use it in sports to monitor performance, or in manufacturing to monitor, say, the robots on a car manufacturing line: when you see that something is going wrong with a robot, you can proactively fix it, because stopping the whole line when it breaks would be too expensive. And you can continue like this: transportation, waste management, logistics, telecommunications, energy, retail, and so on. It fits into all these different areas and industries.

And it's good to process the data at the source, because that's where they are produced, and it gives you several advantages. One of them is better latency and speed. There's one obvious part: if you don't have to send the data into some cloud and then get the results back once they are processed, you save the network latency, right?
But also, if you process things centrally in the cloud, you might have many locations sending data there and you might be queued for processing behind someone else. Whereas if you do it directly at the source, you are not queued behind anyone else, you can better control how much processing power you have, and you can make sure the data are processed really quickly and in time.

Another advantage is that you are more resilient against things such as outages. In some areas you might be running things in the middle of nowhere, where connectivity is bad and unreliable. But even if you have a retail space, a shop or a supermarket in some shopping mall, which in general has good connectivity, it can happen that the connectivity goes down because someone did something wrong. In that case, if you are doing things locally at the source, at the edge, you might not need to close the shop just because the connectivity is not working; you can continue to operate as normal. And in some environments it's not just a question of a connectivity problem; the location might be disconnected for a long time by design. A cargo ship somewhere in the middle of the ocean might not have the best connectivity possible. Similarly, a plane somewhere over the North Pole doesn't have good satellite connectivity either, I think. So there are situations where we are by design not connected, and this fits quite well there.

And last but not least, it obviously impacts the cost as well. If you have ever used a public cloud like AWS, you know that quite often the data transfers are the biggest chunk of your invoice; it's not the VMs or the storage, it's just sending the data here and there. It's a bit similar here: if you don't send the data, if you process them locally, you can save some costs as well.

I also wanted to touch a bit on what I mean by the edge here, what kind of environment I'm talking about, because there are many different use cases and people don't always mean the same thing. In some situations people talk about the edge as IoT devices, but as I'm talking about stream processing with Apache Kafka, these would not really have the power to run it. So you might use them in the environment, but that's not what we will be running the stream processing on. Similarly, in some cases smartphones or tablets might be the edge, and in some use cases that's completely valid, and actually these devices often have enough power to do this these days, but that's not really what this is designed for.
So I'm not really talking about those either. But when it comes to things such as single-board computers like the Raspberry Pi, that's already something where it can work. Now, the Raspberry Pi is not really a professional, industry-grade device, but you can easily buy one, you can do stream processing on it quite easily, and I will actually use some in the demo at the end. Then in many cases you might use some consumer PCs, maybe an old PC or some reasonably powerful PC lying under a table in an office, running a coffee shop or a supermarket or a fast-food chain and so on. That's again completely fine, completely valid; in some cases it's really used like that, and it works. And then of course in many cases it will just be a regular data center with racks and real servers. It just maybe won't have hundreds or thousands of racks, it will maybe have just a few, but even that is something that can be called edge, and obviously you can use it in this case as well.

So, why should you use Apache Kafka for these things? That's something I will focus on a lot; why should you even care? The most lame argument is that it's the leading event streaming platform, right? It has been used for these things for years, it's well proven, it's used by many organizations, and so on. So yeah, why not use it? But there are some more practical arguments as well.

One of them is that it has all the tools you need to do this. You have the Kafka brokers, which always sit in the middle of the architecture; they are the thing which receives the messages from some producers and then passes the messages to some consumers, and you can process them and do whatever you want with them. Then you have the Streams API, which is the actual stream processing engine. The nice thing about it is that it's basically just a library which you can include in your own application. It's not some huge framework with worker nodes and schedulers which is heavy and complicated to run; it's really just something you take as a library, include in your application, and use its features. This leanness fits quite well with the edge. But despite that, it actually has quite a lot of features. You can do all the simple stateless things such as transformations, filtering and enrichments, but you can also do the more sophisticated and heavy things: time windowing, stateful processing with aggregations and storing the data, and all of that can be done while still being scalable, so you can run it in multiple instances to get better performance. Then the next part is the Kafka Connect API.
So, I hope that after the talk everyone will be convinced to use Kafka at the edge, but it would be quite rare that you would use only Kafka there; you would probably have other things as well. So the integration part is important, because you want to be able to take the data and send them somewhere else, integrate with other systems, and that's what the Kafka Connect API, as a Kafka component, does. Then another part is something called MirrorMaker, which can be used to mirror data between different clusters. That's useful because you can copy data between the different environments and share them that way as well. And last but not least, Kafka also has quite a lot of integrations with all kinds of different tools which are not part of the Kafka project itself. For example, if you want to do something with AI or machine learning, there are a lot of tools which easily plug into Kafka as well, so you can pre-train your models somewhere else and then just use them at the edge, plugging them into Kafka to read the events and process them.

So hopefully that gave you an introduction to the different parts, and now let's have a look at how you can do this, as a kind of pattern which you can use. Typically you would start with some input data. You might have IoT sensors feeding you environmental data, you might have applications running on smartphones or on point-of-sale terminals which can again feed you different data, you can have beacons which monitor how people move through a space, you can have cameras and all kinds of other monitoring tools, and all these things can be plugged into it. Then you have the Kafka brokers, which, as I said, sit in the middle of everything; they store the data and distribute them. But quite often things such as IoT devices will not talk directly to Kafka in the Kafka protocol, because that's a bit too heavy and too complicated for them. So quite often you will have some kind of bridge sitting in between, for example using MQTT or HTTP. It gets the data in this simpler and easier protocol from the IoT devices, and then it formats them as Kafka messages and passes them to the Kafka brokers. And then from the Kafka brokers you can connect the different stream processing applications and do the actual stream processing, where you read the data sent by the sensors or the other input devices and process them however you need. Of course you can also connect other clients and other applications, depending on what exactly you are doing. And that's basically our edge environment.

But what's important is that the edge environment doesn't live on its own, right? It wouldn't be the edge if it weren't on the edge of something. So there's typically some other part as well. I call it HQ here, as in headquarters, but you can call it the central cluster or something like that. It's again quite a typical part of the setup: you have something central running, for example in some cloud, or really in a headquarters in an on-prem data center, and you want to link these two parts.
So that's where the MirrorMaker tool fits in, because it can mirror the data between the different Kafka clusters. That's quite important, because when you do the processing, maybe you aggregate something and then want to push the data into the central information system of the company or organization. But it can also work the other way around: quite often you have some reference data or master data which you manage in the central environment through some other tools, and you always need to take them and synchronize them to the edge locations so that they can use the latest reference data. So that's what you can do with MirrorMaker.

What's also important is that usually you don't have just a single edge location and the central cloud; you usually have many of these edge locations, right? So you need to be able to manage and monitor them in some efficient way, because you don't really want to SSH into all of them and then install something or upgrade something or check whether it's still running. You want something that gives you as little work as possible, and that's where things such as Kubernetes, or Kafka operators like Strimzi which help you run all these components, come in, because they will do a lot of these things for you and make it much easier to manage this whole architecture. One of the advantages is that they allow you to run the same software everywhere. You can use Kubernetes and you can use Strimzi, or other operators if you use a database, for example; you can use them in the central cloud and you can use them in the edge locations, even on different hardware platforms. You might have a different situation everywhere, but you can use the same tools, you don't need to learn new things every single time, and you can use them in exactly the same way. So you know that when you are fixing something in one environment, it will be exactly the same as the others.

But thanks to Kubernetes and the operators you should actually not need to do that much, because they help make the environment resilient and self-sustaining. Kubernetes will take care of restarting the processes; the operators will make sure that the applications are running, that they are upgraded properly, that they are maintained and monitored, that the certificates are renewed, and so on; they do all this work for you and you don't need to take care of it that much. And then finally, there's a lot of tooling around automation available for Kubernetes, such as the different GitOps tools, Argo CD, Flux and so on, and you can use those to manage the configuration. With operators you have everything in a declarative manner, as YAMLs, so you can store them in your Git environment and then use these tools to roll them out to all the different locations in various patterns. That makes your life much easier as well.

So let's have a look at a demo. At the end I have a link to the GitHub repo with all the source code, so you can look into it in a bit more detail. Basically, what I have is running at home in my homelab: there's an ESP32-based sensor board with sensors for temperature, humidity, air pressure and so on, and that's connected over Wi-Fi to the Kubernetes
cluster running there. It's using the HTTP bridge to communicate with the Kafka cluster, so the board sends an HTTP request and the bridge converts it into the Kafka protocol for the Kafka broker. Then the stream processing is done by a Kafka Streams API application which, just to demonstrate something simple, creates one-minute windows, calculates the average values over each window, and sends them into another Kafka topic. These are then mirrored by MirrorMaker into the central cloud, where there is a simple application which takes these data and visualizes them centrally.

So it's not really industry-grade equipment, but it is running on real hardware, it's not just something software-emulated. It's using this ESP32 board with the sensors, programmed in MicroPython, and then I have a Raspberry Pi cluster, which has four nodes, though only three of them were running when I recorded the demo. That runs the k3s Kubernetes distribution, and it uses the Strimzi operator to run the Apache Kafka components for us. I hope you will forgive me for using a recording, but it's actually running in a resilient and self-sustaining manner at my home behind NAT, so it's not that easy to do this live.

Let's first have a look at what the IoT board looks like. As I said, it's using MicroPython, which is a cut-down version of Python. If you don't like Python, don't worry, I'm not its biggest fan either, but it's actually quite easy to use, and even with little knowledge you can easily write some things. What you see here are the basic files for it: the BME280 file is the library for working with the sensors, and the boot.py script is what's called when the board starts; it does some initialization work like connecting to the Wi-Fi so it can send the data, and things like that. Then the second file, main.py, is what's actually running there all the time and doing the actual work. It first loads the library for working with the sensor, then it initializes an LED to use as a signal for problems; it will be switched on if something stops working. Then it configures the different settings, like the location of the sensor and the URL of the HTTP bridge where it sends the requests, and then it basically runs in a while loop every second: it reads the data from the sensors, formats a JSON message from them, and sends that using an HTTP POST through the bridge to the Kafka broker as a message. When that succeeds, great, it just does another loop; when it fails, it switches on the LED to indicate that something is wrong. So that's basically what the IoT board looks like.
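The actual main.py is in the linked repository; just to illustrate the shape of that loop, here is a rough MicroPython sketch. The pin numbers, location, topic name, bridge URL and the exact BME280 driver API are assumptions for illustration, not the code from the repo; the record format and Content-Type are the ones the Strimzi HTTP bridge expects for JSON records.

```python
# Rough MicroPython sketch of such a main.py loop (not the code from the repo).
# Assumed: a urequests module is installed, a BME280 driver exposing .values,
# an ESP32 with the sensor on pins 21/22, and a topic called "sensor-readings".
import time
import ujson
import urequests
from machine import Pin, I2C
import bme280

LED = Pin(2, Pin.OUT)  # on-board LED used to signal problems
BRIDGE_URL = "http://my-bridge:8080/topics/sensor-readings"  # Strimzi HTTP bridge
LOCATION = "home-office"

i2c = I2C(0, scl=Pin(22), sda=Pin(21))
sensor = bme280.BME280(i2c=i2c)

while True:
    try:
        temperature, pressure, humidity = sensor.values  # driver-specific API
        payload = {
            "records": [{  # record format expected by the HTTP bridge
                "value": {
                    "location": LOCATION,
                    "temperature": temperature,
                    "pressure": pressure,
                    "humidity": humidity,
                }
            }]
        }
        response = urequests.post(
            BRIDGE_URL,
            data=ujson.dumps(payload),
            headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        )
        response.close()
        LED.off()   # sending worked, make sure the error LED is off
    except Exception:
        LED.on()    # signal that reading or sending failed
    time.sleep(1)   # one reading per second, as in the demo
```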
Now let's have a look at the edge location. As I said, it's a Kubernetes cluster; it has three nodes in this recording, because I had one of them switched off, and it's running the k3s distribution. When I check the pods running there, you can see the Strimzi operator, which runs all the Kafka components, and then you can see the Kafka cluster with the Kafka pod and the ZooKeeper pod. You can also see the entity operator, which is used to manage the users for security and the topics where the data are sent. Then we have MirrorMaker, which is doing the mirroring, and finally we have the bridge, which does the bridging between the HTTP world, where the IoT device sends the data, and the Kafka broker.

And now, because it's all running, we can start a Kafka consumer and connect it to the topic in the Kafka broker where the data are being sent from the device. By the way, the whole Wi-Fi setup the device connects through is actually running in a pod as well; I find that quite neat, but I don't take any credit for it, I think I copied it from some colleagues who maybe copied it from somewhere else. But now let's run the consumer, and that should show the raw data as they are being sent by the sensor. So we should see, roughly every second, a message in JSON format with the readings from the sensor. It takes a few seconds to start the consumer pod, and now you can see the messages as they are coming in through the bridge from the IoT device.

Once we have them in the Kafka broker, we can move on to the stream processing. There's one other pod running in this namespace, and that's the aggregator pod, which runs the Kafka Streams API application. It actually runs inside the Quarkus framework, or toolkit, which makes it easier to run and does some special tricks at build time to optimize it. What it's basically doing is reading the data from the Kafka broker every second, creating one-minute windows, window after window, over them, calculating the average for the given minute, and then, when the minute window ends, taking the average value and sending it to another topic in the Kafka cluster. Obviously this is more for demonstration purposes, it's not the most sophisticated processing possible, but it should give you an idea of how this can work.
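The real aggregator is a Java Kafka Streams application (its source is in the linked repository), so the following is only an illustration of the same windowed-average logic in plain Python using the kafka-python client; the topic names, the bootstrap address and the choice to average only the temperature field are assumptions:

```python
# Illustration only: the demo's aggregator is a Java Kafka Streams application.
# The same idea in plain Python with kafka-python - tumble the readings into
# one-minute windows and publish the average of each finished window.
import json
from collections import defaultdict
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

BOOTSTRAP = "my-cluster-kafka-bootstrap:9092"   # assumed bootstrap address

consumer = KafkaConsumer(
    "sensor-readings",                          # raw readings (assumed topic name)
    bootstrap_servers=BOOTSTRAP,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

windows = defaultdict(lambda: {"sum": 0.0, "count": 0})

for msg in consumer:
    window_start = msg.timestamp // 60_000 * 60_000   # 1-minute tumbling window (ms)
    w = windows[window_start]
    w["sum"] += float(msg.value["temperature"])        # average just one field here
    w["count"] += 1
    # Naive window closing: once a newer window appears, flush all older ones.
    for start in [s for s in windows if s < window_start]:
        done = windows.pop(start)
        producer.send("sensor-readings-aggregated", {  # assumed output topic
            "windowStart": start,
            "avgTemperature": done["sum"] / done["count"],
        })
```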
Now, when I open a consumer against this topic with the aggregated data, we should see basically the same kind of message, but sent just every minute, because it's the result of the aggregation; it's the average value, which smooths out the numbers we are getting from the sensor. So that's the edge location, and the last part is the central location, where we get the data from the edge locations. Right now I have only the one at home, but in reality you would have many of them feeding data there. What we can see here is again the familiar thing: the Strimzi operator on the Kubernetes cluster running the whole Kafka cluster, this time a bit bigger so it can handle more data and more things, and with some more components.

But then there is this front-end application which reads the mirrored data from the edge and shows them on a map; it also keeps them as history in Prometheus, so that you can look at how the values were evolving. What we can also check is how the data are being mirrored from the edge location. I will again run the consumer; it's the same as last time, but this time it connects to the central cluster and not to the Kafka cluster at the edge. We should again see basically the same messages, but this time they are the messages mirrored using MirrorMaker. So here's one, and if we wait another minute, we will get the next one, and so on.

But let's switch to the browser, and just ignore the "for development purposes" watermark; that's what you get when you use Google Maps without entering a production API key. What you can see is that it shows the map, it shows a pin with the location of the sensor, and it gives you the details; when you click on it, you can get to the Prometheus chart, which gives you the history and shows you how the values were evolving. So that's the demo part. It's quite simple, sorry, it doesn't have many things which are not focused on the pattern, but it should hopefully demonstrate well how Kafka helps in this area.

And that's it for the talk. This is the URL where you can find the demo; it will redirect you to the GitHub repository. Thanks for listening, thanks for watching, and if you have any questions, we should have some time for that.

So, just to repeat the question: the question was, if someone runs Kafka in a containerized Kubernetes environment, how do you get access to it from the outside? It's quite complicated, because Kafka has its own discovery mechanism, which makes this difficult. Basically, you need to configure multiple listeners: one of them for the internal applications running inside the Kubernetes cluster, and another one for the applications running outside. In the external one you have to use, for example, the address of some node port or some load balancer to advertise it to the clients, and then use this listener for them. If you use Strimzi running on Kubernetes, it has four different types of external listeners, using node ports, load balancers, ingress or OpenShift routes, and you can just configure them; it does all these things in the background for you and then just gives you the address to configure the client with. But yes, if you do it yourself, you basically need to play with the advertised host configuration in the Kafka config. Any other questions?
Yes, so just to repeat the question: if you have multiple edge locations, with basically multiple MirrorMakers synchronizing their own data, would it be synchronized into the same topic or into a different topic for each location? I think it depends a bit. With a project like this, where we are just sending some sensor data, the same topic makes the most sense, because then it's much easier to consume. But there definitely might be cases where different topics make more sense, for example when you work for an organization which has different divisions, and so on.

So the next question is about the HTTP bridge, whether it is a standard component or a different project, and so on. What I'm using here is directly from the Strimzi project; it has its own HTTP bridge, and you can deploy it with the operator just by specifying a custom resource, so that's the easiest, and that's what I used there. There are some other bridge components out there as well: there is the Confluent REST Proxy, which does HTTP proxying as well, and Confluent has an MQTT proxy too, which you can use if you want MQTT, for example, and there are others as well. As a Strimzi maintainer I obviously used the Strimzi bridge, but if you want, you can use others.

Sorry, what? Yes, so the question was how do I configure the aggregation part. It's using the Kafka Streams API. I should have prepared it so I could have shown the source code, but the sources are in the GitHub repository as well. Basically, in the Streams API you just say: start from this topic, specify the Serdes for deserializing and serializing the messages into Java objects, and then you get the messages as objects and can work with them in the Streams API DSL. So you start with the from, which in this case gives you the data, and then you use the DSL to say: with windowing, with one-minute windows. The most complicated part is actually writing the custom aggregation to calculate the averages, but otherwise it's quite simple; if you check the GitHub repo, it's something like three Java files all together.

So, I'm not sure exactly how easy that would be. I don't think it would be that easy to use this to distribute commands which are specific to particular edge locations, because you would, for example, need different topics for each location to be able to say "location with this ID, do this", or you would need to mirror the commands to every edge location and then let each edge location filter out whether a command is meant for it. But it's super useful if you have some reference data. If you, say, use it in a supermarket to update all those digital price tags, then when someone changes the prices in the central system, you can roll out the updates to all the edge locations this way, or other kinds of reference data where basically the same data can be sent everywhere and applied to all locations. I think that's where it fits a bit better than some kind of command pattern. Anything else? In that case, thanks for joining me and thanks for your time.