So thank you everyone. Thank you for joining us for this session. I really hope that you are going to enjoy it. As already mentioned by Lubomir, it will be about using Formula 1 telemetry data and doing some processing using Kafka Streams. So let us introduce ourselves. First of all, I am Paolo Patierno. I am a principal software engineer at Red Hat, working on the Apache Kafka team and on the Strimzi upstream project, where Strimzi is a CNCF project about running, deploying and managing Apache Kafka on Kubernetes. And I am presenting today with Tom. Introduce yourself, Tom. Hi everyone. I'm Tom Cooper and I work on the same team as Paolo, on the Strimzi project, and I've got a background in stream processing. Thank you, Tom.

So let's see what we are going to cover today. We are going to see how to build an event streaming pipeline using, as already mentioned, Formula 1 telemetry data as an example of events that we can ingest into our system. So what are the kinds of problems that we want to solve every time we need to build an event streaming pipeline? First of all, we need to ingest events in a reliable manner: we don't want to lose events, we want to store events, things like that. The other part is that quite often we need to integrate different systems with our ingestion system. The events are coming from something that is not able to speak, for example, the same protocol as the ingestion system, so we want to integrate the system the events are coming from and ingest them into our final ingestion framework. Then, on the other side of the pipeline, we want to provide some outcomes, some insights from the events that we are getting in the pipeline itself. Of course, ingesting events is only useful when you are doing something with these events, with the data that is coming into your pipeline, so we will see how we can process the events in real time, to get insights in real time. Finally, if you have events, if you have data, and you are processing this data, you would also like to show some insights, some outcomes of this processing somewhere. Most of the time these are just dashboards that you can use for showing the data flowing through the pipeline. We will also see how to run and deploy the entire pipeline somewhere.

So let's dig into it more deeply. The way we are going to show this demo is starting from Formula 1 telemetry data, as I already mentioned. Of course, we don't have a real Formula 1 car, and it's quite impossible to find real Formula 1 data on Google, or on the internet in general. What I found, thanks to my son, 8 years old, playing with the Xbox, is that the Formula 1 2020 game from Codemasters provides the telemetry data in real time over UDP. So what we did first of all was write a library to decode the Formula 1 packets coming over UDP into corresponding POJOs, Java classes that we can handle easily. So the first problem to solve is where to ingest this telemetry data coming from the Xbox. One of the most well-known systems for ingesting data is Kafka. Kafka is first of all a messaging system that you can use for publish/subscribe, but over the years it has become one of the most important data streaming platforms for event ingestion. Because Tom and I work on Kafka, of course we chose Kafka as the ingestion system for this demo. But the first thing to solve is how to go from UDP packets coming from the Xbox to Kafka.
You might think that we could just write an application listening on a UDP port and then using the Kafka producer API to write to a Kafka topic. But to do that, there is quite a lot of code to write, right? So we decided to use Apache Camel. Apache Camel is an integration framework for integrating different systems. You can easily write an application using a DSL to describe what they call a route, that is, the path that your data has to follow; in this case just reading from UDP and then writing to some Kafka topics in a really simple way, as we will see in the next slide. The next part of the pipeline is a Kafka Streams API application that will do some processing, and Tom will explain more about that later. And then finally there is another Camel application to integrate Kafka with InfluxDB, so getting the data, the insights, from the Kafka topics and writing them as points into InfluxDB, in order to use InfluxDB as a data source for Grafana, the platform we are going to use to show some dashboards with this data in real time.

All of this pipeline will run on Kubernetes. In this demo it's going to be OpenShift, which is the Red Hat enterprise distribution of Kubernetes. We are going to use Strimzi for deploying the Apache Kafka cluster on Kubernetes really easily, because Strimzi provides native custom resources for describing your Kafka cluster and your Kafka topics, and it takes care of deploying the Kafka cluster for you with very little effort, even handling upgrades and things like that.

So let's jump into the details of all the pieces that make up this pipeline. First of all, ingesting the data from UDP, so from the Xbox. This is, as I already mentioned, a Camel application, which is made up of a first route that just uses a UDP component for getting the data from UDP. This is just a kind of meta-language here, but Camel is more or less these three instructions, where you just read from something, in this case UDP, and write to something. In this case I'm using a multicast in order to have three destination routes, because at some point I will write different data into three different topics in Kafka. There are three different routes because one route is just about getting the data from UDP and writing the raw packets directly to a Kafka topic, with no processing. The second one applies a filter: even in Camel, using its DSL, you can filter on something, and in this case I'm filtering events like speed trap and fastest lap, things like that. In the third one I'm going to do some simple aggregation. The UDP packets coming from the Formula 1 2020 game are packets where one packet contains the information about all the drivers. What we want is one packet for each driver with its telemetry data, so we aggregate over a specific window of packets, collecting the data for a specific driver and then sending it to a corresponding drivers telemetry topic. So this is a really simple way to write an application for doing this kind of stuff.
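To give an idea of what those routes look like, here is a minimal sketch in Camel's Java DSL, assuming the camel-netty UDP component and the camel-kafka component. The endpoint options, topic names, port, filter predicate and aggregation are illustrative placeholders rather than the exact code of the demo; in particular, the real application decodes the datagrams with the Formula 1 packet library and merges the telemetry per driver.

```java
import org.apache.camel.AggregationStrategy;
import org.apache.camel.builder.RouteBuilder;

// A minimal sketch of the ingestion routes (endpoints, topics and helpers are
// placeholders, not the exact demo code).
public class TelemetryIngestionRoutes extends RouteBuilder {

    private static final String BROKERS = "my-cluster-kafka-bootstrap:9092"; // placeholder

    @Override
    public void configure() {
        // Placeholder strategy: keep only the latest exchange; the real demo
        // merges a window of packets into one message per driver.
        AggregationStrategy latest = (oldExchange, newExchange) -> newExchange;

        // Listen for UDP datagrams from the game and fan them out to three routes.
        from("netty:udp://0.0.0.0:20777?sync=false")
            .multicast()
            .to("direct:raw", "direct:events", "direct:drivers");

        // Route 1: raw packets straight to a Kafka topic, no processing.
        from("direct:raw")
            .to("kafka:f1-telemetry-raw?brokers=" + BROKERS);

        // Route 2: keep only interesting events (speed trap, fastest lap, ...).
        from("direct:events")
            .filter(exchange -> isInterestingEvent(exchange.getIn().getBody()))
            .to("kafka:f1-telemetry-events?brokers=" + BROKERS);

        // Route 3: decode, then aggregate packets per driver before producing.
        from("direct:drivers")
            .process(exchange -> {
                // decode the datagram into per-driver telemetry and set a
                // "driverId" header (the decoding comes from the packet library)
            })
            .aggregate(header("driverId"), latest)
                .completionInterval(1000)
            .to("kafka:f1-telemetry-drivers?brokers=" + BROKERS);
    }

    // Placeholder predicate; the real demo filters on the decoded event type.
    private boolean isInterestingEvent(Object body) {
        return body != null;
    }
}
```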
The next step, and Tom will dig into it more deeply, is about processing the telemetry data coming from the drivers, and the example will be computing the average speed over the last five seconds. And finally, we have to move the data from the Kafka topics to InfluxDB, the time series database that acts as the data source for our Grafana dashboards. So once again, we simply use Camel instead of writing our own application. We consume from Kafka and write to InfluxDB with a simple DSL that processes the data coming from the three topics that we are using, the drivers one, the average speed one and the events one, and then writes it into the corresponding database, formula1, in InfluxDB, into the corresponding measurements, which are more or less the tables of our database. So processing and writing here. These are the three main pieces that build our demo. Let me hand over to Tom, who is going to explain how our Kafka Streams API application works.

Thanks, Paolo. So Paolo has spoken about the way we glue this pipeline together with Apache Camel, which provides that basic extract-transform-load functionality. But what you need to do once you've got your fantastic data into Kafka is actually do something with it: you need to enrich it, you need to get some value out of it. To do that we're going to use Kafka Streams, which is a Java library provided as part of the Kafka distribution. What it allows you to do is build a Java program that gives you a fully featured stream processing pipeline. So if we can go to the next slide, Paolo.

What we've got here is the first part of our pipeline, and what we're trying to do is calculate the average speed of each driver in five-second windows. To start with, as with everything in Kafka, you need to be able to tell Kafka what the raw bytes it's storing are, so that's what this setup does: it provides a way to deserialize the UDP packets, and that comes from Paolo's library. Once you've got that set up, you can connect to your Kafka topics, so that's what we're doing here: we create the first KStream. What we're going to do is window each driver's information into these five-second windows. So in operation one here we connect to the Kafka topic and we provide those serializers and deserializers. Then we do some filtering to filter out bad messages; that's just based on a predicate, so you can provide any function that returns a boolean, and in this case we've got something built into the library Paolo wrote that tells us if it's a bad message. Then we do a map operation, where we strip down all the detailed driver information in the POJO. All we care about is the driver ID and the speed, so at the end of step three what we've got is a stream of tuples, which is just driver ID and speed. Then, for the aggregation part, first we want to group those together per driver, so you've got collections per driver, and then you need to window those based on time.

In Kafka, every message has a timestamp attached to it, but in stream processing time is important, and there are three main types of time. There is processing time, which is the time that the message is actually looked at by the thread in the JVM that's doing an operation on it. There's ingestion time, which means different things in different systems, but in Kafka it's the time that the message is written to the log. And then there's event time, which is the time that's inside the packet that arrives at Kafka, so you can imagine this is the time on your temperature sensor or your weather sensor, or in this case it's the time that comes from the Xbox.
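As a rough sketch of what this first half of the topology could look like in the Kafka Streams Java DSL: the topic name is illustrative, and DriverTelemetry and DriverTelemetrySerde are hypothetical stand-ins for the POJO and Serde that come from the packet-decoding library. The five-second window that Tom describes next is included at the end of the chain.

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.TimeWindowedKStream;
import org.apache.kafka.streams.kstream.TimeWindows;

// Sketch of the first half of the topology (DriverTelemetry and
// DriverTelemetrySerde are hypothetical stand-ins for the library's POJO/Serde).
TimeWindowedKStream<String, Integer> buildWindowedSpeeds(StreamsBuilder builder) {
    return builder
            // 1. connect to the input topic, telling Kafka Streams how to
            //    deserialize the stored bytes
            .stream("f1-telemetry-drivers",
                    Consumed.with(Serdes.String(), new DriverTelemetrySerde()))
            // 2. drop malformed packets with a simple boolean predicate
            .filter((driverId, packet) -> packet != null && packet.isValid())
            // 3. strip the detailed POJO down to (driverId, speed) pairs
            .map((driverId, packet) -> KeyValue.pair(driverId, packet.getSpeed()))
            // 4. group per driver, then bucket into five-second tumbling windows
            //    (Kafka Streams uses the event time of each record by default)
            .groupByKey(Grouped.with(Serdes.String(), Serdes.Integer()))
            .windowedBy(TimeWindows.of(Duration.ofSeconds(5)));
}
```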
So what we're doing here is windowing into five-second windows, and the default in Kafka is to use the event time, so that's what we're going with. If you've ever done any windowing in stream processing, you'll also know that there are various different types of window: tumbling, hopping, sliding and session windows. What we're doing here is what Kafka calls a tumbling window, which is the simplest type of window, the default window, and all that is is a five-second chunk: when the five seconds are done it completes and it moves on to the next one. There are lots of other configurations you can do with windows, around grace periods and out-of-order events; we're not going to go into that now, but it's definitely worth looking at. So if we go to the next slide, Paolo.

What we've got here is the next stage of the pipeline. We've essentially got collections of our drivers' information in five-second chunks, so we have a bucket of five seconds, and in there we have a bucket for each driver with their speeds in. What we could do from there is simply another map operation that takes each of those collections and turns them into a count and a sum, and then do a reduction on that to get the final average speed. But what we're showing here is another aspect of the Kafka Streams library, which is the KTable. A KTable is essentially like a table in a database; it's a materialized view of your stream. This builds on the idea of table-stream duality, which is the idea that a stream is essentially a changelog for a table. If you think about word counts, you could have a stream that is simply the words, and then a table realization of that is the frequency of the words. So what we're doing here with the speeds is creating a KTable which just has the driver ID, count and sum, updated every five seconds.

Now, KTables allow you to be a source for other streams that you might want to build on top, but crucially they also back up your state. Behind every KTable is a RocksDB instance, and every change made to that RocksDB instance is streamed back down to a Kafka topic. That kind of state backup happens wherever you have state in a Kafka Streams topology, so earlier, when we had the windows, each of those windows and collections will also be backed up to a Kafka topic, and that provides you fault tolerance; it also comes in when we talk about scaling a bit later. So we've got this KTable here, and then what we do is take each of those rows, every time they're updated every five seconds, and stream them out to another table that simply does the averaging. At the end of this what you've got is a KTable instance that has driver ID and average speed. What we're not showing here is that the next operation is simply to send the changelog of that table, essentially every time it updates, back to a Kafka topic, which is the final Kafka topic that we then connect up to InfluxDB. So if we can go to the next slide, Paolo.
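Continuing the earlier sketch, the aggregation half could look roughly like this. SpeedStats is a hypothetical little class holding a running count and sum (with add() returning the updated aggregate) and SpeedStatsSerde its serializer; the output topic name is also just a placeholder.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindowedKStream;
import org.apache.kafka.streams.kstream.Windowed;

// Continuation sketch: aggregate each window into a per-driver (count, sum),
// derive the average, and stream the table's changelog to the output topic
// read by the Camel-to-InfluxDB route. SpeedStats/SpeedStatsSerde are
// hypothetical helpers, not part of the real library.
void buildAverageSpeed(TimeWindowedKStream<String, Integer> windowedSpeeds) {
    KTable<Windowed<String>, SpeedStats> stats = windowedSpeeds
            // 5. per window and per driver, keep a running count and sum;
            //    the backing RocksDB store is changelogged to a Kafka topic
            .aggregate(SpeedStats::new,
                       (driverId, speed, aggregate) -> aggregate.add(speed),
                       Materialized.with(Serdes.String(), new SpeedStatsSerde()));

    stats
            // 6. derive the average speed from the count and sum each time a row updates
            .mapValues(agg -> agg.getSum() / (double) agg.getCount())
            // 7. send the table's change stream to the final topic
            .toStream()
            .map((windowedDriverId, avg) -> KeyValue.pair(windowedDriverId.key(), avg))
            .to("f1-telemetry-avg-speed",
                Produced.with(Serdes.String(), Serdes.Double()));
}
```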
So now you've got your whole pipeline, you've written it, you've got your Java program. What do you do with it? Stream processing systems can be very complex, if you've ever used Apache Storm or Flink or Spark Streaming, but Kafka Streams is very straightforward: you just have a Java program. As long as you can provide it the connection and configuration information and it can see the broker, you can run your stream processing pipeline. So if we go to the next slide, Paolo.

One final thing I want to touch on here is how you scale it, because in stream processing scaling is important: you need to be able to cope with a high volume coming in and to do parallel processing. The Kafka Streams library is built on top of the consumer API in Kafka, so it has that intelligent client behaviour and it will balance work between the members of its consumer group. If you just start up one Kafka Streams application, then what you've essentially got is a stream from every partition on the input topic, so if we've got four partitions on the input topic, all the information will be round-robined into this one application. What's actually happening under the hood is that Kafka Streams still treats this as four separate tasks, essentially, by partitioning the work internally. So if we go to the next slide, Paolo.

What happens when you want to scale up is you simply start new instances of your application, and because it's just built on top of the consumer API, each new application is a new member of your consumer group, and the client API in there will rebalance the available partitions between your available applications. Now, the one thing you have to care about in stream processing is that state we were talking about: we've got the windows, we've got those KTables, so how do we now chop up the state that's inside there so it runs on different machines? Well, this comes back to how we back it up to Kafka. Because each of these tasks inside each of these applications is already partitioning that state, when they back up to the Kafka cluster that state is already separate. So if it's all running in one application you've got four tasks in one, looking at those topics; if you split it up across three applications you've got two tasks in one and one in each of the others, and they can just go to the particular partition of the changelog topic when they start up, recover their state and get processing. It's a very straightforward way to scale up and scale down, and it also provides that fault tolerance. So, if we go to the next slide, Paolo. That, I think, was a whirlwind tour through Kafka Streams, and what I'll do now is hand back over to Paolo to show you this wonderful pipeline working.
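Pulling the configuration and scaling points together, a minimal sketch of how such a Streams application might be configured and started could look like the following. The application id, bootstrap address and the buildTopology() helper are placeholders; the key point is that every extra instance started with the same application.id joins the same consumer group and takes over a share of the partitions and their state.

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

// Minimal sketch: configuration and startup for the Streams application.
public class AverageSpeedApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        // All instances sharing this application.id join the same consumer group,
        // so scaling out is just starting more copies of this same program.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "f1-average-speed");
        // Placeholder address; for a Strimzi-managed cluster this is typically
        // the bootstrap service, e.g. <cluster-name>-kafka-bootstrap:9092.
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "my-cluster-kafka-bootstrap:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // buildTopology() stands in for wiring up the filter/map/window/aggregate
        // steps shown in the earlier sketches.
        buildTopology(builder);

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }

    private static void buildTopology(StreamsBuilder builder) {
        // ... see the earlier sketches ...
    }
}
```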
Yeah, thank you very much, Tom. So it's now time for the demo. Let's see if the demo gods will be with us today, right? As you can see here, I have my OpenShift cluster running, and for deploying everything I have used the Strimzi operator. Strimzi, as I already mentioned, is a CNCF project that allows you to deploy and manage a Kafka cluster really easily. I already deployed everything, so I'm just describing my Kafka custom resource. If we take a look at the YAML here, you can see that you just describe your cluster in terms of configuration, in terms of listeners for exposing your cluster outside of OpenShift, the number of replicas, the version of Kafka to use, and things like that, everything about Kafka just as a native Kubernetes resource. So without dealing with Deployments or StatefulSets, the Strimzi operator will do everything for you. Having used the Strimzi operator for deploying, I have my Kafka cluster running here. I also have Grafana and InfluxDB running, the Kafka-to-InfluxDB Camel application and the Kafka Streams API application for processing the data.

And here, as you can see, we have some dashboards with some graphs to show the data that I'm going to produce. Of course, we don't have my son here right now to play the Xbox, so what I found really useful was a Python library that can record telemetry data from the Xbox, actually listening on UDP and storing this information in a SQLite database. On my screen here you can see a console where I'm going to use this Python script, which I used not only for recording a session that I did yesterday with my son playing, but mostly because you can also replay this telemetry data from the database. So let me start the replayer application to produce these events.

As you can see here on the Grafana dashboard, let me just move this to the other screen, data are coming through the system. We have the speed of the cars, the engine revolutions, the throttle, the brake, other information about the g-force, the damage on the front and rear wing, and even the events that I mentioned, like speed trap and fastest lap. There are other dashboards that we made, like for example one more specifically related to a single driver, where you can get different graphs with more information about throttle, brake, engine and speed, even plotted on the same graph, because for a Formula 1 engineer it's interesting to correlate all this kind of information, but also information about the brakes' temperature, the steering, the g-force again, and even the shape of your tires during the race, all useful information for improving your performance, right? The last one is about the Kafka Streams API application, where it's just showing the average speed over the last five seconds, and as Tom already mentioned, you can see this kind of spiky shape because we are using those five-second tumbling windows in the Kafka Streams application. So this is more or less how the demo works, and what we wanted to show you is how to build this kind of pipeline for events, and how to use cool things like Formula 1 telemetry data for showing that. Coming back to the slides.

Of course, you can play with this later on, after the session. There are some resources that you can use in order to play with it. There is a blog post that we wrote on the Grafana website explaining more or less what we just explained during this session, some links to the decoding library and the corresponding project with all the instructions you can use to run the pipeline locally or on a Kubernetes or OpenShift cluster that you have. There is also a video that I recorded with my son, so there is actually my son playing: you should see on one side of the screen the car in the Formula 1 2020 game being played, and on the other side the Grafana dashboards updating in real time. And because during this session we touched a lot of technologies, or we tried to give you some hints about the technologies that we are using, there are some links to, for example, the specification of the packets of the Formula 1 2020 game, links to Kubernetes and OpenShift, to Apache Kafka, to the Strimzi project, because I work on Strimzi and I really would like to hear some feedback from you on Strimzi if you are going to use it for something with Kafka on Kubernetes, and even about Camel for integrating all the systems that we are using.
And finally InfluxDB as the time series database and Grafana as the platform for showing all the data on the dashboards. So that's all from our side. Thank you very much for joining us. I don't know if we still have time or we have run out of time; if there is no time and no questions, we will be on Discord to answer your questions, if there are any. Thank you.

Okay, so I see there are some questions in the chat. Do you want me to read them for you? Okay, maybe the first one: how does Camel compare to Kafka Connect? So, they are different projects, right? Camel provides a lot of components and libraries, I guess more than 200 or even more, for connecting a lot of different systems. Kafka Connect provides some different connectors as well, but I don't think they are at the same kind of numbers, even if there is an interesting project from some of our colleagues to use Camel connectors with Kafka Connect, so adding to Kafka Connect some connectors built with Camel. With Kafka Connect you have to think about deploying a kind of new cluster, which is the Kafka Connect cluster, with different workers running these connectors and moving data across systems. Kafka Connect is really useful, for example, when you want to do change data capture on databases using the Debezium project, so moving data across databases. I would say that Camel is really useful when you want to integrate things like an HTTP application, or a UDP application in this case, or even some other messaging systems as well. So yeah, you can find different use cases for using them both, even together somehow at some point.