Thank you. Hi, everyone. My name is Guillaume Moutier. I'm a data engineering architect at Red Hat, and mostly what I do is help our customers, help organizations, set up data science platforms. And when you set up such a platform, sometimes the most difficult part is the data engineering part: all the infrastructure that you need, setting up your pipelines, and all the things that go further than just training a model and maybe deploying it. So today we'll talk about a use case scenario on the smart city, where we will see how to implement a data pipeline from edge to core.

First, we'll set the stage with a few words about why we are doing this. Oftentimes we find that data engineers can get good documentation on a specific product itself, but it's harder to find full documentation and simple code on how to compose end-to-end data pipeline solutions using many different products. So what we are trying to achieve with what we are calling the jumpstart library is to provide the productivity tools that engineers can use by combining reusable patterns that they will be able to put into motion in their own pipelines. And we try, of course, to illustrate the use cases with real-world business cases. What we are providing in each pattern of this jumpstart library is a functioning demo, the full Python code, of course the YAML files to deploy it onto OpenShift (you can sometimes adapt them to make it run on Kubernetes), sample data, sample machine learning models, all the automation that you need to deploy those models, and, of course, a fully documented do-it-yourself workshop guide so that you are able to reproduce the demo and learn how to implement these patterns for yourself.

Right now, we have two of them available, in different industries. The first one is in healthcare; it's a demo that maybe you have seen earlier this year or last year, about a fully automated X-ray diagnosis pipeline. And now we are launching, for this very event (it's the first time I'm presenting this), a Smart City, Green City pipeline example. And we will have at least three more of them coming this year.

Of course, we will deploy everything on top of OpenShift. For those who don't know, OpenShift is the Red Hat distribution of Kubernetes. And we will use many other tools, either from the Red Hat portfolio, such as OpenShift Streams for Apache Kafka and OpenShift Data Foundation, which is storage based on Rook and Ceph, or from lots of different open source projects, so that you can reproduce everything. In this demo, you will see that I am switching between the downstream products and the upstream projects. But as everything that Red Hat does is open source, at any point you can replace the downstream components, the ones from Red Hat, with their upstream equivalents, the fully open source versions of the tools.

So let's go with our scenario, the Smart City, Green City. This is something that we have prepared with my colleagues to show you what it looks like. We will start with the business needs, because this is how you would normally approach a solution in your organization: trying to see how you can solve your specific problem. So here we started this scenario with the city of London, where there is an area around the city with limitations on which cars can enter: the low emission zone.
And of course, there are some business needs that each and every city tries to address, like reducing congestion, reducing pollution, or locating wanted vehicles. What these three business needs have in common is that you need to gather data on, for example, which car is coming in, from where, and at what time. You have to gather this data so that you are able to analyze it, maybe train some models to react upon this data, and then adapt your city policy to achieve your business goals. So we'll focus on this part of acquiring data.

Our primary data pipeline for this solution pattern is this one. We will simulate cameras around London at different entry points. Of course, they are monitoring the traffic and taking pictures of cars. At the first stage, what we will do is recognize the license plate of the passing vehicle. We treat this as a two-stage model: we first extract the license plate image itself, and then, once we have this image with the license plate, we extract the number itself. So we will recognize the license plate of the passing vehicle, and of course we will enrich this information with the timestamp and the location of each license plate event. Basically: I have recognized this car passing at this station at this time. That's our basic raw information. And we will send this information to a data center in real time. Let's imagine we have the core data center with the command office, and we are sending all this data because, first, we want to notify officials in real time if there are wanted vehicles, for Amber Alerts, for child abduction, or if there are stolen vehicles. That is data that we want to process in real time. But of course, we also want to do some processing with the data for calculating fees for entering the city, or adding a dirty vehicle fee if needed. This is data that we will use more in a batch manner, like processing the fees every 24 hours, or, if we have historical data from the past few years, analyzing it to detect patterns and things like that. So that's our primary pipeline.

There is a secondary data pipeline that we can implement, which is more related to machine learning operations, MLOps, because of course we always want to retrain our models so that they stay accurate. At each tolling location, we will recognize the license plate of the passing vehicles, exactly the same thing, enriching the data with the timestamp and location. But this time, contrary to what we did before, we will sometimes also forward the license plate image itself as a random sample, and we will transfer everything to the data center again. There, as we are storing those images, we are able to check the accuracy of our model and eventually retrain it to improve our predictions. Once this model has been retrained, of course, we can send it back to each of our tolling locations, where the real-time inference is happening. So we are doing this for MLOps: we don't want to send each and every image back to the core, there's no need for that, because we are doing the inferencing at the edge. We are already recognizing the license plate at the edge, but sometimes we want to send the full package of data back so that we can do this retraining.

Okay, maybe I will say it again: this is a totally fictitious scenario, but you can see that it fits something plausible. It could be a real business case and a real use case, and this is how you might implement it.
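Just to give you an idea of what that first stage could look like in code, here is a minimal Python sketch of the two-stage recognition and the event enrichment. The detect_plate and read_plate functions are stand-ins for the two models used in the demo, and the field names are my own illustration, not necessarily the exact schema used in the repo.

```python
# Minimal sketch of the two-stage recognition and event enrichment.
# detect_plate() and read_plate() are stand-ins for the two models used in
# the demo (plate localization, then OCR); they are stubbed here so the
# example is self-contained and runnable.
import json
from datetime import datetime, timezone

def detect_plate(image_bytes: bytes) -> bytes:
    # Stage 1 stand-in: would return the cropped license plate image.
    return image_bytes

def read_plate(plate_image: bytes) -> str:
    # Stage 2 stand-in: would return the recognized plate number.
    return "G526JHD"

def build_plate_event(image_bytes: bytes, station_id: str) -> str:
    """Run both stages and enrich the result with timestamp and location."""
    plate_image = detect_plate(image_bytes)
    plate_number = read_plate(plate_image)
    return json.dumps({
        "event_timestamp": datetime.now(timezone.utc).isoformat(),
        "station_id": station_id,          # e.g. "A1"
        "license_plate": plate_number,
    })

if __name__ == "__main__":
    print(build_plate_event(b"<jpeg bytes>", "A1"))
```

The real models and the exact event format are in the jumpstart library code that I will show you at the end.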
Of course, there are many different ways to achieve the same goals, but this is the scenario that we settled on. This MLOps flow fits into this kind of reference blueprint for these flows, where we are first gathering data, then developing our machine learning model, then deploying it into an application, and of course we will always monitor and manage the model along its lifecycle. This is part of another presentation that we have at Red Hat around data science platforms in general; if you are interested to learn more about this, reach out and we can organize something.

Let's look at the technical architecture of our pipeline. Again, we will have this video stream coming in. Here, of course, I don't have cameras around London, so we will simulate the feed by sending a set of images that we already have. We will send those images to the license plate recognition model, the first-stage model that extracts the image of the license plate. Then there is another model that extracts the license plate number itself, and with the added data, the location and the timestamp, this becomes an event that we send to a topic in Kafka. So we have an instance of Kafka running at our edge location, and we are sending this data there.

We will then use Kafka MirrorMaker to transfer the data from our edge location to another Kafka instance at the core. We are using this pattern with MirrorMaker because it's a neat way to protect yourself from network issues and things like that, meaning that whenever there is a disconnection between the edge location and the core, our edge Kafka will still record data: we will still ingest the data, buffer it, and persist it in our Kafka queue. When the network is back again, through MirrorMaker, our central location can fetch the data, and it will restart exactly where it left off when the connection was lost. So that's a pattern we wanted to illustrate here, so that you see you can put these buffers in place along your pipeline to make sure you don't lose any data going from one point to the other. Okay, so that's the Kafka part.
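On the edge side, producing to the local broker can be just a few lines. Here is a hedged sketch using the kafka-python client, where the broker address, topic name, and delivery settings are assumptions for illustration; because the producer only talks to the local edge broker, a lost link to the core doesn't stop ingestion, and MirrorMaker catches up later.

```python
# Sketch of the edge-side producer, assuming the kafka-python package and a
# local edge broker; broker address and topic name are made up for the example.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="edge-kafka:9092",   # assumed local edge broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",                            # wait until the edge broker has persisted the event
    retries=5,                             # retry transient errors against the local broker
)

# The same kind of event we built in the earlier sketch.
event = {
    "event_timestamp": "2021-06-15T06:15:00Z",
    "station_id": "A1",
    "license_plate": "G526JHD",
}
producer.send("lpr-events", value=event)   # assumed topic name
producer.flush()
```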
We now have different listeners on our Kafka topic. First, we have a listener that is looking for a specific license plate: there is a vehicle that we want to find, for different reasons. So here, I'm listening in real time to my Kafka topic, and I'm displaying an alert if a wanted car has been found. We have another thing happening here: we can do ad hoc reporting using Superset, or we can have an engine for the toll processing and everything else that relies upon Starburst Presto, or Trino, its upstream version; in the demo I will show you, I'm using Trino. This Trino engine will fetch data from different sources. First, we have an S3 bucket, an object storage bucket. I didn't talk about this part yet, but there is another component that is listening to our Kafka topic. This component is called Secor. It's an open source project that was created by Pinterest, and it's pretty handy for any data engineering pipeline, because what it allows you to do is gather data from Kafka, aggregate it, save it in different formats, Parquet or whatever, and write it to an S3 backend. So that's a pretty neat tool to do exactly this: gather whatever events you are streaming into Kafka and persist those events into your object storage. We also have a PostgreSQL database here that simulates the vehicle registration database, where we have further information on a specific car, its model, its owner. And with Trino, we can query both of those data sources at the same time, and then display everything in different dashboards in Grafana.

Okay, and now it's time to show you the real thing. So here is my environment; if I switch, and maybe zoom a little for you, you can see that it's pretty crowded. There are lots of containers running, different Kafka clusters, and things like that. I will share with you the location of the code and everything if you want to have a look at it, and of course, as I said, we are providing all the tools and instructions to deploy this whole environment for yourself.

What you get at the end is a dashboard like this, so let me walk you through it. On the big map, you see the London map, obviously. What you can see is that we have our different stations, and we are directly displaying the number of cars that have been detected in the time period. Here in Grafana, I'm looking at the last 30 minutes; maybe I can change it to the last hour, and of course you see the numbers have changed. So that's what we are simulating in this dashboard: I have my data coming in, car images, the image itself is processed, the license plate is recognized, and then I am able to count the data coming in. That's exactly what you can see here. I have this small part of the dashboard where there is always a picture of the last detected car, and we can also see the detected license plate along with the model and the owner of the car. Because, if you remember, when I'm sending this information to Kafka, I'm only sending the license plate number with the timestamp and the location of the car. But of course, I'm enriching this data at the core with the vehicle registration data; that's what allows me to display some more information here. You can also see that the model itself is not really performant: sometimes it guesses the right number, but sometimes there are some discrepancies. That's why you want to put the other pipeline into motion, to constantly retrain your model and redeploy it at the edge each time it's retrained.

I have here this small panel where you can see the distribution for each station, each of the toll stations around London. We have a heat map with the number of vehicles that were detected over the period. So here you can directly see that, for example, at stations A1, A13, and A5201 there is some more traffic coming in. That's an example of how you could display this kind of information on a live dashboard for your operators, so they can monitor in real time how the traffic is going.

And finally, on this part, you have the wanted vehicle panel. It's a simple table where I am displaying the event timestamp with the license plate number and the station. Meaning that at 6:15 (yes, I'm based in Canada, it's quite early here), at 6:15 a.m., this license plate G526JHD has been detected at station A1. So that's the kind of information that you are also able to present in real time in your dashboard.
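That wanted vehicle panel is fed by the real-time listener I mentioned on the Kafka topic. To give a rough idea of what such a listener can look like, here is a hedged sketch with kafka-python; the watch list, topic name, broker address, and field names are all assumptions for illustration, and in the demo the notification goes to the dashboard rather than to standard output.

```python
# Hedged sketch of the "wanted vehicle" listener: a plain Kafka consumer on
# the license plate events topic. Topic name, broker address and field names
# are assumptions, not the demo's exact values. Requires kafka-python.
import json
from kafka import KafkaConsumer

WANTED_PLATES = {"G526JHD"}   # hypothetical watch list

consumer = KafkaConsumer(
    "lpr-events",                                  # assumed topic name
    bootstrap_servers="core-kafka:9092",           # assumed core broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event.get("license_plate") in WANTED_PLATES:
        # In the demo the alert feeds a dashboard panel; printing stands in
        # for that notification here.
        print(f"ALERT: {event['license_plate']} seen at station "
              f"{event['station_id']} at {event['event_timestamp']}")
```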
Of course, there are other things that you would want to display, and I've created some other examples. Here on this dashboard, we have a reminder of our workflow. The Smart City edge app is where we are gathering the video feed, or the pictures, that we are sending to the inference API. Once the license plate number comes back, we are sending it to Kafka, to our edge instance of Kafka. Then MirrorMaker is pushing this to the core, into our other Kafka instance, where the Secor component gathers this data and pushes it into our S3 bucket that is provided by Ceph. And what I'm showing here is the CPU consumption of those different elements. That's another aspect that you want to monitor; it's more of an IT ops dashboard. It's not directly related to the business needs, but to the monitoring of your pipeline, just to see if some component is consuming more resources. And here I also have the consumption of memory. All those numbers don't change much, but believe me, they are real time: those are the numbers fetched directly from Kubernetes, from the Kubernetes API, and displayed here in the dashboard.

Now, let's go to the last part of this demo. I will go to Superset. Superset, for those who are not familiar with it, is a dashboarding tool, similar to Tableau or Power BI, to create your own dashboards, except it's fully open source. You can also run queries with it, and here I have a saved query that I can use. I will open it, and maybe zoom a little bit for you. This query is quite simple: I want to retrieve the timestamp, the license plate, the make and model of the car, the owner of the car, and the station at which the car was detected. And what I will use here is my connection to Trino; that's the Trino database here. If you remember, I have this component in the architecture. Trino, again, for those who don't know, is a distributed SQL query engine that you can plug onto different sources. In my case, I have two different data sources in Trino. One is my S3 bucket, so you can query, you know, CSV or Parquet files directly stored in your S3 storage. And at the same time, I have another connection to my PostgreSQL database, the one with the vehicle registrations. And because I'm using Trino, I'm even able to do joins across those two heterogeneous data sources, S3 and PostgreSQL. So here I'm doing a join on those two tables: the events table, which has only the license plate numbers and timestamps, and the vehicle registration table, which has all the other information.

If I run it, it takes a few seconds, and you can see that I get exactly the result I wanted. You have the timestamp and the license plate number: at this time, this license plate has been recognized. I can see that it's a BMW 3 Series, the owner is Maria Harris, and it has been detected at station A13. Of course, don't worry, all this data is totally fake. Maybe there is a Maria Harris somewhere in the world, but this is definitely synthetic data that I'm generating here.
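By the way, you don't have to go through Superset to run this kind of federated query; the Trino Python client works too. Here is a small sketch of the same idea, where the host, catalog, schema, table, and column names are assumptions I'm using for illustration rather than the exact names from the demo.

```python
# Sketch of the federated join run through the Trino Python client ("trino"
# package) instead of Superset. Host, catalog, schema, table and column names
# are assumptions for illustration.
import trino

conn = trino.dbapi.connect(
    host="trino-service",   # assumed in-cluster service name
    port=8080,
    user="demo",
)
cur = conn.cursor()
cur.execute("""
    SELECT e.event_timestamp,
           e.license_plate,
           r.make,
           r.model,
           r.owner_name,
           e.station_id
    FROM hive.lpr.events AS e                          -- events persisted to S3 by Secor
    JOIN postgresql.public.vehicle_registration AS r   -- vehicle registration database
      ON e.license_plate = r.license_plate
    ORDER BY e.event_timestamp DESC
    LIMIT 100
""")
for row in cur.fetchall():
    print(row)
```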
Now let's have a look at what happens behind the scenes. This is my Trino overview, and I will just display the query that we ran; this is this one. As you can see, there is much more information on how the query was run: how much time it took, the consumption of resources, the number of rows that were extracted. You also have a live plan of your query. You can see that at stage two it fetched about 1,000 rows. Stage four is from the database, the vehicle registration; you see it's only 28 rows, because this is a simulation. Then it did the join, and you have all this information: about 72,000 rows went through the join, from which we extracted 1,000. But of course, all of this, the optimization of the query, the running of the query, is taken care of by Trino, or by the downstream version, which is Starburst Presto. From your perspective as a data engineer, you only have to connect to your Superset environment and you can run those kinds of queries. The good thing is that because Trino is a distributed SQL engine, it can have many different workers, meaning that you can query terabytes of data. Even if the query takes a few hours to complete, you can definitely launch it on Trino; that's how it works.

I guess that's all I have for the live demo, so let me just go back to my scenario, because it's always good, after seeing it live, to come back and see what we have done. Again, we had these simulated video streams with car pictures coming in, the license plate recognition model to extract the license plate itself, then the OCR model to extract the license plate number. The metadata and timestamp are added, everything is sent to Kafka, then mirrored to another Kafka instance in our core. The data is persisted into our object storage with Secor. We have alerts coming in directly from the listener on our Kafka topic. And then we have the engines that are able to query this data: the Trino / Starburst Presto distributed SQL engine, which you can access through Superset, and of course you can display all this data through Grafana. So that's what we had here.

This presentation will be made available following the event, and you will be able to look at the GitHub repo and have a look at all the code that is there, which you can use to reproduce the demo. I don't know if there are any questions; you can ask them directly in the chat. As we have a few minutes left, maybe I can directly show you this repo. Let me go to the jumpstart library, and again, I will zoom a little bit. It's under the Red Hat Data Services organization, the jumpstart-library repo. You can see that we are describing the patterns that we have. Pattern one was the X-ray analysis automated pipeline, and this one is our smart city scenario. If I go into this pattern, there is some more description, the full description of the scenario again. And if I go to the deploy folder, you have the whole guide to deploy everything: all the instructions, all the commands that you have to run, with all the code available for you in those different folders. In the source folder, you also have all the source for the different containers that we are using. It's mostly Python code in this case, so that you are able to see what's going on.

I see that there is a question in the chat: can you link these images with another system that can check if the vehicle is taxed to be on the road, and also if there is insurance? Yes, totally.
This is actually the last part of the demo that I didn't show you yet, because it's not fully baked. From all this data, there is an engine that runs to process the data and create the bills that need to be paid. It processes the different fees for a 24-hour period so that people can be billed. But of course, you could also link it to another system, exactly as you're asking, to see if there is insurance that has been paid or something like that. Conceptually, the way I would do this, if I go back to this slide here, is that at this point, once everything is persisted, or just by listening on the Kafka topic, I could have another engine that makes a quick API call to this other system to ask: okay, I've seen this license plate number, does it have insurance or not? You have to have a fast system if you want to do it in real time, obviously, but this is totally feasible conceptually. That's exactly the way you can enrich those systems.

That's what I like in this kind of disconnected pattern. And I'm saying disconnected because this part on the edge is something that happens on its own: the license plate number is sent, and that's it, the edge is totally autonomous. We could totally modify this part of the pipeline, change it, and make something else. And because I'm using Kafka, I can feed different systems from the same Kafka topic. I have Secor to persist the data into my S3 bucket, but I have another system here for the alerts, and I can have a third, a fourth, whatever number of systems I want, that are able to pick up this data and do something else with it, to process the data differently.

Another question: how long did it take from start to finish to get this demo up and running? To be honest, it takes quite a lot of time. A few weeks for three people; obviously we were not working 24/7 on this, so let's say 25% of the time of three people for three to four weeks. That's the amount of time it took us to create this. But we started from scratch. We had to find a model; we didn't create this model for license plate recognition, but we had to find one, package it, and prepare the dataset. We had to devise the whole architecture here, understand better how you use Secor to do this, and do all the configuration and the deployments for Trino, for Superset, for Grafana, and all the rest. So yes, granted, it takes quite a lot of time to set all of this up.

That's exactly why we are now sharing those recipes, so that people are able to take pieces of it, not necessarily the whole pattern. Not everyone works in a city council and wants to implement this exact same pipeline. But maybe you are only interested in one part: oh yes, this Kafka to S3 thing is really interesting; or, how do I deploy Ceph, or OpenShift Data Foundation, onto OpenShift to use it as S3 storage on-prem in my organization; or, how do I deploy Trino and hook it up to Superset? There are some little tricks here, but all of those tricks are now open source and available for everyone to consume, meaning that by picking up those different LEGO bricks, you can create your own pipelines from those patterns. And there are other small patterns available in the other demo that is also in the repo, the X-ray demo, where we are using S3 bucket notifications that trigger serverless functions in Kubernetes using Knative Eventing and Knative Serving.
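Coming back for a second to the question about checking tax or insurance, here is a conceptual sketch of what that extra listener could look like: a second consumer group on the same topic that calls out to a hypothetical registry API for each event. The endpoint, topic, and field names are all assumptions, not part of the demo.

```python
# Conceptual sketch of the enrichment idea discussed above: a second consumer
# on the same Kafka topic that calls an external system (a hypothetical
# tax/insurance registry API) for each detected plate. Endpoint, topic and
# field names are assumptions. Requires kafka-python and requests.
import json
import requests
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "lpr-events",                               # same assumed topic as the alert listener
    bootstrap_servers="core-kafka:9092",
    group_id="insurance-check",                 # separate consumer group, so both listeners get every event
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    plate = message.value.get("license_plate")
    # Hypothetical registry endpoint; replace with the real system's API.
    resp = requests.get(f"https://registry.example.com/vehicles/{plate}/status", timeout=5)
    if resp.ok and not resp.json().get("insured", True):
        print(f"Vehicle {plate} appears to have no valid insurance")
```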
So those are different patterns that you can use, and that's our goal here: to provide those different little recipes, little patterns, that people can reuse later on inside their own workflows. Okay, I guess we are reaching the end of this session. I will stay available to answer questions in the different chats of the event, if you want to keep asking questions or have discussions about this. And you can always reach out directly to me if you have any more questions regarding data science platforms in general, and especially data engineering pipelines, because this is what I'm...