Okay, I think we can start our next session. Hello everybody, welcome to DevCon 2022. My name is Andrey Veslov, and Pavel Yadlovsky and I are moderating this session. We will have time for your questions at the end of the session, so please use the Q&A section. Now I will let the speaker share his screen. Please begin whenever you're ready. Thank you.

Thank you, Andrey and Pavel. Hello everyone, a very good morning, a very good afternoon, and for a few unfortunate people like me in the eastern part of the world, a very good evening, and a very happy new year to everyone. Today in this talk we are going to discuss not only how to build an efficient data pipeline for anomaly detection, but also how to enable every one of you to go back home and build such a framework for yourself, so that you can try it. That is the end goal for this talk.

A bit about myself: my name is Tuhin Sharma. I work as a senior principal data scientist at Red Hat. I started my career at IBM in data science, founded my own startup, and worked at multiple companies before joining Red Hat last year. I'm very excited about this talk and about interacting with you; the whole idea is to learn together and grow. So let's start diving into it.

The outline is as follows. First, we are going to talk about the problem we are going to solve and the architecture we chose to solve it. We are going to talk a lot about real-time processing: the age of big data has matured, and we are now in the age of fast, high-velocity data, so we will look at how the architecture tackles that scenario. Then we will talk a bit about anomaly detection, go through different visualization techniques and tools, and at the end I will show you a quick demo. I actually open-sourced the code this morning, so I have included the link to the repository in the reference section of the slides, and once the talk is over I will share the link to the presentation in the chat as well.

So let's start: what is the problem we are trying to solve? The internet is exploding day by day. In 2019, an estimated 4.1 billion people were using the internet, with 5.3 percent year-over-year growth, and that was just before COVID broke out; as we have all been confined to our houses, we can imagine what internet usage must have been over the past two years. The global penetration rate increased from nearly 17 percent in 2005 to over 53 percent in 2019. That is massive growth.

So the internet is exploding; why should we even be concerned about it? With so many VMs, containers, different cloud providers like AWS, GCP, IBM Cloud, and Azure, and eventually the hybrid cloud, the combination of on-prem and public cloud, the threat surface is increasing exponentially day by day. And who is responsible for monitoring this and taking preventive measures? That is the security operations center (SOC). If you look at any big telecom company or any of the big cloud giants, they always have a security operations center.
They monitor the interactions and events happening over the network and try to identify any threat before it goes out of bounds. But the job is becoming very complex and very difficult because of multiple network policies, the millions and billions of network requests happening every second, and the exponential growth rate of the network itself. By the network, what I mean is different IP addresses talking to each other, different users talking to each other over the internet; that is the network I'm referring to.

So what are the three security challenges that bother the SOC analyst the most? First, denial of service: part of the network stops working because of the bombardment of requests it receives. Second, data loss: permanent or temporary loss of data from the network itself. And third, data corruption: a lot of malicious agents come into the network, live on it, and corrupt the data. These are the problems the SOC analyst is most concerned about; they monitor at the network level and try to catch these issues before any such disaster happens.

This problem is not new, and since the problem has been around for quite some time, solutions have existed too. So what were the existing solutions used to prevent these problems? The problem was solved in an interesting way: all the anti-malware companies maintain a big ledger, a big directory, a big database of malware. The malware could be a virus, a worm, a rootkit, spyware, ransomware, and so on. Whenever an agent, a malicious user, or an IP address behaving irrationally comes into the system, that entity's attributes are matched against this malware directory, mostly from a signature-matching point of view. If there is a match, that entity is quarantined; otherwise it is left free to roam the network. That is how the existing solutions worked.

But what was the problem? The problem is the evolving nature of the agents pushed into the network. Every day, millions of new malwares are created and injected into the network, and there are malwares programmed to evolve themselves over time. Once a new piece of spyware or malware comes into the network, or an existing one evolves into something new, it is practically impossible to identify it by looking it up in a directory or database of known malware. We cannot detect them, they go and do whatever they are supposed to do, and everything goes awry.

And that is why machine learning is gaining a lot of popularity for detecting these sorts of threats. The reason is that machine learning is not an entirely rule-based thing; it is not only signature matching, it is a behavioral thing. Once the model, the AI engine, knows about certain known malwares and their behaviors, it can correlate an unknown entity's behavior with them and figure out the probability, the propensity, of that entity being a high-, mid-, or low-level threat. That is the reason machine learning is gaining popularity in this domain.
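To make that contrast concrete, here is a caricature of the signature-matching approach just described; a minimal Python sketch assuming a SHA-256 hash directory, with purely illustrative payloads. An evolved malware changes its bytes, and therefore its hash, and sails straight through the lookup:

    import hashlib

    # A caricature of the known-malware signature directory (one illustrative entry).
    KNOWN_MALWARE_SHA256 = {hashlib.sha256(b"malicious payload").hexdigest()}

    def is_known_malware(payload: bytes) -> bool:
        # Signature matching: hash the payload and look it up in the directory.
        return hashlib.sha256(payload).hexdigest() in KNOWN_MALWARE_SHA256

    print(is_known_malware(b"malicious payload"))     # True: exact signature match
    print(is_known_malware(b"malicious payload v2"))  # False: the evolved variant slips through

A behavioral model, by contrast, would score the entity on what it does over time rather than on what it hashes to.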
And how exactly does it do that? If we look at network events, they are nothing but time-series data. Every interaction in the network, say an IP address sending a request to another user, has a timestamp associated with it. It may or may not have a file attached; it has byte information. Every event is timestamped. So if we can analyze those network events and find the anomalies in them, we can identify potential threats. We can pinpoint that this particular IP address has, for the past two hours, been behaving irrationally compared to the other IP addresses active in that time window; either something is wrong with it, or it needs to be quarantined. That is the thought process behind applying machine learning to this kind of problem.

So far, we have discussed that there is an ongoing challenge happening worldwide over the internet, there is a person responsible for solving it, that person faces certain challenges, and the existing solutions are not enough, which is why we need machine learning. Machine learning here is not brand new; it has been applied for the past two or three years and, as I said, is gaining popularity over time. That is the state we are in right now. The next step, over the coming slides, is to figure out how exactly we can design such a system and alleviate these problems before they affect us at mass scale.

So let's move on to the architecture. To give you some context before we dive into the slides: first we are going to look at the architecture, then at the sub-modules within it and how we came to choose different tools for those modules. We are going to follow a top-down rather than a bottom-up approach: see the architecture first, then dive deep into the individual components and figure out why we chose them.

The first view is purely from a data-flow perspective. Data sources are scattered across various points; for a telecom operator, they could be present in multiple private data centers as well as the public cloud. That is why I mentioned the hybrid cloud earlier; it is one of the challenges. Data is scattered throughout this landscape. So what type of data are we talking about? Different users, that is, the username or email ID through which people interact over the network, or the IP address. Then there is the network event stream, which is timestamped events. And of course the byte information present in each request, and if a file is being sent, that information too. Anything related to the network that is sent or exchanged between two entities, where the entities could be users or IPs, is what we call the source of data, and that is what we seek to capture to start this whole flow.
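As a concrete illustration, a single raw network event might look something like the record below; this is a hypothetical shape with illustrative field names, not the exact schema from the talk:

    # One hypothetical raw network event (illustrative field names).
    event = {
        "timestamp": "2022-01-15T18:42:07.123Z",  # every event is timestamped
        "src_ip": "10.0.0.17",
        "dst_ip": "93.184.216.34",
        "user": "alice@example.com",
        "bytes_sent": 48213,
        "file_attached": True,
        "request_type": "POST",
    }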
After the first step, identifying the data sources, the next step is to ingest that data: to make it platform ready so that it can feed into the various data pipelines we have in place. That is why the second step is data ingestion.

Since I'm talking about real-time network events, when we are detecting network threats we really cannot wait a day and then say, hey, yesterday morning there was a malicious user or IP, I detected it just now, but it's too late. Whatever malicious activity that entity was going to do, it has probably already done. That is why real-time ingestion is so important, and real-time processing too: we want to detect any such threat as soon as possible, before it goes out of control. Then there is batch ingestion: certain data, say the ISV data or something like that, doesn't change drastically over time and is not as time-sensitive as the network event stream, so it can be ingested in batches. And since each source has its own format, we need to normalize the data so the platform understands the format or schema. Then the data is platform ready.

Once the data is platform ready, the next step is data processing. We do data enrichment by including data from various external sources, we do data contextualization, and we extract and add different features so that the data becomes AI ready and we can train AI models on top of it. In short: data ingestion makes the data platform ready, and data processing makes it AI ready.

That leads to the fourth step, the AI engine. It has three parts. One is machine learning itself, where we learn the behavioral aspects of different entities and events. The second is incremental training, as sketched below. Say yesterday I processed 100 million records and trained a model on them, and today another 10 million records have come in by the afternoon. I should not retrain the model on the whole 110 million records; rather, I tell the model: you are already trained on 100 million records, so just refine your parameters based on the incremental 10 million that arrived today. That is incremental training.
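As a minimal sketch of that idea, scikit-learn's partial_fit API updates a model on new batches without revisiting the old ones; this is an illustrative example on synthetic data, not the talk's actual training code:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(42)
    clf = SGDClassifier(loss="log_loss")

    # Day 1: train on the initial bulk of records (simulated features/labels).
    X_day1, y_day1 = rng.normal(size=(100_000, 8)), rng.integers(0, 2, 100_000)
    clf.partial_fit(X_day1, y_day1, classes=[0, 1])  # classes required on first call

    # Day 2: refine the same model on only the new records,
    # without touching the day-1 data again.
    X_day2, y_day2 = rng.normal(size=(10_000, 8)), rng.integers(0, 2, 10_000)
    clf.partial_fit(X_day2, y_day2)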
The third part is active learning. The data we are dealing with, the main challenge we are trying to solve, is new malwares and the evolving set of threats, which have no fingerprint. In our case we don't know what they look like; it is unsupervised data, we don't know the labels beforehand. That is why active learning is needed: we keep training the model and keep identifying threats, and inevitably we will hit some and miss some. When we miss some, and later identify that these were threats we missed, we label them, and they keep getting added to the supervised set of data. Then we know: this is the behavior of a threat which the model missed, but now that we know it, we can actively learn from that data and improve the model. Those are the three main parts of the AI engine.

And the last step: we can do all these things beautifully in the data pipeline, but eventually the results need to be consumed in a human-readable format. That is why visualization is really important. And given the real-time nature of the system, we need a way to visualize the data in real time as well; that is why, when I talk about analytics, I stress the real-time aspect. From there we gain actionable insights and take actions on them. So that is the high-level data-flow architecture.

Now let's dive into the product architecture. On the left-hand side you see the data sources: data coming from multiple VMs or containers, or the network traffic data itself. It flows into the data engineering block, which contains data ingestion and data processing; ingestion makes the data platform ready, processing makes it AI-model ready. The tools we use for the data engineering part in this case are Kafka and Flink. We will look at the sub-modules, the real-time processing part, the machine learning part, and the data storage part, individually, see what options we had, and why we chose Flink over something else, in the coming slides.

After ingestion and processing, everything is constantly being synced to the data storage, which is Apache Pinot, an OLAP database. As soon as data arrives on a Kafka topic, it gets synced to Pinot. In Pinot we have the raw events, the ingested events, and the processed events, which can be used directly by the machine learning models. On top of that sits the AI engine: techniques like isolation forest, one-class SVM, K-means, and autoencoders, using tools like scikit-learn, PyTorch, and Spark MLlib, which take the data from Pinot, process it, and train the models. The models are stored in the MLflow model storage, where we can do model versioning, compare the same model trained on different volumes of data, and move a model from dev to stage to prod; all those nice capabilities live in MLflow, and that is why we use it as model storage. After the modeling, scoring is done on the profiles of these entities, and the scores are stored back in Pinot. And eventually Pinot powers the Superset real-time dashboard.
Superset is the tool we use for the real-time dashboard. Once we get to the demo, it will be pretty clear what the different components are and how they interact with each other. So this is the high-level product architecture for handling this kind of situation.

Now let's dive deep into the individual components. The main emphasis I want to give in this talk is on the real-time aspect, and there are three components on the real-time side. The first is Apache Kafka, a distributed event streaming platform. What we are going to talk about here is the difference between queuing and streaming, because alongside Kafka you also hear about Apache Pulsar and RabbitMQ. What are the fundamental differences between queuing and streaming, and why Kafka rather than RabbitMQ? What is the rationale behind choosing Kafka? That is what we will discuss over the next two slides.

When it comes to queuing versus streaming, the fundamental difference is the way they operate. A message queue is a producer-consumer model: you can have one or more producers and consumers, and multiple consumers listen on the queue, but the guarantee is that each message will be delivered only once. That is the fundamental concept behind a message queue. A streaming broker, on the other hand, is a pub-sub model: messages are organized in long, append-only logs, sometimes called topics, and one or more consumers subscribe to a topic or log to listen for data. With a proper setup, the same message can be delivered to all subscribers without any replication. So that is the fundamental difference: one is a producer-consumer model and the other is a pub-sub model. In a message queue, each message is delivered only once; once it is consumed, it is gone from the queue. In a streaming broker, the message remains available even after it is consumed. Also, a message queue may not deliver messages in the same order, whereas a streaming broker always delivers them in order, provided they are in the same partition. I'm not going to go very deep into how a streaming broker works internally, the partitions, the sharding, and all of that, but from a high-level perspective those are the major differences.

Before the next question, let's do a quick poll. In this context of event detection, what do you think would be a good way of handling the situation: a message queue or a streaming broker? If you can comment in the chat, that would be a good way for me to see how you are following along and whether this is understandable. I'll wait about 30 seconds; please put your answer in the chat, and then we can revisit the reasons for using either a message queue or a streaming broker in this context of network event detection and anomaly detection. I'll just wait. Yeah, somebody just said streaming broker. Yep.
I'll wait another 10 or 15 seconds. Okay. All right, since we are running short on time, because I also have a bit of a demo to do, I'll move on.

So yes, the answer is streaming broker, because what we are doing here is multiple things with the same data. The raw event comes in, and at a subsequent level, from the network data, we are going to create multiple profiles: profiles around the IP addresses, profiles around the user information, profiles from a location perspective. How many IP addresses are coming from a particular location? How many requests are being generated by a single IP address? All of these are derived from the same set of events. So it is much easier to follow a pub-sub model where we have multiple consumers, each with its own profile aggregation logic, reading the same stream of data but doing different things at the same time. With queuing, to do that we would have to incorporate additional data replication, because the core problem remains: once a message is delivered to a single consumer, it is gone. That is the main limitation of the message queue, and that is why we chose the streaming broker.

Then, why Kafka? These are the different features on which we did the analysis and compared the two. Message ordering is supported in Kafka but not in RabbitMQ; in Kafka we can simply read by timestamp. Message lifetime: in RabbitMQ, the message is just gone once it is consumed, but in Kafka it is retained; there is a retention period you can set, and you can even set it to forever. Delivery guarantees: Kafka guarantees atomicity, so when you consume a set of records, either the whole set is consumed or it fails as a whole, whereas RabbitMQ can have partial failures and therefore doesn't guarantee atomicity.

Hi, Andrey. Hello, Tuhin, I'm very sorry to interrupt you; maybe you have already noticed that we are running out of time, so unfortunately we have to finish. Oh, sorry, sorry for interrupting, your session is for 50 minutes. For 50 minutes, okay. Apologies. So please go on. Yeah, no problem, sure. Thank you.

All right. Then message priorities: these are not supported in Kafka but are supported in RabbitMQ. And if we look at performance with limited resources, Kafka reaches very high throughput, whereas RabbitMQ requires a lot more resources. So that wraps up the message queue versus streaming broker comparison, and that is why we chose Kafka for the data ingestion piece. A small sketch of producing to and consuming from a Kafka topic follows below.
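As a minimal sketch of that pub-sub pattern, here are a toy producer and consumer using the kafka-python client; the topic name, broker address, and group IDs are illustrative assumptions, not the project's actual configuration. A second consumer with a different group_id would independently receive the full stream, with no replication needed:

    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Producer: push JSON network events onto a topic.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("network-events", {"src_ip": "10.0.0.17", "bytes_sent": 48213})
    producer.flush()

    # Consumer: one of potentially many groups reading the same stream.
    # Another consumer with group_id="location-profiler" would also see
    # every message, unlike a message queue.
    consumer = KafkaConsumer(
        "network-events",
        bootstrap_servers="localhost:9092",
        group_id="ip-profiler",
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for msg in consumer:
        print(msg.value)  # feed into the IP profile aggregation logic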
Next is data processing. How do we process? Apache Flink is what we use in the architecture; it is a distributed real-time processing engine. Here, too, we are going to talk about two things you will always hear about with real-time processing: micro-batch processing and stream processing. And eventually we will get to why Flink and not Spark Streaming.

So what is the basic difference between micro-batch processing and stream processing? In micro-batch processing, the system only checks periodically for incoming data, when the next batch window occurs, and then processes the whole window. In stream processing, the system continuously monitors for new events and processes each one as soon as it arrives. That is why in micro-batch processing there is always a delay, and that delay has a floor of the batch window; if you build a data pipeline where each component works in a micro-batch fashion, those delays get added on top of each other and accumulate. With stream processing, the work is done as soon as possible, which is why stream processing is gaining so much popularity for real-time workloads.

So is micro-batch processing bad and stream processing good? It is not like that; it depends on the use case. Where having the most up-to-date data is not important and the tolerance for slower response time is a bit higher, for example offline analysis of historical data, identifying correlations, or finding patterns in the previous month's data, micro-batch processing, or batch processing in general, works well. But for something like financial transaction processing, where the cost of being wrong is very high, or real-time fraud detection, or real-time pricing, stream processing is the better fit.

So why Flink? Flink represents true real-time processing because it was the first true streaming framework with all the advanced features like event-time processing, watermarks, and so on. Of course, there was Apache Storm as well. On the mini-batch side there is Spark Streaming, which is not truly streaming; rather, it slices the whole stream into multiple chunks and then processes them with a batch-processing approach. When it comes to data latency, Flink processes data at the row level, the record level, whereas Spark Streaming processes data at the RDD level, the basic unit of data in Spark nomenclature. When it comes to adjusting parameters to make it work in your pipeline, Flink is mostly self-adjusting and there are not too many parameters to tune; Spark Streaming has so many parameters to tune that it is very hard to get right, and if you don't have an expert DevOps engineer who understands Spark really well, it is very difficult to get it working in your setup. A rough sketch of the kind of windowed profile job Flink runs in this pipeline follows below.
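This is a minimal PyFlink sketch of a 15-second tumbling-window aggregation per IP address, in the spirit of the profiles described later in the demo; the table names, fields, and connector settings are illustrative assumptions (and the Kafka SQL connector jar would need to be on the classpath):

    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # Hypothetical source table backed by the raw-events Kafka topic.
    t_env.execute_sql("""
        CREATE TABLE network_events (
            src_ip     STRING,
            bytes_sent BIGINT,
            file_sent  INT,
            ts         TIMESTAMP(3),
            WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'network-events',
            'properties.bootstrap.servers' = 'localhost:9092',
            'format' = 'json',
            'scan.startup.mode' = 'earliest-offset'
        )
    """)

    # 15-second tumbling-window profile per IP address.
    profiles = t_env.sql_query("""
        SELECT
            src_ip,
            TUMBLE_START(ts, INTERVAL '15' SECOND) AS window_start,
            SUM(bytes_sent) AS total_bytes,
            SUM(file_sent)  AS total_files,
            COUNT(*)        AS request_count
        FROM network_events
        GROUP BY src_ip, TUMBLE(ts, INTERVAL '15' SECOND)
    """)
    profiles.execute().print()  # in the real pipeline this would sink to Pinot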
One more point on that comparison: community support. The Flink community is not as big as Spark's, though it is growing at a rapid pace, while the Spark and Spark Streaming community is huge; there is hardly a data engineer who hasn't heard of Spark. Those are the differences between the two tools, and Flink is what we use because we wanted true real-time processing; that is why we chose Flink over Spark Streaming.

So, what have we covered so far? The data ingestion part, where we talked about Kafka and Flink: we capture the data from the data sources in Kafka and process it in Flink, which makes the data platform ready and eventually AI ready. Once the data is there, we need to store it somewhere, and that is the next component: Apache Pinot, a real-time distributed OLAP data store. So what is OLAP, and why Pinot? That is what the next slide covers.

OLAP tools enable users to analyze multi-dimensional data interactively from multiple perspectives. OLAP consists of three basic analytical operations: roll-up, drill-down, and slicing and dicing. Roll-up involves aggregating data that can be accumulated and computed over one or more dimensions; for example, all IP addresses rolled up to the port level or geolocation level. Drill-down is a technique that lets the user navigate down through the details; for instance, viewing the behavior of an IP address at the hourly level. Slicing and dicing is a feature where users can take out a specific set of data and view the slices from different viewpoints; these viewpoints are sometimes called the OLAP dimensions. An example would be looking at the same network traffic data by IP address, by geo, by date, et cetera. That is what OLAP is.

The next question, obviously, is: Pinot is an OLAP store, but why Pinot? First, it is blazing fast, with very good query latency. When it comes to the data, whether mutable or immutable, upsert is available in Pinot, which some other DBs don't offer. It supports pluggable indexing: you can plug in an inverted index, a star-tree index, a bitmap index, and so on. It supports real-time ingestion from all the popular brokers like Kafka, Apache Pulsar, and Kinesis, in formats like Avro, JSON, and Protobuf. It is horizontally scalable and fault tolerant. It has a query language called PQL, the Pinot Query Language, which is very similar to SQL and gives you the ability to select, aggregate, group by, filter, and distinct, all these operations. Joins are not available across Pinot tables, but there is a way around that using Trino or PrestoDB. And it has hybrid tables: there are two types of tables, offline tables and real-time tables. Data is constantly captured in the real-time segments, and gradually those segments are moved to offline segments, where high availability, data replication, and redundancy are all handled. If the data is not present in an offline segment, the query covers the real-time segment, and all of that is handled internally by Pinot itself. A small sketch of querying Pinot from Python follows below.
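For illustration, here is a minimal sketch of querying the aggregated profiles from Python with the pinotdb DB-API client; the host, port, and table and column names are assumptions for the sake of the example:

    from pinotdb import connect

    # Connect to the Pinot broker (default quickstart port assumed).
    conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
    cur = conn.cursor()

    # Roll up request counts and bytes per IP across the stored windows.
    cur.execute("""
        SELECT src_ip,
               SUM(request_count) AS requests,
               SUM(total_bytes)   AS bytes
        FROM ip_profiles
        GROUP BY src_ip
        ORDER BY requests DESC
        LIMIT 10
    """)
    for row in cur:
        print(row)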
There is also an interesting feature around Pinot: anomaly detection, via a tool called ThirdEye, which can run different anomaly detection algorithms on top of the data you have stored. But in this talk we are not going to cover the anomaly detection provided by ThirdEye; we are going to talk about the machine learning models that we designed and built ourselves.

So let's talk about anomaly detection: the journey, and the tools we used to build the AI engine as we designed the architecture. Our journey went like this. We first started with a very simple approach, simple rules: if, within an hour, a particular IP address sends more than a certain threshold of requests, or more than a certain threshold of files, mark it as an anomalous or irrational entity. Then we gradually moved to Z-scores: we captured various features, like the number of files sent over time and the number of bytes sent over time, calculated the Z-scores, and scored different entities based on them. That was not a really good way of doing it, but it was a start. Then we shifted to K-means, isolation forest, and one-class SVM, more sophisticated modeling approaches, but everything was still unsupervised; we didn't know what an irrational entity looked like. After that, once we had identified certain behaviors as definitively being threats, we got hold of semi-supervised data: a very small part of the dataset was labeled and a large chunk was still unlabeled, and we thought a deep autoencoder could be a good way of handling that situation. And eventually, once we had a good amount of labeled data, we opted for an RNN to solve the problem. So that has been our journey: unsupervised, to semi-supervised, and gradually to a supervised way of doing things.

The tools we used for this: for the one-class SVM and isolation forest, scikit-learn is where the state-of-the-art implementations live. For K-means on big data, Spark MLlib has the implementation, so that is what we used. And for the autoencoder and the RNN, we used PyTorch. If you look at this whole journey, there is not a single thing we used that was not open source; everything we chose to build this particular pipeline is open source. A minimal sketch of the isolation-forest stage follows below.
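As a sketch of that unsupervised stage, here is how an isolation forest might score 15-second IP profiles with scikit-learn; the feature layout and contamination rate are illustrative assumptions, not the production model:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Each row is one IP profile: [request_count, total_bytes, total_files].
    rng = np.random.default_rng(0)
    profiles = rng.poisson(lam=[20, 50_000, 2], size=(10_000, 3)).astype(float)

    model = IsolationForest(contamination=0.01, random_state=42)
    model.fit(profiles)

    # decision_function: lower scores are more anomalous; predict: -1 marks anomalies.
    scores = model.decision_function(profiles)
    labels = model.predict(profiles)
    print(f"{(labels == -1).sum()} profiles flagged as anomalous")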
So everything is great so far: the modeling and training are done. But we also need to manage the models, and MLflow is the tool for doing that. It is an open-source machine learning lifecycle platform. MLflow streamlines machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models, and it offers a set of lightweight APIs that you can integrate into any custom UI you build for model management.

It provides four things. First, MLflow Tracking: an API to log parameters, code, and results of machine learning experiments and compare them in an interactive UI; the UI comes built in with MLflow, so you can just use it. Second, MLflow Projects: a code packaging format for reproducible runs using Conda or Docker, so that you can share your machine learning code with others. Third, MLflow Models: a model packaging format that lets you easily deploy the same model for batch or real-time scoring on platforms such as Docker, Apache Spark, Azure ML, and AWS SageMaker, which also use MLflow. And finally, the MLflow Model Registry: a centralized model store with an API to access models, move a model through development, staging, and production, or archive it. So from a model-maintenance perspective, you are fully covered once you use MLflow, and that is what we used in this case. A small sketch of the tracking and registry APIs follows below.
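For illustration, a minimal sketch of logging and registering the isolation-forest model with MLflow; the experiment and model names are made up for the example:

    import mlflow
    import mlflow.sklearn
    from sklearn.ensemble import IsolationForest

    mlflow.set_experiment("network-anomaly-detection")

    with mlflow.start_run():
        model = IsolationForest(contamination=0.01, random_state=42)
        # model.fit(profiles)  # assumed trained on the profile features as before

        # Track the run: parameters and metrics show up in the MLflow UI.
        mlflow.log_param("contamination", 0.01)
        mlflow.log_metric("flagged_fraction", 0.011)

        # Package the model and register a new version in the Model Registry,
        # from where it can be promoted through staging to production.
        mlflow.sklearn.log_model(
            model, "model",
            registered_model_name="ip-profile-isolation-forest",
        )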
When it comes to visualization, Apache Superset is what we used; it gives us a near real-time dashboard. It provides three things. First, Explore: you can load the data into Superset and use the built-in, ready-made visualization charts to explore it. Second, View: you can create custom dashboards, which we will show in the demo, and monitor them continuously in a near real-time view. And finally, Investigate: if any of you have used a SQL-based data exploration tool, Superset's SQL Lab is similar, and you can use it to query the data and do everything right there. It supports multiple databases: Amazon Redshift, Druid, and a bunch of others; I actually ran out of space on the slide, so I didn't include all of them. Any DB that supports SQLAlchemy can be integrated with Superset, and the dashboard can be powered directly by the database itself; you don't need a separate REST API layer to expose the data for visualization, you can query the database directly and show the visualization.

All right, the next thing is the demo. For that I have to share my screen, but before we go there, let's discuss the demo setup a bit. This is the architecture we discussed, but the whole architecture could not be open-sourced, because the data was sensitive and a lot of custom development went into the models; so the model part is not open source, and the data sources are not open. What I did instead is create a dummy data generator, and since there is no point in doing machine learning on dummy data, the AI engine just assigns dummy labels to the different profiles. Everything else remains the same; the whole framework is open source, the link is in the slides, and I will share the slide link after the talk. So this is the setup.

There is a dummy data generator which generates data in JSON format and pushes it to a Kafka topic. From that Kafka topic, the data goes to Pinot, where the raw events get stored. Flink reads from the Kafka topic and creates profiles, in this case a 15-second profile of each IP address: for the last 15 seconds, how many files it sent, how many bytes it sent, how many requests it made. All of that is done by Flink, and that data is stored in Pinot as well. The AI engine reads it and assigns a dummy label of low, medium, or high to each IP address, just to simulate the AI part; since there is no real model, I kept the MLflow part out of it too. And finally there is the real-time dashboard, which is also part of it. The whole codebase is Dockerized, so all you have to do is clone the repository, run four commands, and everything will run on your system. We'll see that now.

Sorry to interrupt you, but you really have four minutes left until the end of the session.

Okay. So this is the raw event; that's what it looks like, timestamped at the IP level. And the aggregated IP profile looks like this: a 15-second time window, and for each IP address we calculate the number of bytes sent, requests sent, and files sent; the Flink job does that. So let's go to the repository; this is where I have put the code, and all you have to do is run these few commands. I already have the whole repository on my system and have built the images, so I'll just bring up the services here. First I bring up all the services; once they are up, we can see Kafka, Flink, Pinot, and Superset. As you can see, it is getting started, so let's check the status. You can see these two containers are for Kafka; this is the Flink job manager; Pinot has three containers; Superset has one; this is the task manager for Flink; and we have ZooKeeper coordinating all the big data tools here. So those are the different service components. Let's see what they look like in the web UI: Kafka, then Flink, then Superset. Okay. All right, I think one of the containers exited; there must be some issue. Let me restart Docker first, and once it has started, I can remove this part. Okay, it is done. Let me quit Docker, one second. Sorry.

Unfortunately, we have to stop this session because we are out of time.

Yeah, I'm really sorry, I ran out of time. But that was it. What I can do is record the demo, put it on YouTube, and add the link to the slides, so that you can see the whole demonstration there and be able to run it on your own.

That would be great. And you can also join the WorkAdventure platform and continue the discussions there.
So thanks for your session, it was very interesting. Yep, sure, thank you very much. Bye, bye. Have a nice evening.