All right, thank you very much to everybody for joining this talk on real-time machine learning with Python. There's a lot of really interesting content to go through and quite a few topics to cover, so please hold on to your seats. It's a very interesting area, and I would also love to continue the discussion after the talk to keep exploring best practices. You can find my Twitter through the link in the corner. I'm going to be publishing the slides and the code later today, so if you want them you can get them through my Twitter, and my GitHub handle is the same, so you'll be able to access them there as well.

A little bit about myself: my name is Alejandro. I'm the engineering director at Seldon Technologies and the chief scientist at the Institute for Ethical AI and Machine Learning. I'll talk a bit more about that in a second, but primarily my background is as a practitioner, a software engineer across hands-on and leadership roles, currently leading a set of teams working on open source machine learning. More specifically, at Seldon, Seldon Core is our open source production machine learning deployment framework for Kubernetes. We focus primarily on the life cycle of models once they have been trained, and all the complexities that come with productionizing machine learning. A lot of what we're going to talk about today involves the best practices we've identified when performing real-time, as well as just general large-scale, machine learning deployment, although today we're also going to dive a little bit into the training side.

As for my role at the Institute for Ethical AI and Machine Learning: this is a research centre that focuses on developing work towards responsible machine learning, primarily through contributions to ongoing standards, as well as working groups active in these verticals. Recently, one of our contributions was to the Association for Computing Machinery, the ACM: we published a statement on contact tracing applications, primarily covering the technological implications and best practices. If you want to check it out, please do; any thoughts are appreciated. The Institute is also part of the Linux Foundation, and we're quite excited that a lot of our contributions sit at the intersection with open source.

Today we're going to dive specifically into a conceptual introduction to stream processing, and streams generally. We're also going to talk about machine learning in the context of real-time processing, as well as some of the trade-offs across the different tools and frameworks you can leverage, and we're going to cover a hands-on use case that applies all of these things. As I mentioned, if you have questions as the talk goes on, please ask them in the chat; I'll stop and address questions at the end of each section of the presentation.

In regards to the use case, we're going to be doing real-time processing on a Reddit dataset. Reddit, as we all know, calls itself the front page of the internet. What we have is a dataset of Reddit comments from r/science that have been removed by moderators, and we're going to build a model using this dataset.
We're going to be aiming to fix the front page of the internet, because we can all agree the internet is not full of just positive stuff. At the same time, there are some interesting things to see in the exploratory data analysis of this dataset; we won't have time to do all of that here, but after the talk you can dive into the data exploration in the Jupyter notebooks.

Before diving into streaming, we need to take a trip to the past. Really, it's not even a trip to the past, because this is the present, and the present is ETL. What is ETL? As you can see in this architectural diagram, it's the concept of extract, transform, load. That basically means you take a dataset, you do something with it, and then you put it somewhere else. This is how the data processing world has worked since the Stone Age: it's not about real-time events, it's about taking a batch, doing something, and putting it somewhere else. The reality is that the world still largely functions this way, even as we move more towards real time. I want to dwell on the concept of ETL because it's still so prevalent, and the complexity doesn't stop at this one three-letter acronym: you also have ELT, you have EL, and tons of different combinations of this stuff. These aren't all just terms that people throw around because they feel like it (although sometimes that is the case, we all know); these terms actually refer to patterns of data processing that are present in industry, and the separation is still important. There's a large range of tools available at your disposal for each of these patterns, ETL, ELT, EL, et cetera. Of course, there are also plenty of tools that are less useful, but many of them are very good in specific areas and are best when facing specific challenges.

More specifically, some of the tools that fall into extract and load, rather than transform, are things like Flume or NiFi (Apache NiFi). Flume basically allows you to move data between different databases. With NiFi, as you'll see with other tools, the division tends to be ambiguous, because NiFi also covers a lot of the transform step: although it's used a lot to move data around, it can also be used to actually transform it. Extract load transform, ELT, is more the data warehousing perspective: you load your data into something like Elasticsearch or a data warehouse, and once the data is loaded, that's when you actually transform it. And then there's ETL itself, the more common extract transform load, which encompasses all of the different steps; these are tools like Oozie or Airflow, if you've heard of those.

Today we're going to be using a good old ETL step, where the transform step is a human with a Jupyter notebook. That's going to be, in our case, our non-real-time process, and what we're going to be doing in the context of ETL is the training of our model. The reason I'm talking about ETL and batch is as a step towards this concept of streaming.
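To make that concrete, here is a minimal sketch of what such a batch ETL step might look like in Python with pandas. The file names and column names are hypothetical, purely for illustration; this is the kind of transform a human with a Jupyter notebook would run, not the exact code from the talk.

```python
import pandas as pd

# Extract: load a raw dump of Reddit comments (hypothetical file and columns).
df = pd.read_csv("reddit_comments_raw.csv")

# Transform: basic cleaning, e.g. drop empty comments and normalise case.
df = df.dropna(subset=["body"])
df["body"] = df["body"].str.lower()

# Load: write the cleaned batch somewhere a downstream job, such as the
# model training step, can pick it up.
df.to_csv("reddit_comments_clean.csv", index=False)
```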
And the reason why is because batch and streaming ultimately get brought up in the context of: is it one or the other? When should you use one over the other? In reality, what we're starting to see is that, of course, you have some tools that specialise in the streaming world, and you'll want to leverage them in the context of eventing or real-time processing, but there is a larger and larger intersection of both concepts, and even more of an overlap as tools try to unify them. More specifically, you often find yourself actually needing both. Today, for example, we're going to train one model, which is not going to be real-time, but then we're going to put that model in production for stream processing. It goes back to that massive ecosystem we were talking about: it's not about running around with a hammer trying to find a nail, but instead trying to understand what the right tool for the challenge is.

As I was saying, right now there's a massive trend within the streaming world to unify both worlds, batch and streaming. The streaming community is trying to bring in the idea that, when it comes to stream processing, taking a batch of data should be exactly the same as streaming that data: if you take a file and pass it through a stream processor, that should fit within the same framework. And at the same time, when you take a real-time stream, say events coming out of a system, you may want to take batches from it for processing. Tools like Apache Spark actually do this, using the concept of micro-batching in their stream processing. In case you're not familiar, Apache Spark basically allows you to process large-scale datasets: it's able to distribute horizontally a massive dataset that may not fit in memory, process it, and get some insight out. Spark Streaming, however, takes chunks of the stream and sends them to Spark so that it processes them in micro-batches. And the streaming community basically says: that's not real streaming, it's fake streaming, that doesn't count. I do have to say that recent versions of Spark's streaming engine introduced a continuous processing mode that handles every single data point as it arrives, as opposed to micro-batches, so that's just one thing to mention. What's interesting here is that there's a lot of push towards this convergence of worlds, and we're going to cover it in a little more detail in a second.

Before diving into that, I want to go through some of the concepts that are specific to streaming. When you get a stream of data and you're processing that data, there are new concepts that are quite common in that community but are important to gain as a foundation, so that when you approach a challenge you can use those best practices. We'll look at a small Spark sketch below, and then walk through the key concepts, starting with one you've most probably already come across: windows.
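As a concrete taste of the micro-batch style mentioned above, here is a minimal, hedged sketch using Spark Structured Streaming. It assumes a local Kafka broker, a hypothetical topic name, and that the Spark Kafka connector package is on the classpath; it simply echoes each micro-batch of messages to the console.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("microbatch-sketch").getOrCreate()

# Treat a Kafka topic as an unbounded table; Spark pulls new records
# and processes them in small batches (micro-batching).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed local broker
    .option("subscribe", "reddit-comments")                # hypothetical topic
    .load()
)

# Decode the message payload and print each micro-batch to the console;
# the console sink labels the output "Batch: 0", "Batch: 1", and so on.
query = (
    events.selectExpr("CAST(value AS STRING) AS comment")
    .writeStream.outputMode("append")
    .format("console")
    .start()
)
query.awaitTermination()
```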
The first of these concepts is windowing. This is basically the ability to take a specific window of data points and perform a computation, potentially a stateful computation, across that specific batch. You can have tumbling windows, which are non-overlapping windows, or you can have sliding windows, where the windows you gather do overlap. This is where the point about unifying worlds comes in: you're dealing with a stream, but you can convert that streaming problem, when necessary, of course not always, into a micro-batch problem, specifically when you want to gather, say, an average across every second, or something like that. One of the things we're going to cover in two slides is the time of the events themselves: you may want the sliding or tumbling windows keyed on the time events arrive, but perhaps you want these windows on the original event time instead. Remember that the time an event arrives is not the same as the time the event was actually generated, so perhaps your windows are one second, not of arrival time, but one second of actual event time. Bear that in mind, because we're going to cover it in a couple of slides and it's an important detail.

The second concept is checkpoints. Checkpoints are like what you have in a game: when you reach a checkpoint and then die later on, you return to that checkpoint. It's not exactly the same, but it's an intuitive way to understand the concept, which is that you're able to keep track of where each consumer last read. This is an important concept in streaming because when you have a consumer, you can also have a consumer group: multiple consumers reading from a queue, where you may want the checkpoint to apply to the entire group. The other important thing with checkpoints is understanding when you actually mark them. Are you marking the checkpoint at the moment you read a message, or at the moment you've actually processed it? Ultimately, this determines the delivery guarantees you can get in stream processing: at-most-once delivery, at-least-once delivery, and exactly-once delivery. As anyone who has worked with stream processing before will know, exactly-once is the hardest; it's a really hard challenge that has to be assessed carefully and thought through well when implemented. Some stream processing frameworks offer it, of course with some constraints and assumptions that need to be in place; Apache Kafka, for example, allows you to configure these semantics, at-least-once, at-most-once, et cetera. As you approach your use case, it's important to bear in mind which one to use.
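To make the windowing idea concrete, here is a minimal, self-contained sketch of event-time tumbling windows in plain Python. The events are hypothetical (event time in seconds, value) pairs, purely for illustration; this is not any framework's API, just the underlying idea.

```python
from collections import defaultdict

def tumbling_average(events, window_size_s):
    """Average `value` over non-overlapping, event-time windows of
    `window_size_s` seconds. Sliding windows would instead advance the
    window start by a step smaller than the window size, so windows overlap."""
    buckets = defaultdict(list)
    for event_time_s, value in events:
        # Assign each event to the window containing its *event* time,
        # regardless of when it arrived at the processor.
        window_start = int(event_time_s // window_size_s) * window_size_s
        buckets[window_start].append(value)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}

# Hypothetical (event_time_seconds, score) pairs.
events = [(0.2, 1.0), (0.8, 3.0), (1.1, 5.0), (2.5, 2.0)]
print(tumbling_average(events, 1))  # {0: 2.0, 1: 5.0, 2: 2.0}
```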
Now, the other key point to cover is the concept of watermarking. We talked about processing time versus event time before, and watermarking basically says: if you're doing a window computation, say you want the average of some attribute every second, how long are you willing to hold a buffer so you can count events that arrive late? You're processing windows not on processing time but on event time, so if an event that was supposed to arrive five minutes ago arrives right now, and your watermark is seven minutes, or anything above those five minutes, you can still take that event, bring it into its window, recompute the window, and use the result in your window processing. Watermarking is an important concept primarily because it allows you to consider events that arrive late, and again, it's a component that has to be taken into consideration when building your stream processor. Some frameworks provide it out of the box; with others, you have to build it.

Some of the tools available at your disposal for stream processing include things like Flink, Kafka Streams, Spark Streaming, Faust (a Python library), and Seldon Core, which is what we're going to use in this specific context.

But before we talk more about the processing, let's talk about the thing being processed, and for that we're going to look at the machine learning workflow. This basically consists of two phases. The first revolves around training a model: you get some data you want the model to learn from, and you convert it into features the model can actually use; then you iteratively train the model until you're happy with it. Once you're happy with the accuracy or performance metrics, that's when you can deploy the model, and new unseen data can be processed by the model you've trained. In our use case, we're going to use the toolsets scikit-learn and spaCy, take the dataset of Reddit comments, and predict whether each comment would have been removed by the moderators.

For our model to learn from and predict on this data, we first take the text as it arrives and pass it through a text cleaning component, then a tokenizing component (we'll see what that means; it uses spaCy), then a vectorizer component, and finally a logistic regression machine learning model. What does that look like in more practical terms? If we have input data that says "you are a dummy", it first goes through our text cleaning, which, as you can see, removes the symbols. Then it goes to the tokenizer, which uses spaCy to convert the string into actual tokens; in this case something like a pronoun placeholder followed by "be" and "dummy". From that point it passes through the vectorizer, which converts the tokens into vectors the model can learn from, and then to the model itself so we can train it; in this case it predicts true, the comment should have been moderated, or false, it should not have been. So this is basically the pipeline that we have trained.
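As a rough illustration, here is a minimal sketch of such a pipeline with scikit-learn and spaCy. This is not the exact code from the talk: the TF-IDF vectorizer is just one plausible choice of vectorizer, it assumes the en_core_web_sm spaCy model is installed, and the two training examples are made up.

```python
import re
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed

def clean(text):
    # Text cleaning step: strip symbols, keeping word characters and spaces.
    return re.sub(r"[^\w\s]", " ", text).lower()

def tokenize(text):
    # Tokenizing step: lemmatise with spaCy, dropping stop words.
    return [tok.lemma_ for tok in nlp(clean(text)) if not tok.is_stop]

pipeline = Pipeline([
    # Vectorizer step: converts token lists into TF-IDF vectors.
    ("vectorizer", TfidfVectorizer(tokenizer=tokenize, lowercase=False)),
    # Model step: logistic regression predicting moderated (True) or not (False).
    ("model", LogisticRegression()),
])

# X holds comment strings; y holds whether each was removed by moderators
# (toy examples, for illustration only).
X = ["you are a dummy", "great insight, thanks for sharing"]
y = [True, False]
pipeline.fit(X, y)
print(pipeline.predict(["what a dummy"]))
```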
If you're interested in more about this dataset and the model, we have a Jupyter notebook that delves into the exploratory data analysis, and it has a lot of really interesting insights into the dataset we used; you can explore it there in more detail.

Now that we have a trained model, the question is how we go from the model weights, or the code we just created, to a fully fledged microservice with a RESTful API, a gRPC API, or in this context a Kafka interface, producing metrics and logs, with all the components a microservice would require. This is where we're going to use Seldon Core, a framework that allows us to convert a piece of code or an artifact into a fully fledged microservice. The general steps to containerize code are: first, encapsulate it in a Python class with an initialization method, where it loads the actual code or model, and a predict function, which is where we call our model. Every input request you send is passed to this function, and whatever it returns becomes the response of the microservice API. Ultimately, we're able to use the Seldon utilities to convert this wrapper into an actual Docker image that can be deployed into a Kubernetes cluster, where it can receive REST requests or, in this case, consume from the Kafka streaming topics. If you're curious, the link below points to a full example of how to containerize this model, a Jupyter notebook that guides you through it.

In the wrapper we built, if you remember, we trained a couple of models; the initialization method just loads the pickle, or in this case the dill artifact, of the trained model we created. In the predict function, whenever you send a request, or whenever a new data point is added to the Kafka topic, the input passes through each step of that pipeline, the text cleaner, the tokenizer, the vectorizer, and the model, to return a prediction. This builds upon all of the different concepts we touched on earlier in the slides.
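Here is a minimal sketch of what such a wrapper class can look like. The artifact file name is hypothetical, and the predict signature follows Seldon Core's Python wrapper conventions, which may vary slightly between versions.

```python
import dill

class RedditModel:
    def __init__(self):
        # Load the trained pipeline artifact created in the training step
        # (hypothetical file name, serialised with dill).
        with open("reddit_pipeline.dill", "rb") as f:
            self._pipeline = dill.load(f)

    def predict(self, X, features_names=None):
        # Each REST request, or each new message on the input Kafka topic,
        # flows through the full pipeline: cleaning, tokenizing,
        # vectorizing, and the logistic regression model.
        return self._pipeline.predict(X)
```

In practice you would point Seldon's wrapping tooling (for example, its s2i-based image build) at a class like this to produce the Docker image that gets deployed.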
And now, okay: we have trained our model, we have containerized our model; how do we deploy it? Let's look at the actual architecture. We're going to use a Kafka queue with an input topic and an output topic, and as we deploy our Seldon model, it will consume from the input topic using the Kafka native interface and produce its output into the output topic. Again, you can find the third and last example, showing how to use the Kafka interface, in the notebook.

What we're going to see here is how we actually deploy this. With Kubernetes, as you may already be familiar, you deal with YAML definitions, and Seldon provides you with a YAML definition, in this case a custom resource. Here we define the name of our deployment, reddit-kafka; we define that it uses the Kafka protocol; and we specify the NLP model, the image that we built in the previous slide with the toolsets. Ultimately, we specify how it talks to the Kafka cluster: the input topic, the output topic, and the brokers. And finally, you specify that this model is the one being served. With Seldon, you can define complex graphs with multiple nodes, et cetera, but in this case we're only using a single-node deployment.

Once we apply that, we're able to produce some data, in this case using the Kafka console producer. As we pass some input text, it goes into the input topic, through the Seldon model, and is produced to the output topic, which we can visualize using the Kafka console consumer. What we get out is all of the processed data points that have gone through the model. And even though we've covered a broad range of components, all of this is available for you to try out in separate steps: if you're more interested in the machine learning side, you can delve further into that, and likewise if you're more interested in the deployment side; it's all covered.

Now, the key thing about this is scale: being able to scale this architecture horizontally and vertically. As you know, Kafka can scale in terms of the number of brokers, which allows for massive throughput. Similarly, Seldon allows for horizontal scalability, which means you can have multiple replicas in a consumer group, all ingesting from the input topic, ensuring that if one of the brokers dies, the remaining brokers keep serving, and if one of the microservice replicas dies, the remaining replicas in the consumer group keep consuming. Again, this is very much a high-level overview of the intuition, but you can delve into the examples provided in all of the links below. And just as a reminder, the slides are available at this link, bit.ly slash Seldon Kafka, and they also contain the links for you to access all of the examples.

With that said, today we covered a conceptual introduction to stream processing; we delved into the machine learning model and its training; we covered some of the tools available; and then we delved into both the containerization and the deployment of the Kafka pipeline. Thank you very much to everybody who joined this talk, and I look forward to taking questions. If anything, feel free to contact or reach out; I'd be more than happy to provide any insights. As I mentioned, all of these examples are open source, together with the underlying projects, so if you think of any potential improvements or anything else on your mind, feel free to open an issue on the respective GitHub repositories. And again, thank you very much, and I look forward to delving further into the discussion.