Fantastic. So Alejandro Saucedo, for anyone who hasn't already read his bio, is the Chief Scientist at the Institute for Ethical AI and Machine Learning, where he leads the development of industry standards on machine learning bias, adversarial attacks, and differential privacy. So this is bound to be an awesome talk. And he actually has two talks going on at this conference, which is awesome. This one is on real-time stream processing for machine learning at massive scale. So I will turn this over to you, Alejandro, and have fun.

Awesome, amazing, thank you very much. And yeah, thank you very much for the great introduction. As you highlighted, tomorrow I am also doing another talk, the keynote, which focuses more on the topics you mentioned around the responsible development of machine learning and software development more generally. But today we're going to be diving into real-time machine learning with Python: a practical deep dive into a very popular, challenging, and complex topic.

A little bit about myself. My name is Alejandro. I'm the Engineering Director at Seldon Technologies and Chief Scientist at the Institute for Ethical AI and Machine Learning. I'm also a Member at Large at the ACM. To give you a little more context on my current work: Seldon is primarily an open-source company. Our main open-source project has over two million installations, massive growth, and a large user base; please go check it out on GitHub. Stars are a vanity metric, but it has plenty of those too. We work with quite a lot of large organizations, mainly on the productionization of machine learning, hence today's topic. The Institute is a research center that focuses primarily on the responsible development of machine learning systems through contributions to standards and regulatory frameworks. We're part of the Linux Foundation, which is quite exciting for us, primarily because it allows us to contribute to a lot of open-source initiatives from a very high-level but also pragmatic perspective.

Today we're going to cover a conceptual introduction to stream processing. We're going to delve into the concept of machine learning in real time and how that in itself is potentially slightly different. We're going to dive into some of the trade-offs across the tools available, and then we're going to cover a hands-on use case that will let you get up and running with this topic. Specifically, what we're going to be doing today is fixing the front page of the internet. What is the front page of the internet, you say? Basically, Reddit. We're going to be doing some real-time machine learning on Reddit comments. We have a playground with 200,000 comments that we will use to train a model. The premise is that this is a comment moderation dataset: these are comments that have been removed by moderators. So we're going to train a model that cleans the internet of, I guess, bad language, et cetera. Of course, it's never that easy in such a wild, wild west. But first, let's take a trip to the past, or more specifically, a trip to the present.
More specifically, you can see in this architectural diagram the concept of ETL, which stands for extract, transform, load. What it basically means is that, in the old-school approach, when you have data in one place, you extract that data, do something with it, and put it back. That's the whole concept of ETL. From that perspective, ETL is the traditional way in which most data engineering problems have been tackled. Real time is still something that is out of reach for a lot of organizations, and they're still stuck in batch processing, as it's often referred to. And within ETL there are variations: you have ETL, you have ELT, you have EL, you have LT, many different variations of this approach. The problem is that there isn't just one tool that rules them all. There are actually a lot of specialized tools that are fit for purpose. Some are a bit more useless than others, but a large number of them are very, very specialized, each being the popular or primary tool for a specific area. So unfortunately, when dealing with not just machine learning but general data analytics, you have a whole host of different technologies to interact with and a different set of patterns to deal with.

Now, let's have a look at how some of those terminologies map onto the tools. For extract and load, basically moving data around, you have tools like NiFi and Flume, which let you pipe data from one database to another, perhaps in another format. What you'll see is that these tools tend to be a little ambiguous, because some fall into the categories of others. For example, NiFi does actually do transformation, even though its main premise is transfer. Then in the ETL category you have things like Oozie or Airflow, which you have most probably heard of. And then you have old-school ETL where the human is the E, the T, and the L, which in itself is also a form of ETL. And then you have others like ELT, the data warehousing approach, where you have an extract and a load, and the actual processing happens in the database itself.

Now, when we start talking about the future, the question becomes: yesterday was all about batch, and now we need to move into the streaming world. What is the difference between those two? With batch processing, it's what the name suggests: you take a chunk of data and you expect that chunk to be processed together. In real time, you expect each data point to be processed as it arrives. And you can imagine that, without the right infrastructure, that in itself would be much more costly if you had to spin everything up and get everything ready for each data point. Fortunately, there have been so many advancements, not only in the data processing components but also in infrastructure and general computing, that real time has become much more feasible. But one key thing is that, more often than not, it's batch and streaming together.
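To make the batch-versus-streaming distinction concrete, here is a minimal, hypothetical Python sketch; the function and variable names are illustrative, not from the talk's codebase:

```python
# Hypothetical illustration: the same work done as a batch job vs. per event.

def transform(record: dict) -> dict:
    """Placeholder transformation applied to a single record."""
    return {**record, "processed": True}

# Batch processing: take a whole accumulated chunk and process it in one go.
def run_batch(records: list[dict]) -> list[dict]:
    return [transform(r) for r in records]

# Stream processing: handle each data point as it arrives from some source.
def run_stream(source):
    for record in source:          # `source` could be a Kafka consumer, a socket, ...
        yield transform(record)    # emit results immediately, one event at a time
```

This also hints at the unified view mentioned next: a stream is just a batch of size one processed continuously, and a batch is just a stream consumed as one big window.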
But you will rarely find a situation where it's only batch or only streaming. There is usually a combination where both of these components fit together. From that perspective, in certain situations you may want to batch something, and then, as something else comes in, still process it in real time. Or the other way around: you may want to process and clean things as they arrive, and then run batch jobs on the data that has already been ingested. And the more interesting thing is that there is now a convergence of both worlds. The programmatic interfaces and libraries that you used to deal with batch were very different from the ones for streaming, because they were different components. But newer work in the field has enabled a higher abstraction that lets you say: streaming is, in a way, just batch processing in single-element batches. But then you can also have window-based batches, and if you take a window the size of the entire stream, that in itself is a batch, right? So from that perspective, the programmatic interfaces, the SDKs, the libraries are adopting a unified interface to deal with both problems.

So now let's dive into some streaming concepts. The first one, which I already touched on, is windowing, and there are different types of windowing. You have the traditional tumbling windows, which basically say: I'm going to take a chunk, process the data seen in this period, and then move on to the next chunk. But you also have sliding windows, which overlap as they move along the data. Again, you could say these are batches of data, and in a sense some systems implement this with batch capabilities, just run very often; that abstraction shows up again in the implementation.

You have another concept called checkpointing, and as the name suggests, similar to a checkpoint in a game, what checkpointing does is keep track of the stream's progress. This is important if you have many consumers, or many groups of consumers, continuously reading a specific stream of data: if one suddenly crashes, you need to know where it last got to. And the reason this matters is that it introduces delivery semantics such as at-least-once, at-most-once, and exactly-once processing, and which one you choose depends on where the biggest risk lies for your specific use case. There are a lot of concepts here that we're powering through, just to get enough insight to be able to jump into the use case.

And there's a slightly harder-to-grasp concept called watermarking. Whenever you're talking about windows and you say, I want to process data from this point to this point, the question is: are you talking about the time the event was generated, or the time the event arrived? The reason this is important is that if you say, I want windows based on the time the event was generated, that means an event may arrive late, right?
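Since Faust, the Python stream processor used later in the talk, exposes these windowing concepts directly, here is a minimal sketch of tumbling versus overlapping windows (Faust calls the overlapping kind "hopping"). The app, topic, and table names are made up for illustration:

```python
import faust
from datetime import timedelta

# Hypothetical app and topic names for illustration.
app = faust.App("windowing-demo", broker="kafka://localhost:9092")
comments_topic = app.topic("comments")

# Tumbling window: non-overlapping 10-second chunks of the stream.
tumbling_counts = app.Table(
    "tumbling_counts", default=int
).tumbling(10.0, expires=timedelta(minutes=10))

# Hopping window: 10-second windows that slide forward every 5 seconds (overlapping).
sliding_counts = app.Table(
    "sliding_counts", default=int
).hopping(size=10.0, step=5.0, expires=timedelta(minutes=10))

@app.agent(comments_topic)
async def count(comments):
    async for comment in comments:
        tumbling_counts["seen"] += 1
        sliding_counts["seen"] += 1
        # .now() reads the value for the window containing the current time.
        print("this 10s window so far:", tumbling_counts["seen"].now())
```

Checkpointing, in this setup, comes largely for free: consumer progress is tracked as Kafka offsets, which is what enables the at-least-once / at-most-once / exactly-once trade-offs just mentioned.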
So you may not take that event into consideration in its window, because even though it was supposed to be there, it arrived later. What watermarking does is basically say: I want to keep a buffer, so that if something that was supposed to arrive earlier actually comes later than the deadline, I can still take it and include it in that older window. That's basically what watermarking is, as a simple concept. So these are some of the high-level concepts of streaming. They encompass quite a lot, and once you get your head around them, they start becoming a little more intuitive. Of course, there's still state and other things to deal with, but that in itself is a good enough introduction to dive into the next steps.

There are multiple tools that provide you with the ability to do stream processing. Flink, which supports multiple languages and also lets you abstract some of those stream processing components. Kafka Streams, which is part of Kafka, the message broker that we're going to be using. Spark Streaming. Faust, which is the Python library that we're going to be leveraging to do all of our stream processing. And then other components like Seldon, which we're going to be using for our machine learning.

So what are we leveraging today? For the stream processing, we're going to store all of the incoming data on our Kafka queue, and the stream processing will be done with Faust. The machine learning will be served with Seldon Core, and the machine learning itself will be done with scikit-learn and spaCy. So let's delve into what that looks like.

What is the traditional machine learning workflow? Well, in traditional machine learning, you take some data, you transform that data, you feed it into a model, you train the model, and then you rinse and repeat. Once you're happy with the accuracy, you persist that model, you deploy it, and then new, unseen data comes in and you get new predictions as it flows through. So first you train a model with historical data, and then, once you're ready, you persist it and deploy it, and new data yields new predictions on unseen data. That's what we're going to be doing.

So first we're going to train a model. What do we have? 200,000 comments from angry redditors. Those comments come in as text, so we first have to convert that text into something the model can read and learn from, and then we're going to train a model with that incoming data. What is the model going to try to do? It's going to try to predict whether a comment would have been moderated or not. That's basically what the model does. So first we clean the text, we convert it into tokens (we're going to see what that means), we pass it through a vectorizer to be converted into a vector, and then we pass it into our model, which in this case is a logistic regression model.

But what does that actually mean? For a more intuitive perspective, let's take a standard, civil Reddit comment: somebody just wrote, "you are a dummy". So what happens? First we clean it, we pass it through the text cleaning step, and it becomes "you are dummy": we're removing all of the stop words.
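As a minimal sketch of that cleaning step: the talk's repo presumably has its own stop-word list, so here we use a tiny illustrative one so that "you are a dummy" becomes "you are dummy", matching the example above:

```python
# Illustrative stop-word list; the actual repo's cleaning step may differ.
STOP_WORDS = {"a", "an", "the"}

def clean(text: str) -> str:
    # Drop stop words, keep everything else in order.
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

print(clean("you are a dummy"))  # -> "you are dummy"
```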
Then we tokenize it. What happens there? Well, we convert it into tokens, which in this case abstracts pronouns into the "-PRON-" token. It also performs normalization on our data, lemmatization, et cetera, so that we can pass it along in a much more standardized way. Then we can convert it into a vector through the vectorizer: we get what are referred to as one-hot vectors indicating which tokens we have. And then we pass that into our model, which gives us a prediction, a one or a zero. A one basically means it should have been moderated; a zero means it wouldn't have been, it's a very nice comment. And as we would have assumed, this one would have been moderated if you were to pass it through our model; that's actually what happens if you do.

Now, I'm not going to dive into the very specifics of training the model itself, but one interesting thing is that you can go to the repository and see some insights that were uncovered when going through this dataset. You can see breakdowns of the types of features throughout the dataset, and you can also see some of the models that were compared (logistic regression, LDA, k-nearest neighbors, et cetera) and how they performed. So if you're curious, please do go check it out. In that repo you can find all of the data analysis.

Right before I go to the next slide, I just want to confirm: can you hear me well now? (Yeah, we're good.) Okay, cool, I'll leave the VPN on. So what we're going to do now is take that model that we already trained and deploy it; we're going to put it in production. How are we going to do that? Well, we're going to need the next few components. We're going to need the stream processing components that move data around. We're going to need our queue, which stores all of the streaming data in what are referred to as topics; each topic stores a set of messages, and that's basically how our stream is going to be managed. And then we're going to have the serving component, which serves our model. We're going to see how each of those gets created, but first, let's look at the flow.

What we're going to have is a stream of incoming Reddit comments that our first stream processor takes in and pushes into the queue, into the first topic, called the Reddit stream. Our next stream processor listens to that topic, and as soon as a message arrives, it takes it and sends it first to the prediction API, to see whether that comment should be moderated or not based on the prediction; then the response is sent to the topic where all of the responses are stored. And if the prediction is "should be moderated", we also send it to the alert topic. So that's basically what we're going to be doing. Let's dive into each of the components, starting with the Reddit source. To generate Reddit comments, we're going to be leveraging Faust.
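Here is a hedged sketch of the kind of clean → tokenize → vectorize → classify pipeline just described, using scikit-learn and spaCy; the actual repo's feature engineering and model comparison are richer, and the names here are illustrative:

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Assumes a small English spaCy model is installed (python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

def spacy_tokenizer(text: str) -> list[str]:
    # Lemmatize for normalization; note spaCy v2 lemmatized pronouns to "-PRON-",
    # which is the token abstraction mentioned above.
    return [tok.lemma_ for tok in nlp(text) if not tok.is_punct]

pipeline = Pipeline([
    # binary=True gives one-hot-style vectors of which tokens appear.
    ("vectorizer", CountVectorizer(tokenizer=spacy_tokenizer, binary=True)),
    ("classifier", LogisticRegression()),
])

# comments: list of raw comment strings; labels: 1 = removed by moderators, 0 = kept.
# pipeline.fit(comments, labels)
# pipeline.predict(["you are a dummy"])  # -> array([1]) if flagged for moderation
```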
In Faust, once you create the app component, if you've used Flask in the past, it's something as simple as `app = faust.App(...)` with the app's name and the location of your queue. That's basically it. Once you have the app, you can define, with a decorator, exactly which functions you want to be running. This particular one is not going to be listening on the queue; it's just going to run every 0.1 seconds, iterating. What it does is fetch a Reddit comment, create the data, and then push it into the Reddit stream topic, via the topic's send. So that's what sends it to the Reddit stream.

Then we move on to developing the ML predict piece. Again, with `app.agent` we create an agent on the Reddit stream topic: it listens to the stream we just created, and all of the comments come in. We then send each one to Seldon, which is the next thing I'll show, get back the response on whether it should be moderated or not, and send it to the prediction topic. And if the probability is high enough, we also send it to the alert topic. That's basically what this one does.

Then we actually serve the model. How do we serve a model? That prediction function you saw is basically a REST client that uses the Seldon client to send a prediction request to a URL inside the cluster, and the response is what gets forwarded on with the predictions. And what does that look like with Seldon model serving? It's basically taking the code that we just trained and converting it into a fully-fledged microservice. When you wrap models with Seldon, you create a Python class, and everything you put in the `predict` function is what gets exposed in the RESTful API. In this case, we pass the input to our model: if you remember the model we trained before, the request just goes through all of those steps and the predictions are returned. So anything you send to the REST API gets passed through the `predict` function, goes through the transforms, and returns what our model predicted. That's how Seldon wraps models.

Now, let's recap that flow. A new comment comes in. We created a stream processor that fetches comments and pushes them into the Reddit stream topic. Then we have another stream processor that fetches everything from the Reddit stream and sends it for prediction via the RESTful API. All of the predictions get pushed to the prediction topic, and all of those above a probability threshold, that is, those predicted to need moderation, get sent to the alert topic. So that's basically it.
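Putting the pieces together, here is a hedged sketch of what those two Faust processors could look like. Topic names, the in-cluster URL, the threshold, and the `fetch_comment` helper are hypothetical stand-ins for the repo's actual code, and it uses plain `requests` against Seldon's REST prediction endpoint rather than the Seldon client helper mentioned in the talk:

```python
import faust
import requests

app = faust.App("reddit-moderation", broker="kafka://localhost:9092")

reddit_stream = app.topic("reddit_stream")
predictions = app.topic("predictions")
alerts = app.topic("alerts")

def fetch_comment() -> dict:
    # Hypothetical helper standing in for the repo's Reddit comment fetcher.
    return {"body": "you are a dummy"}

@app.timer(interval=0.1)
async def produce_comments():
    # Source processor: fetch a comment every 0.1s and push it to the stream.
    await reddit_stream.send(value=fetch_comment())

def predict(comment: dict) -> float:
    # REST call to the Seldon-deployed model; URL and payload shape follow
    # Seldon Core's v1 REST protocol, but the service name is illustrative.
    payload = {"data": {"ndarray": [comment["body"]]}}
    resp = requests.post(
        "http://seldon-model.default.svc.cluster.local/api/v1.0/predictions",
        json=payload,
    )
    # Assumes the wrapped model returns one moderation probability per input.
    return resp.json()["data"]["ndarray"][0]

@app.agent(reddit_stream)
async def moderate(comments):
    async for comment in comments:
        proba = predict(comment)
        await predictions.send(value={**comment, "proba": proba})
        if proba > 0.5:  # illustrative threshold
            await alerts.send(value=comment)
```

And the Seldon wrapper itself could look roughly like this; the class name and model path are assumptions based on Seldon Core's Python-wrapper convention, not the repo's exact code:

```python
import joblib

class RedditClassifier:
    """Seldon Core Python model wrapper: everything in predict() is what
    gets exposed behind the generated RESTful API."""

    def __init__(self):
        # Load the persisted scikit-learn pipeline trained earlier (path assumed).
        self.pipeline = joblib.load("pipeline.joblib")

    def predict(self, X, features_names=None):
        # X arrives as the payload's ndarray; return moderation probabilities.
        return self.pipeline.predict_proba(X)
```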
And ultimately, the code required is basically this: wrap your model and containerize it. That's how you build your own server for the model. You need the component that receives the raw comments and sends them to the machine learning service, and you need the thing that continuously collects Reddit comments. And if you take a look at the repository, this will probably be much more intuitive, because you can try it yourself; I think that's the main thing.

One thing to mention is that one of the things we're doing with Seldon is simplifying the way this works. Right now you need a stream processor to listen to the input data, send it as a request, and then push the result into the next topic. We're now releasing, in the open source, the ability to not just expose the RESTful API but to connect directly to Kafka. This is quite interesting because it allows you to deploy your model and have it consume topics directly and push to topics directly. That's something we're currently working on; it's an open PR, so for the community, we would love to hear any feedback. Do feel free to come in and add your thoughts, and we always welcome contributions, whether they're thoughts, suggestions, or comments in there.

And with that, I think we've managed to go through all of the core components. We covered a conceptual introduction to stream processing; we dove into the concept of real-time machine learning, given the nuance of having to train the model; and we discussed some of the trade-offs across the different concepts, terminology, and types of stream processing approaches, together with a hands-on use case tackling the challenge of, I guess, angry people on the internet. With that said, thank you very much. I'll be on the Discord; you can find the talk's channel if you search for "streaming", and if you have any questions or any ideas for extensions to this, I'm more than keen to delve into them. So with that, I'll pause and pass it back over to Jason.

Thank you very much, Alejandro, excellent talk. So, like he said, please check out his room over there in Discord. In Discord, just hit Ctrl+K, type in the name of the talk, or part of it, and you'll find it, and you can ask your questions there. So thank you very much for joining us.