My name is Pete McKinnon, from the Red Hat AI Center of Excellence. Okay, so this morning's talk is Scalable Kafka Deployment on OpenShift for Machine Learning. We're going to go through and talk about various concepts. I'm Pete McKinnon, as we all know now, and that is Malik Shah, and we both work in the Red Hat AI Center of Excellence. Next slide. So, we're going to do a quick overview of what it means when we're talking about stream processing for machine learning, and then talk about the Kafka project itself and Strimzi, and the relationship between those two open source projects. Then Malik's going to lead us through a demo. And then he'll wrap up and talk about some of the stuff we learned in deploying this in the internal data hub that we use at Red Hat for processing large amounts of data and integrating with various components, plus some information about monitoring, and then Q&A. So, next slide, please.
So, there's various ways to, as they say, skin a cat, but in this context, when we're talking about stream processing and machine learning, we're talking about gathering huge amounts of data, ingesting it, and feeding it into a system where the ultimate result is a model produced through various machine learning development techniques. At the very top we have applications at the edge, IoT devices, databases, and they're all pumping information into a stream processor. On the front end, for a data scientist doing development of the model, there are various things that need to be done, and it's perhaps more the domain of a data engineer to take care of the ingest and cleaning of the data, splitting it, and then validation. There are various tools and techniques for doing that. Then there's the data science aspect of it: the analysis of the data, and coming up with a model using various types of techniques. Perhaps the selection for this particular application should be a convolutional neural network or something like that. Finally, you go into model training, and after a period of time, when you've done model training and comparison of various techniques, perhaps there's hyperparameter tuning — the information that informs the model, not the data within the model itself, but the parameters around it. And that resolves to a model. At that point, when we're doing serving of a model, we're talking about inference. So again, from a stream processor that has ingested information from the edge or IoT or whatever, you can throw that data directly at the model for inference. So there are various paths, if you will, that involve the stream processor: there's model development, and then there's model inference itself. Next slide.
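The development path described above — ingest, split, train, validate, then serve — can be sketched in a few lines. This is a minimal illustration, not the presenters' actual notebook: it uses synthetic data in place of a real stream, and assumes scikit-learn is available.

```python
# Minimal sketch of the model-development path: split -> train -> validate.
# Synthetic data stands in for the ingested stream; scikit-learn assumed.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_model(X, y, test_size=0.3, seed=42):
    """Split the data, train a random forest, and report holdout accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    return model, model.score(X_te, y_te)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 29))          # 29 feature columns, as in the demo
    y = (X[:, 0] > 0).astype(int)            # trivially separable stand-in labels
    model, acc = train_model(X, y)
    print(f"holdout accuracy: {acc:.2f}")
```

The trained model is then what sits behind the serving layer; the stream processor's job is only to move records to and from it.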
So one of the leading stream processors out there, perhaps the de facto standard these days, is Kafka. It's an Apache project, and it is a distributed, fault-tolerant stream processing platform. It's JVM based, written in Java and Scala, and they build it with Gradle. There's a company behind it that brought it to open source but also provides enterprise support and value-add, and the name of that company is Confluent. So they have a mixed model: there's open source, and there are some proprietary features that customers can get in supported versions to enhance the core of Kafka itself. It's basically represented by four APIs: producer, consumer, streams, and connectors. We'll go through the architecture on the next slide. But basically, the use cases for this stream processor, for Kafka: simply as a pub-sub type of message bus, so you can have message queues and eventing from messages passing through the system. It can also be used for highly available data storage; in fact, in our internal data hub, we use it as a data caching layer, basically, and Malik will maybe speak to that more later on. And then there's processing of the data as it travels through the system in real time, and there are APIs for peering into the message stream and extracting information from it; there's an add-on called KSQL for that sort of capability. General architecture: I mentioned the APIs, so basically producers and consumers. Probably most of you are familiar with the general pub-sub model, which is implemented by various different tools and frameworks, but Kafka has become very popular in the open source world, and it's been highly adopted in the enterprise because of its performance, scalability, and reliability.
On the producer side, you have applications generating messages through that API into the Kafka cluster, and then consumers taking those messages off the bus. And then stream processors — I mentioned KSQL — those can be written as applications to basically inspect the traffic traveling through the cluster. And then there's a whole portion of Kafka devoted to connectors, and there's quite a variety of different types, for different types of databases and so on. What are some of the examples, Malik? Well, Kafka to S3. Yeah, that's something we use internally, and it's streaming around maybe 50 gigs of data every day. So that's the overall architecture at a high level. We'll go to the next slide. So the internals: the best way to think of how Kafka operates is as a distributed commit log. You have these streams of records, and they're organized into what are called topics, and those topics are partitioned. One of the important features I mentioned with Kafka is that it's highly available and resilient, and part of the way that's achieved is by replicating these partitions across servers in a Kafka cluster. You can configure the number of replicas depending on the requirements of your overall messaging application. For each partition there's always one leader, and you can have zero or more followers. Again, it's a decision you make about the sweet spot for what you will tolerate for high availability and resilience. The leader is in charge of the reads and writes to a partition, and the followers literally come up behind and passively replicate from the leader.
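To make the topic/partition/replica relationship concrete, here is a sketch of creating a topic programmatically with the kafka-python package. The bootstrap address and topic name are placeholders, and the broker connection only happens when this is run as a script against a real cluster:

```python
# Sketch: creating a replicated, partitioned topic.
# BOOTSTRAP and the topic name are hypothetical placeholder values.
BOOTSTRAP = "my-cluster-kafka-bootstrap:9092"

def topic_spec(name, partitions=3, replication=3):
    """Return the topic settings discussed above as a plain dict."""
    return {"name": name,
            "num_partitions": partitions,        # units of parallel consumption
            "replication_factor": replication}   # copies spread across brokers

if __name__ == "__main__":
    # Requires the kafka-python package and a reachable broker.
    from kafka.admin import KafkaAdminClient, NewTopic
    admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
    admin.create_topics([NewTopic(**topic_spec("payments"))])
```

With a replication factor of three, each partition gets one leader and two followers, which is what makes losing a broker survivable.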
So for high availability, if a leader was attached to a process and that node went down or went away, then one of the followers in that group can be elected, and that's the role of ZooKeeper: maintaining quorum within the cluster. So that's the high availability part of it, and then there's also load balancing. Because servers can act as both leaders and followers — though not for the same partition — we can assign different partitions so that leaders and followers are maintaining the shards, if you will, the slices of the partitions. Again, that means we can distribute the processing traffic across the various servers in the cluster. So, anatomy of a topic: we have partition zero, partition one, partition two. Like I say, it's basically a commit log. Records are written to these partitions at various points in time, and then the consumers are reading from those partitions, consuming them based on the model for how they want to consume the messages. Next slide. So, for Red Hat, what does this all mean? Kafka sounds great, and it actually runs great on Kubernetes, and it runs great on OpenShift. Most of you are familiar with OpenShift, I would think: it's basically our enterprise-grade Kubernetes distribution. It is Kubernetes; it's not a fork or anything like that. It has productivity extensions for developers, as well as enhanced security for enterprise customers that need to rely on that for their operations. The newest version of OCP, as we call Red Hat OpenShift, is operator-centric. If you look at OpenShift 4, everything is composed of operators. It is sort of the defining paradigm going forward for OpenShift, and I would say for the Kubernetes community in general. So the control plane in OpenShift 4, the infrastructure, is all defined by operators, and those operators are essentially controllers: they manage the lifecycle of specific components.
It also has native integration with the Operator Lifecycle Manager, which means you can pull in different types of operators for your application and install them into the system, and it renders an application for you — it could be Couchbase, or perhaps PostgreSQL. It takes care of everything you used to have to do as a human operator by manually configuring the master and all that; that's all taken care of for you. The reason we're talking about that is the next slide, which is Strimzi. Strimzi is a project that was started by Red Hat, and it is basically an operator for Kafka. So this is an open source project, and the general architecture here is that it's actually a congress, if you will, of operators. There's the cluster operator, responsible for managing the actual broker and worker nodes and the ZooKeeper quorum. Operators are typically defined with custom resources — perhaps you've heard the term custom resource definition — and there are custom resources that also constitute Strimzi: one for the user operator and one for the topic operator, and we'll talk about those on the next slide. So again, the cluster operator manages the lifecycle of the brokers, basically the message bus itself, those components, those replicas; that's what we'd refer to as the Kafka control plane. Then there's the topic operator, which manages the lifecycle of a custom resource called KafkaTopic. Those topics can be deployed as part of your application: say you built an application where part of it is the message bus provided by Kafka, you can define these topics in advance, and they will be deployed and managed as part of the application. And then there's KafkaUser, and that custom resource is responsible for defining the authentication for any producers or consumers of messages on the message bus.
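As a sketch of what those custom resources look like — the cluster name, topic name, and exact API version here are illustrative; check the Strimzi documentation for your release — a KafkaTopic and a KafkaUser might be declared like this:

```yaml
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaTopic
metadata:
  name: transactions
  labels:
    strimzi.io/cluster: my-cluster   # which Kafka cluster this topic belongs to
spec:
  partitions: 3
  replicas: 3
---
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaUser
metadata:
  name: fraud-consumer
  labels:
    strimzi.io/cluster: my-cluster
spec:
  authentication:
    type: tls                        # client authenticates with a TLS certificate
```

Applying these is enough; the topic and user operators watch for the resources and reconcile them against the running cluster.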
So I'm just going to go back to this one slide. Yeah, anything you want to correct me on? No. So the demo I'm doing is like the stream processor, which is going to send one record, probably every one or two seconds, to the serving layer. So right now we have OpenShift running on some AWS servers, and I have deployed JupyterHub, which is basically a server for Jupyter notebooks. We have three Kafka brokers and three ZooKeeper nodes, plus Prometheus — you just provide it the endpoint and it scrapes all the metrics — and Grafana to visualize all those Prometheus metrics. And here is the Strimzi cluster operator, which manages all of these. So here is the model training notebook, and I'm just going to restart it. This first part is just getting the dependencies. Do you want to scan the code? Yeah. So here we are getting the kafka-python library, which lets us actually interact with the API, and the other standard data science libraries. And here we are reading a creditcard.csv. This is a credit card fraud monitoring use case, where we are training on some 28,000 rows of credit data. We are splitting it into, I think it's a 70/30 or 80/20 split or something, and then training a random forest classifier on it. What we are dropping is the Time and Class columns; if Class is one, it's a fraud, and if it's zero, it's not a fraud. This should take around 30 seconds, so in that time I'm going to go through the consumer code. Let's just give it a fresh topic name. So one thing to keep in mind: always define a consumer group when writing a consumer, because if you don't specify one, every time the consumer starts it gets assigned a new consumer group. The memory overhead of that really builds up, because each consumer group is stored on a topic along with its consumer offsets, and it grows very quickly if you have a consumer restarting every few seconds.
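The consumer-group advice above can be sketched like this: pin a stable `group_id` so offsets are tracked under one group rather than a fresh auto-generated one on every restart. The server address, topic name, and group name are placeholders, and the broker connection only happens when run as a script.

```python
# Sketch of a consumer with a pinned group. Names are placeholders.
import json

CONSUMER_CONF = {
    "bootstrap_servers": "my-cluster-kafka-bootstrap:9092",
    "group_id": "fraud-detector",        # stable group => stable, reusable offsets
    "auto_offset_reset": "earliest",     # where to start when no offset is saved
    "value_deserializer": lambda v: json.loads(v.decode("utf-8")),
}

if __name__ == "__main__":
    # Requires the kafka-python package and a reachable broker.
    from kafka import KafkaConsumer
    consumer = KafkaConsumer("transactions", **CONSUMER_CONF)
    for record in consumer:
        row = record.value               # one decoded transaction dict
        print(record.offset, row)
```

Without `group_id`, each restart creates a new group whose offsets are persisted in the internal offsets topic, which is exactly the overhead the talk warns about.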
And that's something we saw. Here I've defined a predictor where we provide a tuple of 29 columns, and based on the model it gives you a prediction of whether it thinks it's a fraud or not. Coming back to this: I'm reading a record from the consumer as and when it arrives, getting rid of a few columns, turning it into a tuple, and providing this predictor with that list of 29 columns to see what the model thinks. So it's still training; it was quicker when I checked yesterday. Well, while we wait for this, let's go to the other notebook. Yep, again, the same thing: first get the dependencies. Here what I'm doing is read the CSV and split it into two data frames, one fraud and one not fraud, because you have something like 400 frauds and 28,500 non-frauds, and there's a fair chance in the demo you might never see a fraud come up, so we split them. This is my function to actually send the message to the topic; it's literally just four lines: import the producer library, define the producer, provide it with the server address, and define the topic and what you want to send. And here there's just some logic to alternate: send five non-frauds and then send one fraud, just for demo purposes. Let's run this, and we can start sending messages. So this is how many non-frauds we have, and there are like 500 frauds in the data set, and it should now be sending messages. Let's go back here. Yep, the model's trained; it's just printing out the dimensions, like it's expecting 29 columns, and printing some data about the model: it's a random forest classifier with all of these parameters. And we start consuming. So these are all the messages already received, and it should get one every three or four seconds. Zero is not fraud. Yep. And this is basically the serving layer: if you have some incoming data, you can go via Kafka, just so it provides resiliency.
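The "literally just four lines" of producer code described above look roughly like this with kafka-python. The server address and topic name are placeholders; only the serialization helper runs without a broker.

```python
# Sketch of the producer side: serialize a record and send it to a topic.
import json

def serialize(row):
    """Encode one record (e.g. a transaction dict) for the wire."""
    return json.dumps(row).encode("utf-8")

if __name__ == "__main__":
    # Requires the kafka-python package and a reachable broker.
    from kafka import KafkaProducer
    producer = KafkaProducer(
        bootstrap_servers="my-cluster-kafka-bootstrap:9092",
        value_serializer=serialize)
    producer.send("transactions", {"amount": 42.0, "v1": -1.36})
    producer.flush()                     # block until the message is delivered
```

The demo's alternation logic (five non-frauds, then one fraud) is just a loop around `producer.send` with rows drawn from the two data frames.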
Now, speaking of scalability: what if we want to scale this thing up? Say you're expecting an influx of data and you don't want to break anything. Going back, we talked about operators, and we are using the Open Data Hub operator, which has Strimzi integrated into it. All I need to change is this part: just change the broker replicas to five. As you will have seen, we have three Kafka broker replicas right now. Apply the new custom resource, and the pods should start terminating in a minute. So it has to go through the reconciliation? Yeah. For the Open Data Hub operator? Yeah. I'm just noticing you're updating the ODH operator to scale. Right — I can go through the playbook.yaml, but if it finds one specific component that was updated, it will only update that component. So JupyterHub and all of those, none of those would be affected. Is that your question? Yeah. I'm just going to restart it. This sometimes fails with OCP 4; well, that's because I'm running the operator on my machine, I'm not using the image. Want to come back to that? While we're waiting for the brokers to scale: it's just going through the reconciliation loop, and Kafka is literally at the end of it. This is the Open Data Hub. So the Operator SDK supports the creation of three different types of operators: you can generate a Golang operator, it can also be used to create an Ansible-based operator, and there's also a Helm operator. This one is an Ansible operator. Of the operators out in the wild, most are written in Golang, and there are other toolkits out there, like Kubebuilder, for generating Golang operator scaffolding, but this one was built with the Operator SDK, which came from CoreOS, which of course is part of the Red Hat family now.
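The scaling step described above amounts to editing the `replicas` field of the Kafka custom resource and re-applying it. An abridged sketch — real resources also carry listeners, resource limits, and so on, and the cluster name, API version, and storage size here are illustrative:

```yaml
apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 5          # was 3; the cluster operator rolls brokers to match
    storage:
      type: persistent-claim
      size: 100Gi        # each broker gets its own PVC of this size
  zookeeper:
    replicas: 3
```

Once applied, the cluster operator's reconciliation loop notices the difference and creates the two additional broker pods without disturbing the existing ones.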
So as you can see, the Strimzi operator is shutting down one Kafka broker at a time and spinning it back up, and then it should add two more in about half a minute. So we scaled from three Kafka brokers to five Kafka brokers, just because sometimes you expect more data. This is all on the fly; the notebooks should not be affected, and it's still consuming and producing data as we speak. If I understood, you used 28,000 transactions for training, and you're now scaling up for your production data? Yes. Can you repeat the question? My understanding is that you used 28,000 transactions for training. Yes. And what is the size of the production data? Well, right now the production data we face internally is around 20,000 messages per second at its peak. I'm going to go through that in the next slides. And yep, two more brokers just came up. Let's see, how's Jupyter doing? Still going. And that's what the talk is mainly about: scalability. We can change the Kafka configuration and things on the fly, using the Strimzi operator — the ODH operator now — without any effect on our production systems. Does it cause any disruption in production? No, it's just changing the broker number. And we actually did that internally too, when we onboarded a new customer, the Quay team. I don't know how much it was exactly, but basically they sent us more data than all of our other customers combined in a single day. During that time we had to scale up our consumers and Kafka brokers for maybe two days, and it was literally just clicking a button, like plus, plus. It was that easy. In our internal data hub we don't do auto scaling; we're not tied into Prometheus for detecting load or anything. No, we haven't done that yet, but it's probably on our roadmap at some point. It's on Strimzi's roadmap right now, actually. Then we'll just enable it instead of writing the logic for it ourselves.
Right. So is scaling back down the same? Yeah, just change five back to three and you should be good to go. It's slightly tricky, which I'm going to talk about in the next slides; there are certain things to keep in mind when you're scaling down. But scaling up is just changing the number — assuming your OpenShift cluster has the resources, obviously. So, yeah, last year, when we were doing development and trying to build out the internal data hub, there were some of those questions about, you know, do we need more memory? Yeah, I remember the memory thing. So now I'm going to talk a bit about monitoring. The Open Data Hub operator comes with monitoring built in, and if you want monitoring out of the box, all you need to do is set this flag to true. It comes with a base set of dashboards. So let's just create some chaos: oc delete one pod randomly. The count should go down briefly. So as we see, a pod failed; the operator just got rid of it and started spinning up a new pod on its own. The dashboard should have updated, but I don't know what happened there. But as we can see, the demo is still going, and as we said, given our replication factor, we can handle up to four Kafka pods failing and one ZooKeeper pod failing. And yeah, that was basically the demo. So, lessons learned from production. As I said, it's easy to scale up, but while scaling down you need to keep the replication factor in mind, because you can't just scale down the replica count for the topics and their partitions; they need to be manually reassigned. So internally, what we do is we keep the replication factor at three, and at any given point of time have at least three pods, with maybe 10-terabyte PVC volumes each, and that just works. So we can handle up to two pods failing.
So the partitions might be offline, but there's still no data loss, because the in-sync replica count is two; I'm going to go through that on the next slide. The consumer group offsets topic replication count: as I said, there's basically a topic created for each consumer group, and it keeps a count of what the last message read was, what the first message in each partition is, and what the last message in each partition is — partition, not topic, exactly. Partitioning comes with a lot of memory overhead, so just be mindful about how many partitions you create for a topic. More partitions mean more parallel consumption of data, but they also mean an extra 30 to 40% memory overhead. Coming to persistent volume claims: if you're running Kafka in a persistent state, have about 30% extra volume capacity over what you expect on average, because if you get a data burst, the pods will fail if they run out of space, and then it's a hassle getting them back up. ZooKeeper: Strimzi comes with rack awareness, and three ZooKeeper pods run well for maybe 1,000 Kafka brokers without any issues, but it always needs a majority of them available at all times. So given your network setup and your node setup, make sure you have at least three ZooKeeper pods — even five is not a bad idea, given how prone they are to failure. With multiple brokers, what we saw internally was two of our brokers handling all of the load while three of them were just hanging out in the cluster, and we had to go through a load rebalancing. So it's always a good idea to let Kafka balance the load instead of doing it yourself and keying in the values. And consumers: always define a consumer group while consuming. I can't say this enough: just make sure you specify a consumer group; don't let it automatically create consumer groups. And this is what our internal application looks like: offsets topic replication factor three.
So each of our Kafka brokers has one copy of the consumer offsets. The transaction state log minimum in-sync replicas is two, which is what I said: if at least two copies of a partition are available, it's considered available, and it keeps producing and consuming data without issues. We retain data for 48 hours, or until the volume is full. Two partitions per topic by default, because that's how our consumption works. Default replication factor three, so each Kafka broker has one copy of each partition, and that's how we can handle two pods failing without worries. Number of recovery threads per data directory is just defaulted to one. This is one thing we found useful: auto.create.topics.enable set to false, so we can control who we are onboarding and not allow people to randomly write data to our Kafka instance, since it's open to everyone internally. And this is what our monitoring dashboard looks like; this is the one we saw in the demo. On top of that, there's another project on GitHub, a Kafka consumer lag exporter, which provides us with one valuable piece of information: how are the consumers keeping up? If there's a burst of data, we might also want to scale up our consumer count, so this is something we alert on: if the lag goes past maybe 3,000 messages per second, it lets us know so we can scale our consumers up. Yeah, that's it. Questions? So, in our internal data hub, the data is flowing through the system and being stored in Elasticsearch, and beyond that in S3. So Kafka is in the middle of all this processing, sending stuff into what you'd call a data lake, basically, and that's accessed by a lot of different groups within the company for analytics and for model development. It serves quite a variety of use cases, even persistent log storage.
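Collected in one place, the broker settings just described would sit under the Kafka custom resource's config section, roughly like this. The values are taken from the talk; the key names are the standard Kafka broker properties, and the surrounding YAML shape follows Strimzi's `spec.kafka.config` convention:

```yaml
spec:
  kafka:
    config:
      offsets.topic.replication.factor: 3   # each broker holds a copy of consumer offsets
      transaction.state.log.min.isr: 2      # two in-sync copies => partition stays available
      log.retention.hours: 48               # hold data 48 hours, or until the volume fills
      num.partitions: 2                     # default partitions per topic
      default.replication.factor: 3         # survives two pods failing
      num.recovery.threads.per.data.dir: 1
      auto.create.topics.enable: false      # topics only via the topic operator
```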
So, when you were scaling up, you said you connect a PVC with it; when you scale down, do you delete the PVC, or is the PVC itself created in your script when you're scaling up? It's a PVC of its own; it's part of the Strimzi offering. Okay, so basically when you create it, it creates a PVC, and when you scale down? Yes, it creates a PVC if you have volume available. Yeah, and there's no loss of data, based on the Kafka architecture, because of the replication of the partitions. Those PVCs are basically holding redundant information that is captured elsewhere, and that's why Malik was able to scale these brokers up and down: the system does self-healing, basically, to make sure there's no data loss. Have you had any experience using the KafkaTopic operator? I mean, does it let you actually instantiate new topics through the kubectl API? Is that how that works? The topic operator? Yes. So, as you saw, we set auto topic creation to false, and now the only way to create new topics is through the topic operator: you provide it with a KafkaTopic custom resource. Can we show them? They're internal topic names — it's our internal data, but there's nothing sensitive; what he would show you is a flow of information from internal systems. Some of those are things like image build systems, various types of build logs, and so on. So you basically provide it with a custom resource of the type KafkaTopic, and it's all documented in the Strimzi documentation. That's where we define things; for our busiest topic, for instance, we have increased the partition count to six, so the topic operator overrides your standard configuration. And it works well.
But one caveat: only a select group of users can use the KafkaTopic operator, because it needs a special role binding of the type strimzi-admin. Again, this is all the domain of Strimzi, which is sort of a meta-operator, if you will, for these different Kafka components. So the KafkaTopic, as well as the KafkaUser, is an abstraction provided by Strimzi. Any questions? Thank you. AMQ Streams is the Red Hat product; it's a supported distribution of Strimzi. AMQP is a whole other thing, and then there's ActiveMQ — those letters are a little overloaded. Strimzi is open source; if you were doing a production deployment and needed support for it, you'd get AMQ Streams, and the custom resources are the same. Thank you.
So, do you want me to start, or do you want to wait for a few more? I'll start. I usually never get a call, and then I get a spam call right before my presentation. That's nice; at least someone calls me. Okay. Hey, I'm Prasant. I'm a senior engineer slash data scientist slash — sometimes I don't know what I'm doing — engineer on the AI Center of Excellence at Red Hat. As you see from the slide, it's machine learning for developers and QE. You can tell from the title I'm going to start rambling on and on about machine learning, but why developers and QE? Why disturb the poor souls and not let them live in peace? So, what do I mean by machine learning? If you walk down the hall — not that hall, nobody's there — and ask someone what machine learning is, you'll get n number of definitions. And God forbid, don't ask a data scientist what machine learning is; they'll confuse you even more. So here I'll use machine learning as an overarching term and relate it to anything related to statistics: analytics, anything you use with data to try to get meaningful information out of it. And coming back to why developers and QE: there are several personas out there — users, data scientists, rocket scientists, the mighty Thor, even Ant-Man. They all have data. What do you do with it, and what do you do to get meaningful insights into that data? Machine learning is one popular technique for that.
And it lets you get meaningful insights not only into the data you have, but also lets you foresee the data you will have; in more technical terms, that's prediction. So I'm going to focus specifically on developers and QE, and explain certain use cases specific to development and QE. So let's see; I forgot what I practiced. But when you want to learn a new language, or when you're introduced to something new, you always look for a hello-world example. Well, is there a hello-world example for machine learning? The field is too broad to ask for that, but think of this as a hello-world presentation, or a hello-world template. I'm going to show a tool, a framework, that lets you turn pre-built machine learning algorithms into a service, and provides an easy interface for you to access these machine learning models through the service and play with your data. And if you want to move further and advance the code, like tailor it to your own...