Hello everyone, we didn't expect this, but it really happened. My name is Katya Doreyha and I'm a principal software engineer at Red Hat. Today we're going to talk about how we wanted to implement a machine learning model and ended up going in a direction that is simpler and statistical.

This is the agenda for today's talk: we're going to talk about some possible challenges you might face when training a model, and then, even after you have a model trained, you still have a long way to go to bring that model to production. We're going to come back to this slide at the end of the talk.

If you've been following AI news and papers in the last five years, you might have noticed that there's a lot of information about the algorithms themselves, but way less emphasis on the data. In the last year the emphasis has started to shift, because a model is really the combination of the algorithm and the data, yet most of the available material is still about algorithms. The same algorithm with different data will perform differently and solve different problems. So we don't want to talk about the algorithms themselves, because there's a lot of information about that on the Internet; we want to focus on the challenges most people face when trying to bring machine learning to production, and we are going to tell you about our experience.

First, a bit of context. We are part of OpenShift, an enterprise solution by Red Hat for operating containers; probably most of you know it. What we do is gather a bunch of data from our customers' OpenShift clusters, process that data, and generate graphs, notifications, emails, and recommendations for the customers to better manage their clusters. The last project we participated in is update risk predictions, which is the project we are going to talk about today. It basically predicts whether an OpenShift update will break the cluster or not. That's the point.

Before starting with a machine learning model, you first need to make sure that you understand the problem you are going to solve; otherwise you will be completely lost. So let's talk about the problem statement first. The first thing is to identify the actual problem your model will try to solve. Operating any system has a cost, but getting that system out of a failed state costs even more, because you have emergencies and you have users who are upset with you. It's usually very expensive for the brand, in money and in time.

These are some examples of real-life failures. Imagine having to deal with a water leak on a Monday morning when you need to go to work, or an airplane engine failure during takeoff. Another example, I'm not sure I found the right icon, but this was supposed to be a tsunami or an earthquake in a highly populated area. This one is an example of a bridge collapse, because there were too many cars on it or it was poorly maintained. All of these can be mitigated or monitored. For example, airplanes usually have regular inspections for everything: engines, landing gear, and so on. The same goes for bridges. For earthquakes and tsunamis we have monitoring: scientists or public agencies will send out warnings when they detect tremors.
So it's way cheaper to try to prevent a failure than to deal with the outcome of that failure. It can be expensive in money, and as we saw with the airplane, earthquake, and bridge collapse examples, it can even cost lives. We are not dealing with the human life factor, but we do try to save money for our customers.

Some examples closer to home, since this is DevConf. Imagine a service failing to serve requests because it's out of resources. You can mitigate it by setting up monitoring and autoscaling that will scale it up and down automatically. Another example: you try to deploy a new version of your software and late in the process you notice that some dependencies fail. You can use tests, end-to-end tests in CI/CD, to detect that sooner, or you can use a staging environment, or all of the above if you really want to prevent these kinds of failures.

Now let's think about modern cars. Modern cars are a combination of hardware and software, increasingly so; the example I think everyone knows is Tesla. They have hundreds of sensors in those cars measuring everything, and those cars send a lot of telemetry back to the manufacturer. Examples of those sensors: they can measure the state of the battery, the state of the brakes, the tires, whether there are any oil leaks, what the oil level is, and so on. So there is a lot of information.

We probably have people from different countries here, and probably in every country there is a car checkup, where you bring your car to some government-approved agency that checks it and tells you whether you're allowed on the streets or not. Usually it's once a year or once every four years, depending on the age of the car, at least in Spain. The way that car checkup goes is: you take half a day off work, you drive there, you wait there for two hours, true story, then you go through the checkup, and at the end of it the engineer tells you that your brakes are not really working, you have a problem with the discs, or you have bald tires. After that you take your car to the mechanic, you fix it, and then you go for that checkup again. So basically you have days of your time wasted that could have been saved if there was proper monitoring for the car. What if we could do something different? What if your car were able to tell you whether it's going to fail the checkup, and what you need to fix, before you actually go for that checkup?

Or, closer to our case, what Juan was showing: in our application we wanted to predict upgrade failures for our software. After an upgrade has failed it's usually an emergency, especially if it affects customer workloads, so a lot of people need to jump in and spend a lot of time. What if we could predict those upgrade failures?

Now that you understand the problem we are trying to solve: every machine learning project depends on the dataset you have. A typical dataset for a machine learning project looks like this. This is a textbook example about predicting the sale price of a house based on the size of the house and the number of bedrooms. If any of you have tried this problem: in machine learning you have features and labels. The features are the data that you have in real life, and the label is the data that you want to predict.
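To make the features-and-label idea concrete, here is a minimal sketch of the house-price example. The library choice (scikit-learn) and the numbers are our own illustration, not something from the talk or the actual project:

```python
# Hypothetical toy example: features vs. label for house-price prediction.
from sklearn.linear_model import LinearRegression

# Features: what you can observe in real life (size in m², number of bedrooms).
X = [
    [50, 1],
    [80, 2],
    [120, 3],
    [200, 4],
]
# Label: what you want to predict (the sale price).
y = [150_000, 230_000, 340_000, 520_000]

model = LinearRegression().fit(X, y)
print(model.predict([[100, 2]]))  # predicted price for an unseen house
```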
In our case, we have a bunch of alerts and issues, and we want to predict whether they will make the cluster fail or not. Apart from the problem statement and the dataset, make sure that you are bringing some business value to the company; otherwise you are just spending resources and money without making anything valuable out of it. These are the three things to check before starting down the machine learning path, and that is just the beginning. You identify that you have a problem, that you have some data to train on, and that it's probably worth at least researching how to solve it. After that you get a go-ahead from someone in your company saying, okay, we are going to spend some time solving that issue. Then you have to train a model that actually solves the users' problem, and then you still need to bring it to production. There was a talk earlier in this room showing that, when productizing, the code for the model itself is something like 10% of the whole effort.

The first thing, before you actually write any code, well, you can play around with some EDA (exploratory data analysis), but we really recommend thinking about what success means to you. What is the definition of success for this project? We think it's really important to get your business people, your customer-facing people, your data scientists, your engineers, anyone who has to participate to make the project successful, into the same room and not let them out before they agree on the acceptance criteria for the project. And bring pizza, because it will take a lot of time. Then you have to agree on those acceptance criteria.

As you have probably heard, AI and ML are all the hype right now, but under the hood it is essentially just math. It needs optimization mechanisms and metrics so the model, or the algorithm, can tell a worse outcome from a better one and use that for the next training iteration. We're not going to talk too much about metrics because they deserve their own talk, but in general your choice of metric will depend on the problem you're trying to solve and the data you actually have. In our case, we have way more upgrade successes than upgrade failures, but we also don't really care about predicting upgrade successes, because, well, it's like with a health check: who cares when it passes, we really want to know when it fails. So we have an anomaly-detection type of problem: we really want to identify upgrade failures.

What we really care about is minimizing false positives, and we somewhat care about minimizing false negatives. I know these are fancy statistics words, but here is what they mean in our case, where we're trying to predict an upgrade failure: a false positive is when we predict that the upgrade will fail but it actually succeeds, and a false negative is the other way around, we predict that the upgrade will succeed but it actually fails. The reason we really care about false positives is that if we predict the upgrade will fail but it actually succeeds, it can hurt our brand and the customers' perception of the product. You see in your portal that the upgrade is going to fail; it's our product, so we're kind of responsible for how upgrades of that product go.
So it might hurt us as a brand, as a company, and the other reason is it might discourage customers from upgrading to the latest version, and we really want them to be on the latest version. The second one, false negatives, we still kind of care about, but with Kubernetes we have accepted that catching everything is not always possible. A false negative is when we predict that the upgrade will succeed but it actually fails. We cannot cover all the configurations people use for Kubernetes: they can use different cloud platforms, they can be on-prem, we cannot control the DNS setup they have, so we cannot predict 100% of the failures. We accepted that, and instead we try to be really precise. I also put the official names for each of those metrics here: we really care about precision and we somewhat care about recall. If you have taken statistics classes before, you might know that they are related to type I and type II errors.

Do we have AI and ML enthusiasts here? We were debating whether to include this slide, because we try to avoid fancy long formulas and keep things high level, but essentially this formula expresses exactly that: we care about precision more than recall. By choosing the coefficient 0.5 we specify that we care about precision twice as much as about recall (there is a small sketch of this metric below). So this is our chosen metric for this project. It's always better to have one single metric to guide your project, because you cannot train a model on several metrics at once; it needs to know what to optimize for.

The next step is to create some sort of baseline. It can be a crude heuristic, a script based on some assumptions; it's not machine learning yet, but you are identifying the floor and the ceiling of what is actually possible. You can also use humans, like human labelers, to see what is possible. One example we've all seen before is reCAPTCHAs; I last saw one a couple of days ago with ChatGPT, where they ask you to identify what you see: find all the bicycles, boats, whatever. But even for humans it's not always possible to identify what is in the picture, so 100% accuracy is not always achievable, and expecting 100% accuracy from your machine learning model is usually not realistic either. Yeah, it took me a while.

The reason you create a baseline is, again, to establish the bounds, the ceiling and the floor of what is possible. You examine your data so you know what is in there instead of just shoving it into some algorithm, and, second, it makes you talk with the subject matter experts, so they can guide you: okay, but based on our experience this doesn't really make sense. You build intuition while working through the baseline. After you have the baseline you can start experimenting with automated solutions, with machine learning models, either pre-trained or taking an algorithm and training it on your own data, trying to approximate human-level performance or beat the baseline, and after that you keep iterating. Throughout all these steps it's important to keep the real problem you're trying to solve in mind, because most machine learning projects take a couple of years from the moment you have the idea to the moment you bring it to production.
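Going back to the metric mentioned above: the standard F-beta score is Fβ = (1 + β²) · precision · recall / (β² · precision + recall), and with β = 0.5 precision is weighted more heavily than recall. Here is a minimal sketch of computing it; the labels are made-up toy values and scikit-learn is our choice of library, not necessarily what the project uses:

```python
# Toy illustration of the metric choice: precision, recall, and F0.5,
# which weights precision more heavily than recall (beta = 0.5).
from sklearn.metrics import precision_score, recall_score, fbeta_score

# 1 = "upgrade fails", 0 = "upgrade succeeds" (made-up labels)
y_true = [0, 0, 1, 1, 0, 1, 0, 0]   # what actually happened
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]   # what the model predicted

precision = precision_score(y_true, y_pred)     # penalizes false positives
recall = recall_score(y_true, y_pred)           # penalizes false negatives
f_half = fbeta_score(y_true, y_pred, beta=0.5)  # single guiding metric

print(f"precision={precision:.2f} recall={recall:.2f} F0.5={f_half:.2f}")
```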
Usually your understanding of the goal will evolve, but it's important to keep talking to the business folks, to someone who works close to the customer, so that it still meets what they expect from your solution. And throughout all of this: iterate, go back to previous steps, pivot if you need to; it can be a long road. Remember, OpenAI didn't stop at GPT-2; they went on to GPT-3 and then the whole hype of the last year, I think it was November, when they finally delivered something that excited millions of people.

You may be asking why we are telling you this, given that the title of the talk is about not using machine learning and using something simpler instead. Well, everything we have talked about in the past minutes are things we learned through issues, so now let's talk about the issues we had.

We created a baseline, we trained some machine learning models, we defined our metrics, so we did all the homework, but what we found out is that all of them performed roughly the same. Our simple heuristic had similar accuracy to the decision trees and to the large language models, so we had to find out why. While we were chasing the reasons, since our heuristic worked and we wanted to iterate fast, we decided to go to production with it as a preview. It's still production, but as a preview feature, and we went with a simpler statistical model instead of a machine learning model, because it still automates things and brings value, even though it's not as fancy as we wanted.

Now, why would many machine learning models all perform the same on your data? The reason is usually the data that you have, and we are currently chasing several theories. One theory is that maybe the data we trained on doesn't have enough predictive power. If you fail a car checkup because your brake discs are worn out but you don't have a sensor on the brake discs, you will not be able to predict that failure. So we are currently chasing additional signal, probably metrics in this case, and performing some testing. Another one is that maybe the data is either not complex enough or not denoised enough. Usually, for machine learning models to get an edge over a simpler solution, you need lots of data, and it needs to be complex enough for the model to detect more subtle patterns. So we are denoising the dataset and getting more data over a longer time frame to see how it impacts performance.

Another lead that we have ('lead' is the detective word) is that maybe the labels we have are incorrect. With a failed car checkup it's very boolean: the engineer says, okay, you failed because we performed these 20 tests and this one failed. But how many of you have worked with Kubernetes before? As you know, with Kubernetes and its self-healing architecture, it will try to reconcile for a long time. Technically, in Kubernetes there is no concept of failure unless you specifically define it; it's more like a lack of success for a certain amount of time. In our case we decided to treat an upgrade as failed if we didn't see it succeed within 8 hours (there is a small sketch of this rule below). But there might be multiple reasons why an upgrade takes more than those 8 hours: for example, huge clusters with a lot of worker nodes, or the customer may have set up disruption budgets that prevent nodes from draining. So a perfectly healthy cluster configuration can prevent the upgrade from succeeding in time. We are currently inspecting whether this classification into upgrade failure and upgrade success is correct.
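As a rough sketch of that threshold-based labeling rule; the function and field names here are hypothetical, the real pipeline may look different:

```python
# Hypothetical sketch: label an upgrade as failed if no success is observed
# within a fixed window after it starts.
from datetime import datetime, timedelta
from typing import Optional

FAILURE_THRESHOLD = timedelta(hours=8)  # the window we are now questioning

def label_upgrade(started_at: datetime, succeeded_at: Optional[datetime]) -> str:
    """Return 'success' or 'failure' for one upgrade attempt."""
    if succeeded_at is not None and succeeded_at - started_at <= FAILURE_THRESHOLD:
        return "success"
    # No success inside the window: could be a real failure, but also a huge
    # cluster or a disruption budget blocking node drains, so the label may lie.
    return "failure"

print(label_upgrade(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 13, 0)))  # success
print(label_upgrade(datetime(2024, 1, 1, 10, 0), None))                          # failure
```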
Maybe we should lower the threshold to 4 hours. The recommendation from our upgrade engineers is that they expect the control plane to upgrade within 2 hours, so maybe lowering it to 4 or even 2 hours is a good idea. Another idea we are investigating is that maybe we should rather distinguish upgrades that succeed without human intervention from upgrades that require some operator to go and kick the tires. This would probably help accuracy, but how do we identify that in the data? Customers are not going to come to us and say, well, I went in and fixed it manually. So it's going to be more difficult to create a proper dataset for that, but we are going to try it out. The last one is that in Kubernetes some errors are expected by design. In the metrics you will see, for example, that during some restarts the kube API returns 404s, and this is a normal situation during restarts. We usually call these flapping alerts or metrics: something fires for a few minutes and then resolves itself. So it's important to either have an algorithm that is smart enough to ignore those fluctuations, or to pre-clean the dataset to exclude them.

So yes, we launched with a rule-based model instead of a machine learning model, and we are not ashamed of it. It was a nice path and we learned a lot of things. We also didn't give up on machine learning entirely; we are continuing the studies and hope to have a proper model working in the near future. Instead of stopping we kept moving and put this model in production, even though it's not perfect. We now have everything in production ready for plugging in a machine learning model whenever it's ready.

Now I'm going to talk about the productization challenges we ran into along this path. I'm just going to share some advice; there were many other things, but let's start with communication. This is not just for machine learning, it applies to any kind of project. If you have many groups working on the same project, many of them will have issues, blockers, and timing constraints, and you have to coordinate all of them in some way. If every team coordinates by itself it can be a complete mess: you keep adding communication channels and end up with a channel between every pair of teams; it's a simple formula. So you should have a coordinator, call them program manager, product manager, or whatever. This person is really important because they are responsible for meeting deadlines, for making sure that when a team commits to a deadline it is met, and for running weekly status meetings so that everyone understands the progress, the blockers, and what needs to be done.
A good way of making sure that everyone understands what needs to be done is with API specs: it's really nice to have an OpenAPI spec beforehand, before creating the integration or developing the microservice. You can also just use a shared document, put the architecture there, and start reviewing it from there. And there are BDD tests, about which there is a talk happening right now, so I'm really sorry about that: it's basically writing things down in plain English and then writing some Python magic in the background so that everything is automated. It's a really nice way of making sure that others understand what you are working on.

The second suggestion: in any project you need to choose between speed and perfection. We decided to choose speed, and it's really worth it if you are trying to compete with other companies; you need to be faster than them, and perfection can be reached later, you just need to keep iterating. If this feature never gets any interest from our customers, fine, we only spent a couple of months on it; we released it quickly because we work with a DevOps mindset. We use internal Red Hat tooling that lets us define dashboards, microservices, and databases with just a couple of lines in a Git workflow; in the background it uses Terraform, GitHub, and OpenShift. It's interesting that it uses OpenShift, because we are working on OpenShift ourselves, so it's a closed loop. Another really nice tool we are using is ephemeral environments. It's a way of spinning up OpenShift clusters just for you, not customer-facing: you spin up an ephemeral cluster, for example for just an hour, copy the production or stage environment into it, connect it to your machine, and then you can start coding and testing your changes there. This is useful for testing things locally, but it's even more useful for CI/CD: all of our tests run on these clusters, so we are safer deploying to production.

The last piece of advice I want to give you is about scalability: if you don't want to have to rework the service in the future, you need to think about scalability beforehand. What we do is put a proxy between the customer and our solution. The proxy is just a simple microservice that takes the user request and adds authentication or whatever mechanism you want, but the real stuff is in the background: first data engineering, which gathers data for enrichment and prepares the data for the model, because each model may need a different input, and then the inference part, where we put the model. Right now it's running a statistical model, but it's ready to run any machine learning model. The good thing about this architecture is that everything is stateless, so you can scale it however you want: if the number of users increases a lot, you can scale the inference part or the data engineering part, and if you want to go with a huge model, like ChatGPT again, you may need a lot of resources, so you may need to scale the inference part. It's very convenient to have something like this to monitor and to scale.

So, wrapping up: before starting a machine learning project you need to make sure that your problem statement is clear, that you understand everything that needs to be done, that you have the data for training the model, and that you are bringing some business value to the company. And for the productization part, again, you need to make sure that all of the
teams involved understand and share the same priorities. Remember the trade-off between speed and perfection: it's better to go to market faster than your competitors with something that works, and keep iterating so that you can reach perfection later. And the same goes for scalability: make sure you have the architecture in place to be able to scale; you don't have to implement it all right from the start. Thank you for your attention. And now, questions.

Question: How many rules did you end up with?

Answer: Roughly three lines of code. That's the logic; the statistical model is a somewhat longer piece of code, but behind it there were weeks and months of talking with subject matter experts and exploring the data. It's very cheap to operate, though, and we're not going to stop there.

Question: I'm facing a similar problem, so to say, and we're now starting to implement it, so I wanted to ask how much time you invested in the machine learning model before deciding to move on with the simpler approach.

Answer: We were not alone in this; we were collaborating with an IBM Research team, so the real data scientists were there. We are engineers, we dabble in data science, but we come from an engineering background. They worked for two or three months trying different things. They tried decision trees, because one of the requirements was that this has to be explainable, we need to tell why exactly, and that's not always possible with machine learning models; decision trees are at least kind of explainable. They also tried traditional models and large language models, kind of like GPT. So two or three months, and then in February we made the decision to do the first release with the simpler model while we keep working in the background.

Question: You mentioned the training set was not large; can you quantify how big the dataset was?

Answer: We took the data for the last year. The problem, or, if you talk to OpenShift engineering they will say it's not a problem, is that not so many upgrades fail. We had less than a thousand examples of failure, so it was highly imbalanced. We tried to balance it out by excluding successes or by re-classifying; if we lower the threshold from 8 hours to 2 hours we will have more failures, but we are not sure that classification is correct.

Question: Was there some recommendation, like 10,000 examples would be good enough?

Answer: We didn't do that analysis.

Question: Would simulating failures be a way to increase the size of the dataset?

Answer: It's a chicken-and-egg problem: we would need to know what causes failures in Kubernetes. There are thousands of metrics and hundreds of alerts, so we ourselves don't know exactly what can cause an upgrade to fail. Data augmentation is possible, but we need to think really carefully so that we don't synthesize the data incorrectly.

Question: What was the reaction of the manager to how you solved it?

Answer: The manager is here, actually. From the manager's perspective, we still treated it as a success story, because we learned a lot, we learned a lot about the data, we found the trade-off of perfection versus speed, and we have several leads to make it work. So we hope that at the next DevConf the talk will be about how we finally switched to machine learning. Thank you.