Okay. All right. Well, good morning. Thank you to those of you who've joined me here today. We're going to talk about AI and AIOps this morning. We're having a little bit of fun with it, so I'd like this to be as interactive as possible. There'll be an opportunity at the end for you to offer your own feedback and your own thoughts, and for those of you with some interesting ideas, I've got a few hats to throw at you. So if I can keep you awake for the next half an hour, hopefully I'm doing well and we can move this through.

So who am I? I'm the global field CTO for Mirantis. I've been suffering with technology for the last 20-odd years, building enterprise systems, everything from public cloud solutions to helping companies design financial systems. My colleague Nick Chase, who's not with us this morning, helped me put together the slides and the deck we're going to use today. He is more of an expert on this topic than I can ever hope to be, but I'll try my best to do it justice.

So we're going to talk a little bit about the current scenario in operations today, and the problem we're trying to solve. A little bit of background: what is AIOps, and what does it mean to us? Then some of the use cases, and I'd love to hear from you when we talk about these use cases and where you see this going. I've obviously got strong opinions on this; I'm sure everybody else does. Then the limitations (TL;DR: we're not quite there yet), and finally what we can do now to prepare the way for implementing effective AIOps as we go into the future.

So, the current scenario as we see it: what is operations today? What are we looking at right now? For most of us, it's some combination of humans at keyboards, working shifts, dealing with problems as they arise. We are seeing more and more intelligent automation come in. Companies like ours, and many of yours, are implementing automation as deeply and as widely as possible, but operations is still highly, highly reliant on humans delivering the work. And obviously humans are fallible. We all know that. Even more so than the machines we look after for a living.

The problems we're dealing with on a day-to-day basis shouldn't be new to anybody here. If you've been in this OpenStack world for a while, you know them: network disconnects, maintenance updates, software bugs, software bugs, and more software bugs. That's how most of us here make our living. Stupid things like certificate expiries: how do you renew a certificate in time? Security breaches, the obvious things. That is the state of operations today.

Part of the problem is that we're so reliant on people being available to deal with these challenges when they arise. I look at companies trying to deliver operations as a managed service, and at the challenge our customers and the people I talk to in the industry have just keeping staff available, keeping staff on 24-hour shifts. One rule of thumb we see: to cover a single 24-hour shift, 365 days a year, for one SME, one subject matter expert, I need seven individuals. That's roughly 8,760 hours a year against maybe 1,800 working hours per person, plus enough slack to give people time off and holidays and not have them working 12-, 13-, 14-hour shifts, because the more shifts you work, the more tired you are, and the more mistakes you make.

ITIL. How many of you here have dealt with ITIL in your lives? Show of hands, because ITIL is the bane of my life. I mean, I've been dealing with it since the early days.
This is all the stuff we have to take into account when we're talking about delivering operations services. So what would be nice? Where could we take this? The dream vision is that we automate, using intelligent automation, AIOps, or machine learning, all of these tasks that quite frankly are fairly routine but today have to be handled by humans. The more we can take humans out of routine day-to-day operations, the more those smart, intelligent people can add value to the business. That, as I see it, is the key value of the move towards intelligent automation, AIOps, and machine learning in our operations environments. Obviously there are many, many other areas where AI can start to bring value to us. We won't get into code generation or all of those other options. Let's just focus now on what AIOps actually is to us.

So what is AIOps? At the end of the day, operations is just actions that we take, reactions to events. The real question is: how is the decision made about what to do and when? It's very easy for us as humans, who have been training at this for 30-odd years, who have degrees in engineering, who have broken these systems multiple times and learned from our failures. But how do we teach a machine to do that? Today, pretty much all of these decisions are made by humans. Even when we code a response manually as a series of steps, that coding is done by humans, which means it's fixed by and large, or it involves a human in the cycle: at some point, the system stops and asks for input. We want to get away from that. We want to get to a position where the machines can make the decisions on our behalf, so that we're not getting that text message or that Slack ping or that phone call at three o'clock in the morning, and so we don't have to have people working at three o'clock in the morning, away from their families and children and the real world.

So the question is: what is AI versus ML? Technically there are several categories, but today I'm going to talk about two: weak AI and strong AI. Weak AI is just intelligent behavior, modeled. It's if-then statements, maybe a couple of case statements, for those of you who can remember writing case statements. We're modeling a response to a predetermined set of problems. Strong AI is where machines start being able to make decisions in their own right. Now, I'm grossly simplifying this for the purposes of today, but strong AI is still fairly hypothetical. We don't really have true strong AI in the industry at all yet; it's being developed. Siri, Alexa, all of those things are weak AIs: programmed responses to a certain predetermined set of information.

Now, machine learning is a subset of AI, a feeder into more advanced AI models. The big thing about machine learning is that we take a set of data and we provide a formula; we're not letting the machine determine a formula for us at this stage. From that we build a trend, and that trend allows us to make decisions based on the data. That's the progression we're talking about.
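To make that concrete, here's a minimal sketch of the idea; the metric name and numbers are invented for illustration. We take a set of observations, we provide the formula (here, the assertion that the trend is a straight line), and the machine fits a trend we can make a decision from.

```python
import numpy as np

# Hypothetical hourly samples of some metric, e.g. memory use in GB.
hours = np.array([0, 1, 2, 3, 4, 5, 6, 7], dtype=float)
usage = np.array([10.2, 10.9, 11.4, 12.1, 12.8, 13.3, 14.1, 14.6])

# "Providing the formula": we assert the trend is a straight line and
# let ordinary least squares find the slope and intercept.
slope, intercept = np.polyfit(hours, usage, deg=1)

# The trend lets us make a decision based on the data: project 24 hours
# ahead and compare against whatever limit the environment has.
projected = slope * (hours[-1] + 24) + intercept
print(f"trend: {slope:.2f} GB/hour, projected in 24h: {projected:.1f} GB")
```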
So AI is literally machine intelligence: machines trying to mimic what we as humans do, the cognitive capability we bring when we make a call on something. Today these systems are typically considered expert systems. We train them to answer specific questions, the Siri or Alexa type of examples, and there are others. And we are reliant on training them with fallible human logic, the basic syllogism: Sean is human, humans are mortal, therefore Sean is mortal. My liver tells me that, definitely. Sometimes I feel more immortal than I am, but I'm not. And as I mentioned, AI is a superset of ML, and today we're going to focus more on ML, to try to understand how we can take ML forward in our thinking as we go ahead.

So, as I mentioned, machine learning is the determination of a trend line in data, so that we can have computers make decisions on our behalf. We look at a data set, we say there are fairly standardized trends in it, we apply a formula to the input data, and we get a graph. Most of the data we deal with today, most of what we make alerting decisions on, is straight-line trends: we know that input A plus input B gives response C. Do we need machine learning for that? Probably not, because we already know that A plus B equals C. Where it starts to get more interesting is with more complex data sets, where more complex formulas based on more complex inputs can change the graph, and where we start compounding information across multiple different systems. That's where machine learning gets a lot more interesting to us.

So if we look at AI in the operations world we're all in today, it's the correlation component. It's dealing with small, tiny changes that a human would not see, because we just can't hold that much information in our heads, but that a machine, over time, can be taught to pick out, bring back to the formula, and apply over a broader spectrum of information. That allows us to make a prediction of what could go wrong. Now, obviously, if we can't anticipate what might go wrong, we can't pick the correct data or build a formula to predict it, and we probably can't train a machine to do it either. And this is where we get into the problems of what we're trying to achieve today.

So how do we do this? In typical machine learning, and I say typical deliberately, we pick an algorithm, we pick something we're looking for; we have to have an endpoint in mind. We take that algorithm and we train the model on predetermined, fully understood training data. Now, this is where it can get really interesting, because once again the human factor comes in: we might think that data is fully understood, but it can contain anomalies that make a mess of our model. We then have to test that model by feeding in data we know is either good or bad, data that should lead to a specific result. The problem, once again: how sure are we about that data? The final step, and I'll talk in a second about how we apply this in the real world, is to take that model and, at least to start with, just alert. The model essentially gives us an output, an event, an alert, which we can then action. Those actions are the next generation for us. Right now, the actions we take out of our ML models are really just predetermined: yes, the disk is going to run out in 24 hours, let's add more disk.
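Here's a toy version of that loop, with invented feature names and synthetic data, not any real product's pipeline: pick an algorithm, train on data we believe we fully understand, test against held-back data with known outcomes, and then let the model do nothing but raise an alert.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Invented training data: rows of [iops, queue_depth, error_rate],
# labelled 1 for samples we *believe* preceded an incident, else 0.
rng = np.random.default_rng(0)
healthy = rng.normal([200, 4, 0.01], [30, 1, 0.005], size=(200, 3))
failing = rng.normal([80, 30, 0.20], [20, 5, 0.050], size=(200, 3))
X = np.vstack([healthy, failing])
y = np.array([0] * 200 + [1] * 200)

# Train on "fully understood" data; hold some back to test the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy on known-good/known-bad test data:", model.score(X_test, y_test))

# Final step: the model's only action is an alert event, not a fix.
sample = [[90, 25, 0.15]]
if model.predict(sample)[0] == 1:
    print("ALERT: looks like pre-incident behaviour; the action is predetermined")
```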
We haven't yet reached the point where we can have machines decide, well, let's not add more disk, let's remove some data instead. That's a much more complex decision tree that we'd like to get to with the machines, but right now we're focused on the initial stage: let's just find the problems.

In the real world, it's much the same. To repeat the previous slide, said a little differently: we take that algorithm, and by now we've gone through the process of building and modifying it probably a hundred thousand times, going backwards and forwards, constantly iterating. Then we train the model. A company can deliver you an AIOps or MLOps solution based on a formula they've trained on their data, but they've got to bring it into your specific environment and train it on your specific data sets. There's no such thing as a silver bullet, no "here's the ML tool for everyone and it'll start working tomorrow morning." Frankly, if somebody walks in and says, here's an ML tool for monitoring your network and it'll be working for you by tomorrow morning, I would be super skeptical, because the reality is that every single environment is different. There are some great companies out there doing really interesting things with machine learning, but all of the good ones, when they come on board, need a six-to-twelve-week training period before they will commit to any results. We then have to test the model against data that's known good or known bad within your environment. So how do you take your environment and actually simulate an outage? You've got to go and create an outage to test your model. That can be quite disruptive, and you have to be ready for it. And then, once we've got a working model, we've got to constantly monitor it, check it, and make sure it's responding the way we expect it to respond.

Okay, so digging into some of these use cases. Ultimately, the way we're looking at it, the use cases are built around three core domains: detection, prevention, and remediation.

From a detection point of view, that's a lot of what I've been talking about: how do I find out what's wrong with my system? What am I trying to look for? What are my formulas focusing on within my environment? It's everything I mentioned earlier, those network outages. Say I have a network failure in my branch office, just packet loss, and whenever that happens, I notice a slowdown in application processing in the data center. What's the correlation between those two? Is there even a correlation? How do I look across this whole domain to make sure I'm getting an effective answer that I can respond to? That's the detection component. The other big area around detection is the aggregation of smaller issues: looking at issues within a much more complex system, Kubernetes logs, OpenStack logs. Just look at the Nova API logs in verbose mode, the sheer volume of what's going on there, and try finding the little nuggets you can correlate in an effective way over the last 48 hours without building a massive Elasticsearch solution. It's about bringing all of that together in an intelligent way so that we can respond to these things.
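As a sketch of that branch-office question, here's one simple way to ask "is there a correlation, and at what delay?" The two series are synthetic and the thresholds invented; in practice the inputs would come from your monitoring stack.

```python
import numpy as np

# Invented time series: per-minute branch-office packet loss (%) and
# data-centre request latency (ms), 600 minutes of each.
rng = np.random.default_rng(1)
packet_loss = np.clip(rng.normal(0.5, 0.3, 600), 0, None)
packet_loss[300:330] += 4.0                 # a 30-minute loss spike
latency = rng.normal(40, 5, 600)
latency[305:335] += 60.0                    # a slowdown, ~5 minutes later

# Slide one series against the other and keep the lag with the
# strongest Pearson correlation: correlation plus the delay at which
# the two systems appear to move together.
best_lag, best_r = 0, 0.0
for lag in range(0, 30):
    r = np.corrcoef(packet_loss[:600 - lag], latency[lag:])[0, 1]
    if abs(r) > abs(best_r):
        best_lag, best_r = lag, r
print(f"strongest correlation r={best_r:.2f} at a lag of {best_lag} minutes")
```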
One of the most interesting, and probably the most current, applications of real-world machine learning and AIOps is in the intrusion detection space. There are some great companies out there doing really interesting stuff, building solutions that look at all the data within a network environment, correlate it, and examine the patterns to determine whether there's a security intrusion. It's a non-traditional way of looking at intrusion. It's not just watching for someone sending a certain series of packets very quickly, doing a port scan; it's asking whether the pattern of standard traffic within the network has changed. Now things start getting really interesting, because if I can take that and respond in real time, throttling that particular source or even cutting it off if I need to, I can respond far faster and far better, and prevent intrusions. Same with the packet loss example: if I can determine that the packet loss is actually caused by a Kubernetes app that's failing to communicate correctly and slowing down my database engine, I can kill that process and speed everything else up. That's the beauty of where we can go with this.

And that's where we start talking about prevention. Prevention is maintenance, the general maintenance we all do every day. Our operations guys go into big OpenStack clusters and check: do we have enough database table space left? Do we have enough storage space left? Are the logs filling up the drives? Are we hitting limits on connections? All those things are easily changed if we know they're happening, but they also consume resources if we just turn them all the way up all the time. With prediction, we can be much more dynamic about our resource utilization. Workloads, too. We've all heard about the dream of really using spot instances on Amazon, but the reality of spot instances is that unless you have something intelligent dealing with them, most of the time they don't save you very much. If you can intelligently determine when spot instances are going to be available, using prediction data from ML, you can start extracting real value from them.

Remediation is the area where, long term, I feel most of the work we need to do is going to be. The detection side is data science; it's a known problem. We've been doing it for a long time, most of us understand that you can correlate data, and most of the intelligent work being done in machine learning right now is happening in that space. Remediation is where I think the biggest and most interesting work is going to be done. And I say "going to" because I don't personally know of anybody doing much in remediation, true machine-learning-based or AI-based remediation. We want to get to the point where we can kill unused workloads and find alternate traffic routes. More importantly, we want to get to the point where we're not reliant on a human being in the path of the decision-making process, or at least where the human is just an approver for that process. As the way of getting there, the way we're looking at building this out is that humans will approve decisions until such time as we trust the machine. We could get into long stories about ghosts in the machine here, but ultimately that's where we're going.
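A minimal sketch of that trust progression, with every name invented for illustration: proposed fixes come out of a model with a confidence score, and only action types we've come to trust, proposed with high confidence, run automatically. Everything else goes to a human approver.

```python
# Action types we currently trust the machine to execute on its own;
# the set grows as our trust in the machine grows.
TRUSTED_ACTIONS = {"restart_pod"}
CONFIDENCE_FLOOR = 0.90

def handle(action: str, confidence: float, approve) -> str:
    """Execute a model-proposed remediation, or route it to a human approver."""
    if action in TRUSTED_ACTIONS and confidence >= CONFIDENCE_FLOOR:
        return f"auto-executed: {action}"
    # The human stays in the decision path, but only as an approver:
    # they say yes or no, they don't do the diagnosis.
    if approve(action):
        return f"human-approved, executed: {action}"
    return f"rejected by human: {action}"

# Usage: the on-call engineer becomes an approver, not the decision maker.
print(handle("restart_pod", 0.97, approve=lambda a: False))    # runs without asking
print(handle("delete_volume", 0.95, approve=lambda a: False))  # human says no
```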
Now, let's talk a bit about the limitations. I've mentioned some of these already. We have limited inputs, and we ourselves are a big part of the limitation; I'll get to that in a moment. But right now, what we need to train these models is an enormous amount of data, and that enormous amount of data is very hard for us to understand, very hard for us to identify all the parameters we have to deal with.

Network traffic, for example. I grew up in the networking world. I can remember when a large network was 40 or 50 nodes, and I'm not talking data center networks, I'm talking wide area networks. Data center networks are a different story. Nowadays, every cell phone is a network input; it's part of the network. How do I deal with that huge volume of different parameters and bring it into a system? We've got to get our heads around that. We have so many systems providing input: Prometheus, log systems, Kubernetes. For all of these, we have to work out which parameters matter so that we can train our models. We also need access to more in-depth information. Many of us at these events are infrastructure people; we think about everything from the point of view of infrastructure. But infrastructure is only a small part of what we need data from to effectively impact workload uptime. We need the information from inside the workloads too, tracing, as an example. And once we bring all of that in, we're talking about data sets we have to store. In some cases, we need bigger systems to run the AI and ML components than we use to run the workload itself. At what point does this start to become daft?

We are a problem, okay? We're trying to teach a machine to imitate us, and a lot of the time we can't even teach ourselves. We learn more over time, the same way we're trying to teach these machines, and as we learn, we add parameters. How do we teach the machines to add those parameters themselves? That's the big question, and that's the problem we need to solve. The volume of historical data needed is, quite frankly, one of the biggest challenges we're seeing right now in the work we're doing to train models. How do we deal with things like distributed learning? How do we bring that information together in a structured way? Again, there's some really interesting work being done; there are some great startups here in Germany very much focused on the distributed learning model, companies like Flower and others. Really cool work is happening in that space, and I'm very excited about what's coming. I'd love to hear from you about what you're seeing.

And of course, there are many things we haven't even thought about yet. Routines can't tell us everything, okay? Some things we as humans just infer, based on our cognitive capabilities. And we don't really know how AI works. We can't really explain it. We can look at the outputs, but in many cases we can't really decipher the neural networks; we don't really understand how these things come together or exactly what we're training. I'm sure you've all heard the stories about biases in neural networks. How do we deal with bias? We are inherently biased as humans. How do we, as biased individuals, train a machine we don't understand, based on our own intelligence that we don't understand, to remove bias? This is the biggest challenge between us and true AI as we move forward, and I'm talking about true AI here as opposed to machine learning with its simpler, less neural components. Ultimately, this is still a black box. That's why I threw that image in: we need to untangle the neurons. We need to try to understand where we're going, and we need to be very deterministic about it.

Causality, in particular, is still in its infancy. We're still trying to understand it; we can't necessarily explain why something causes something else. There is work being done in that space, libraries such as DoWhy, and more and more information is becoming available in the public domain. I think we're going to see this learning accelerate because of the fact that we can do it in the open source domain.
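As a flavour of that work, here is a minimal DoWhy sketch on synthetic data. The scenario and all column names are invented: we ask whether a misbehaving app actually causes higher database latency, or whether overall load drives both. (DoWhy's full workflow also includes a refutation step, omitted here for brevity.)

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Synthetic data: "load" confounds both whether the chatty app is
# running and the observed latency; the true causal effect is ~15 ms.
rng = np.random.default_rng(2)
n = 1000
load = rng.uniform(0, 1, n)                                # confounder
chatty_app = rng.uniform(0, 1, n) < (0.3 + 0.4 * load)     # treatment
latency = 20 + 30 * load + 15 * chatty_app + rng.normal(0, 3, n)
df = pd.DataFrame({"load": load, "chatty_app": chatty_app, "latency": latency})

# Model the causal question, identify the estimand, then estimate it,
# adjusting for the confounder rather than just reading off a correlation.
model = CausalModel(data=df, treatment="chatty_app",
                    outcome="latency", common_causes=["load"])
estimand = model.identify_effect()
estimate = model.estimate_effect(estimand,
                                 method_name="backdoor.linear_regression")
print(f"estimated causal effect on latency: {estimate.value:.1f} ms")
```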
To bring it back to why we're here: this causality component is key to remediation. If we cannot learn to understand causality, we won't be able to do remediation. Many of you know how difficult it is just to teach people to go through the process of proper troubleshooting; now we've got to train the machines to do it. That's the challenge we have for the future, and I think it's the most exciting thing we've got to do.

So what do we do now? What do we do to get to this hypothetical, blue-sky future where the machines are running everything and Skynet has arrived, hopefully without the killer robots? I'm going to keep this really simple. We need to start building automation that is as intelligent as possible, and we need to make that automation available as widely as possible. The same way we build tools in the open source domain, we need to build and share that automation so that everyone can have input into it. Let's build this automation in the public domain.

Let's not forget, though, that we need intelligent humans as part of the mix, and we need to understand how humans are going to interact with our automation. That's a core function here. If we cannot get the machines to include the humans, the machines can't learn from us. So as we build out these technologies, we have to focus on how we interact better with the meat puppets.

All of this is built around strong alerting, strong data collection, and creating an environment where we can safely collect all of this data and potentially anonymize it so others can make use of it. This is where things like distributed learning come in: we can learn in multiple environments, share the results of that learning, and then work off those results, as in the sketch below. The more we share that alerting information and that learning, the more we can analyze it as a collective, and the better we can continuously evolve.
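A conceptual sketch of that sharing, in the spirit of frameworks like Flower but not using any real framework's API: each environment trains on its own private data and shares only model parameters, which a coordinator averages, weighted by how much data each site has.

```python
import numpy as np

def local_fit(X, y):
    """Each site fits its own linear model on local data (least squares)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Three environments with different amounts of private data.
rng = np.random.default_rng(3)
true_w = np.array([2.0, -1.0])
sites = []
for n in (50, 200, 120):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(0, 0.1, n)
    sites.append((n, local_fit(X, y)))   # only (sample count, weights) leave the site

# Federated averaging: combine the learning without pooling the raw data.
total = sum(n for n, _ in sites)
global_w = sum(n * w for n, w in sites) / total
print("globally averaged model:", global_w)
```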
I want to leave you with one thought, though: we're not advocating getting rid of humans from this mix. We're not advocating the replacement of humans. We're advocating the augmentation of humans, so that we, who have this cognitive capability, who can make leaps of intuition, can do the more interesting things. So, what are your thoughts, and where is this going to take you? Any questions I can answer? We've got two and a half minutes left.

Audience member: The thing I've seen here today is that the infrastructure, the cloud you have, is a dynamic beast. You're constantly adding and removing stuff. You get new customers, you get new workloads. When you're training an AI, if you change any parameters, you might not get the results you're looking for. And that's the biggest challenge we have today, because we don't have static environments to train in.

What we're finding, in a lot of the conversations I've been having with researchers in this space, and Nick is one of those people, is that we have to set up environments that are static to do the initial training. The trick is then moving those formulas into more dynamic environments and monitoring them. That's the core of why I made the comment about taking six, eight, ten weeks to train a model. The reality is you could train a model for two years and still not catch every single possible parameter. So, really good point.

Anything else? Otherwise I'm just going to throw hats into space and see if you all jump and scrabble for them. Silence, I think. All right. Well, thank you very much. I appreciate you coming in to listen. I hope I didn't bore anybody or put anybody to sleep. Enjoy the rest of the summit, and we'll see you all around.