Hello, everyone. Welcome to Cloud Native Live, where we dive into the code behind cloud native. I'm Annie, a CNCF ambassador as well as a senior product marketing manager at Camunda, and I will be your host tonight. Every week we bring a new set of presenters to showcase how to work with cloud native technologies. They will build things, they will break things, and they will answer all of your questions, so join us every Wednesday to watch live. This week we have Andy talking about powering up your machine learning, and I'm really looking forward to it. Another thing you saw in the banner, and can see in the little banners on screen as well: remember to register and reserve your spot for KubeCon + CloudNativeCon Europe - now is really the time to secure your spot. And as always, a housekeeping item: this is an official live stream of the CNCF, and as such it is subject to the CNCF code of conduct. So please do not add anything to the chat or questions that would be in violation of that code of conduct; basically, please be respectful of all of your fellow participants as well as the presenters. With that, I'll hand it over to Andy to kick off the presentation. Thanks. I was thinking to myself about "break things" - we are actually going to try and break things a little bit today as part of the anomaly detection, but hopefully not so much that the demo breaks. That's the plan, fingers crossed. So hey, I'm Andy. I work at Netdata Cloud, and I lead the analytics and ML capabilities there. The purpose today is mostly to show the first beta of the anomaly detection feature we have in Netdata Cloud, and also, hopefully, to have some discussion towards the end about ML in general, ML in the observability industry, and the trade-offs, pros, and cons of different approaches once we finish the demo. I think there's a link to the slides in the chat; if anybody wants to follow along, there's a bit.ly link as well: bit.ly/netdata-cncflive-deck. I've got one set-up slide, and then we will go into the demo. The main goal here is to talk about anomaly detection. There are lots of different ways to frame anomaly detection and lots of different ways to tackle the problem, and I won't go into too much detail. The high-level way to think about it is just a simple question: does my data look strange? There are then hundreds of ways to take that question and implement some sort of product or solution. On the screen are some examples of different types of anomalies we might come across. The first thing people usually think of is spikes in their data, but it's not just spikes - it can be lots of different types of patterns that look strange. And that's core to how we've approached this: we're not just looking for single high values or single low values, we're looking for strange-looking patterns in the recent data. That's the main aim here. There's one more slide, which I won't really go into in detail, on how we have taken that first question - does my data look strange?
The next slide shows how we have taken that high-level, messy question and actually formulated the problem and implemented it. In general ML discussions this is often the trickiest part: you have a high-level question, and you need to work out how to formulate it and operationalize it as a machine learning problem. This is our medium-level-of-detail slide of how we're approaching it in the Netdata agent, and there's more detail in the deck for anyone who wants to dig in. I'll get straight into the demo now, because I want to show people the feature first and save as much time as possible for questions. But if anyone has questions at any time, I'm happy to stop and take them at any stage - the more questions, the better; I'd like to have some discussion as well. The plan for the demo: I'll do a quick overview of what Netdata is, what Netdata Cloud is, and what the agent is, and then I will jump straight into a room I have set up and do some chaos engineering. This is the "break things" part. We're going to use Gremlin to trigger a chaos attack on my nodes, and we'll watch how that plays out through the Anomaly Advisor, to see the difference between a traditional needle-in-the-haystack approach and a more modern approach where we use machine learning to surface the insights. That's the main goal of the Anomaly Advisor. So I will jump out of presentation mode. First, a little bit about Netdata, really quick. We have an open-source agent that does monitoring, and it will monitor anything that can run C, basically - servers, IoT devices, and everything in between. The Netdata agent runs on the node and collects the metrics, and the metrics are all stored on the nodes, so there's no cloud centralization or anything like that; all your monitoring data sits with your agent, mostly focused on metrics. Then we have Netdata Cloud, which sits on top of the agents and brings them all together into a single dashboard for all your nodes, but in a federated way, such that there are no data centralization points - Netdata Cloud just talks to your agents and pulls them all together. In here I have a space set up for CNCF Live; every space has a general room, and I will jump into the specific room I have for just the three nodes I'm interested in today. What we see here is a dashboard of hundreds of charts and thousands of metrics, all at per-second resolution, coming through the agent in real time. I've got three nodes in here - CNCF Live 1, 2, and 3 - and we have cron jobs running on each of these nodes running CPU stress tests, so there's CPU work going on. This is the traditional monitoring approach, where you have a dashboard with lots of different categorizations and semantic groupings of charts. Each node typically has, out of the box, about 300 charts and 2,000 metrics or so - so there's a lot of data here.
And the challenge sometimes is how you make sense of it, and how you know where to go straight away. Often, with traditional approaches, you maybe have alerts, you have a theory of what it is you need to troubleshoot, and you come in and click around. That's generally still the way it is in observability. Some of what we're going to discuss today is about complementing that approach with machine learning. The whole theme of a lot of what we're doing here is to use machine learning as the UX. The traditional approach - you know what chart you want to look at, you have a hypothesis in your head, and you explore iteratively - is all still perfectly valid. The idea of the Anomaly Advisor, which is this anomaly tab, is to take a different approach: use the machine learning to surface the charts and metrics that may matter the most, or that may be the most anomalous. So I'll get going. If we look here, we've got these three nodes, and in the overview page in Netdata we have a summarization, which is an aggregation of all three. If I look at the last 30 minutes, group by node, and maybe make that a line chart, I can see the overall CPU usage of each of the three nodes. The orange line is around 12 to 20%, and the purple-pink line is up between 40 and 50, so the other two lines are at a higher level. That's all configured via cron jobs that kick off stress jobs. If I look at the applications menu - and I know what I'm looking for here - I'm interested in the stress app, which is all auto-discovered by Netdata. The setup is that on each node there's a cron job that runs stress-ng; let me make that a line as well so they're not on top of each other. For this one here, every three minutes it takes 30% of one CPU for 160 seconds, and every two minutes it takes another 30% for 140 seconds (a sketch of what those cron entries might look like follows below). So each of these nodes has some CPU work going on. Instead of doing a demo where there's just nothing happening on the nodes, I wanted some baseline CPU usage so we can see the impact of the attack we're going to make on top of it. A good part of the whole ethos and approach of Netdata is low configuration, zero configuration: I didn't have to do anything to have this stress-ng application get picked up. It's auto-configured out of the box, so if you're running the stress-ng tool, or MySQL, or Docker, or any other tool, Netdata should recognize it and give it its own application. That's how you can see which different applications are behaving on your machine, and that also applies to containers and such, which we might get to later.
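For anyone who wants to recreate that background load, here is a minimal sketch of what such crontab entries might look like. The exact commands are my assumption (the demo's real cron jobs weren't shown); the schedules and durations follow what was described, and the flags are standard stress-ng options:

```
# Hypothetical crontab entries approximating the demo's background load.
# Every 3 minutes: one worker at 30% of one CPU for 160 seconds.
*/3 * * * * stress-ng --cpu 1 --cpu-load 30 --timeout 160s
# Every 2 minutes: another worker at 30% for 140 seconds.
*/2 * * * * stress-ng --cpu 1 --cpu-load 30 --timeout 140s
```

Because the agent recognizes the stress-ng process, this shows up as its own application grouping without any extra configuration, exactly as described above.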
So enough of that - that's a quick overview of Netdata Cloud. What I want to look at today is this tab, the Anomaly Advisor. The main idea is a similar approach where you have some summarization charts, but the summarization charts are now based on a new concept we have in the agent: the anomaly rate. Every second, the Netdata agent is collecting all the raw metrics, but it's also now producing ones and zeros for those metrics - if it sees something it thinks is anomalous, it produces a one, and if it sees something that looks normal, it leaves it as a zero. So what I'm going to do is jump over into Gremlin and kick off an attack. I've got my two hosts in there, with the Gremlin agent running on both, and I'm going to kick off a chaos attack - a resource attack, maybe memory. We're telling Gremlin: for 25 seconds, take two gigs of RAM on each node. This is roughly equivalent to something bad happening - some sort of memory leak, or some misbehaving app that all of a sudden starts taking much more memory than it usually does. I'll kick that off, and in the background the Gremlin agent will fire up and do its attack. I'll flip back over into Netdata to watch this live. In the anomaly tab, we can now see a spike coming through. Let me go to the last 15 minutes - it was a bit zoomed out. What we have here is a big jump, a spike, in the number of anomalous dimensions on each of our nodes, CNCF Live 1 and CNCF Live 2. Let's see if the attack has finished - it's still ongoing; it should be finishing now. As the attack plays out, we see a jump in the number of anomalous metrics. This chart shows counts of anomalous metrics ("metrics" and "dimensions" are more or less interchangeable in Netdata). Let me play it out a little more, then pause. It's saying that at this timestamp, 17:13:39, there were 50 dimensions - 50 metrics - on CNCF Live 1 and CNCF Live 2 that were considered anomalous by the model. The idea is to show you, across the timeline, which period had an elevation in anomalous metrics. The chart below is very similar. In this case my nodes have the same number of dimensions, so the counts look similar, but nodes might be monitoring different things, so a raw count of dimensions might not be enough; usually it's the anomaly rate you care about. The anomaly rate says that at this particular second, 17:13:39, about 3.2% of the CNCF Live 2 node's metrics were considered anomalous, and CNCF Live 1 was similar, at about 3%. So we see a jump on both nodes to about 3% or so. The third chart on this screen is a higher-level aggregation again: on the Netdata agent itself, if the anomaly rate stays elevated for long enough, the node will produce a node-level anomaly event.
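To make those first two summary charts concrete, here is a toy illustration (my own sketch in Python, not Netdata's implementation - the agent does this internally in C) of how per-metric anomaly bits roll up into the anomalous-dimensions count and the anomaly rate:

```python
# Toy roll-up of per-metric anomaly bits for one node at one second.
# In the real agent there are thousands of dimensions; three suffice here.
anomaly_bits = {
    "system.cpu|user":   1,  # looked anomalous this second
    "system.ram|avail":  1,
    "net.eth0|received": 0,  # looked normal
}

anomalous_count = sum(anomaly_bits.values())              # first chart: a count
anomaly_rate = 100 * anomalous_count / len(anomaly_bits)  # second chart: a percentage

print(f"{anomalous_count} anomalous dimensions, {anomaly_rate:.1f}% anomaly rate")
```

The count depends on how many dimensions a node happens to collect, which is why, as noted above, the percentage is usually the more comparable number across nodes.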
And coming back to that third chart and those node-level anomaly events: what it's telling us is basically a rolling average of the anomaly rate - if it goes up past a certain threshold and stays up for long enough, we trigger an anomaly event. So one way to use this is to just come in, look at this chart, and see if anything stands out; if it does, that's usually a sign of an elevated anomaly rate on a single node. The idea is that this gets us to the point of: okay, we see we have a problem - or we may have a problem. It's perfectly possible these are false positives. And this is a big part of the ML story: the ML will tell you something looks strange, but whether it's strange in a way you care about is a whole different question. That's why in a lot of what we do there's always a human-in-the-loop approach. We're not making any judgment on the anomalies; we can just say "this looks anomalous", and then it's up to the human to decide whether it's something they need to act on. Here we can see an elevation in the blue and the red, so I'll filter for those two nodes, because they're the ones I seem to have a problem on. These first three charts tell us that between 17:13 and 17:15 there seemed to be some elevation in anomalies; what we want to know now is: what exactly was anomalous? The main way to use this is to highlight a region of interest, which is a general way we interact with charts within Netdata - you highlight regions of interest and you get context-specific help. Once I highlight this region, below the top three charts I get this table of sparklines. We still haven't come up with quite a good name for it; originally it was a heat map, but it's turned into a table of sparklines. Each of these green lines is the anomaly rate for a particular metric of interest. What it's saying here is that apps lwrites for Gremlin, within this highlighted window, was considered anomalous 57% of the time - or, the anomaly rate was 57% across these two nodes. The idea is you can quickly scan what went anomalous and when. Here's an interesting one: the netdata user. That was probably me on the overview screen triggering queries to Netdata - I hadn't been on this node in a few hours. It's a good example of how you can get a mix of things going on at the same time, but because you see it over time, you can see how each has evolved. I can clearly see that at this point Gremlin started doing lots of work. Let's find something else interesting - it's a lot of Gremlin, because Gremlin was automatically discovered by Netdata. Here's a nice one, actually: memory available. This is one of those high-level metrics, and we can see that on both nodes the memory available was steady at about 2.5 gigs each.
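As an aside, the per-window percentage and the sorting behind that sparkline table are simple to picture. Here is a small sketch (my reconstruction, not the actual cloud-side aggregation) of how a highlighted window's anomaly bits become a ranked list of "most anomalous" metrics:

```python
# Sketch: rank metrics by their anomaly rate inside a highlighted window.
import numpy as np

rng = np.random.default_rng(42)
seconds = 120  # length of the highlighted window

# One array of per-second anomaly bits (0/1) per metric; synthetic here.
window_bits = {
    "apps.lwrites|gremlin": (rng.random(seconds) < 0.57).astype(int),
    "users.cpu|netdata":    (rng.random(seconds) < 0.20).astype(int),
    "mem.available":        (rng.random(seconds) < 0.05).astype(int),
}

# Window anomaly rate per metric = mean of its bits, as a percentage.
rates = {m: 100 * bits.mean() for m, bits in window_bits.items()}
for metric, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{metric}: ~{rate:.0f}% anomalous in window")
```

Sorting by that percentage is what pushes the metrics that "went strange" during the window to the top of the table.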
Coming back to that memory-available chart: I let Gremlin take two gigs, which was probably a lot, because these are small VMs. You can see that as soon as Gremlin started its chaos attack, the memory available dropped to half a gig or so on each node. And as memory available dropped, the anomaly rate jumped up, because we'd never seen a drop like this in the data the model was trained on. So at a quick glance - the idea is to filter the dashboard, filter all your metrics down to maybe the top 20 or top 50, so you can quickly scan them and get a feel for whether it's something you care about, yes or no. That's the main idea: we've tackled the search problem by using the anomaly rates to filter and sort your metrics, showing you the ones that looked the most strange during the window. I'll let it run a little more - that's the majority of the demo. Let me take a quick check for questions, because I'm keen to discuss. No questions so far, but there is a comment - "awesome, excellent graphics and analysis" - and a lot of hellos, so hello to Peter, Hill, and everyone else; glad to see you all here. If you have any questions, just put them in the chat. Cool - well, I have another demo as well, because I wanted to show a different kind of demo. Oh, now there's a question - it always comes a bit later than we imagine. "Machine learning at the edge sounds cool, but how do I know if my IoT devices can handle it?" That's actually a great question, and a good chance for me to have a look. Let me pause here. In terms of overhead, we can look at each of these nodes: if I go to applications, take the last six hours, and look at the CPU overhead for just Netdata - let me unselect that other dimension - you can see a few little peaks when I was actively querying Netdata from the dashboard, but generally it's taking just over 1%, if even that, of one CPU on these machines, and these are the lowest-tier GCP VMs. We've built it to be as lightweight as possible, because that's core to the whole approach: instead of just taking the raw data and displaying it on screen, we take the raw data, learn a little bit from it, and do a small amount of extra work to also produce these ones and zeros, the anomaly bits. There are a lot of configuration options when you're enabling the ML. You can have it train over a longer window - say, only train every four hours, and it'll spread the training cost across those four hours - or you can say only train on these specific charts. There are about five or six different levers you can use in the configuration. Let me jump over so you can see it real quick in the readme.
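As a hedged sketch of what those levers look like in netdata.conf - the option names below are from my recollection of the agent's ML readme around this time, so treat them as approximate and check the readme for your version:

```
# netdata.conf -- illustrative [ml] settings only; names and defaults may
# differ between agent versions, so verify against the ML readme.
[ml]
        enabled = yes
        # train less often, spreading the training cost over a longer window
        train every = 14400
        # cap how much history each model trains on
        maximum num samples to train = 14400
        # randomly sample a fraction of the training data for cheaper models
        random sampling ratio = 0.2
        # skip charts you don't want models for at all
        charts to skip from training = netdata.*
```

Each of these trades a little model freshness or fidelity for a lighter CPU footprint, which is the knob that matters most on small edge devices.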
There are lots of configuration options like these where, as a user, you can lighten the load. You can set a longer training window, which spreads the cost of training out over a bigger window, or you can train on less data. We also have a few optimizations - you don't necessarily have to train on all the data; you can randomly sample, say, 20% or 10% of it and still get a useful enough model. So those are the options, and my preference is usually to try to do this on the edge. But if it gets to the point where that's just not possible, Netdata agents can also stream to parents, and sometimes people like to start by enabling this stuff on parents. There's a launch post on our community forum - it's in the deck as well - that has an example configuration for that case. The example there was: you have a parent, and say you have three IoT devices; you could easily have those three devices stream to the parent, and then all you'd need to do is enable ML on the parent, and it will automatically do the training on the parent for the data that's streamed in. Then there's no ML happening at your edge at all. For IoT devices, that's probably what I'd recommend. You might be able to run it at the edge - for instance, when I run it on my Raspberry Pi with everything turned on at the defaults, it might take maybe 3% CPU - but for IoT that might be too heavy. So for IoT setups you might want the parent approach, where you just stream your metrics to the parent and the ML happens on the parent. And you don't necessarily need to keep storing those metrics on the parent long-term - once it's trained, the models are there - so you're still not having to centralize all of your data in one place; you're only streaming it through to the parent, and the parent learns from the data and then applies the models when scoring. So for IoT, that's probably what I'd recommend trying first, but you could try it on the edge - it depends what the device is, I guess. So yeah, good question. That was a core part of the design, and it's why we use K-means as the model under the hood - unsupervised clustering - because it can be done very cheaply and efficiently in the C++ code in the agent. One of the things we always think about is that the ML can never have too much impact on the agent that's doing the monitoring. Typically 1% or so of one core is almost like the insurance you pay - you can think of it that way - but sometimes even that might not be feasible, especially for IoT, and that's when you might want to look at the parent-child approach. Great, perfect. There's a follow-up question from the same person, Ego, who continues - "wow, this is great", by the way, he mentions - does Netdata also notify me if there's an anomaly in the middle of the night? Yes - so Netdata comes with lots of pre-configured alerts, traditional alerts, handcrafted through years of experience and pain.
What we haven't done yet is build automated alerts based on these anomaly rates. You could, and we're keen to do that soon, but we want the ML to prove itself before we do automatic alerts based on machine learning, because the last thing we want is to compromise the integrity of the handcrafted alerts we've built up over the years before the ML proves it's more right than wrong. Pretty soon, what we're going to do is make the anomaly rates available to the health engine, so that if a user wanted to, they could easily trigger an alert off them. For example, a traditional alert today might be: if CPU usage goes over 80%, trigger a critical warning. You could modify that based on the anomaly rate to say: if the anomaly rate is still less than 50%, don't do anything - because it may be that you run at 90% CPU by design; that's the whole point, you're optimizing these nodes for CPU. What you're really interested in is if the CPU were to drop or change pattern - that's when the anomaly rate would go up, and that's probably what you'd care about. So we are keen to make these anomaly rates available to the health engine, but we haven't gone as far as feeding alerts off any of this just yet, not until it proves itself - this is the first generation, and we want to iterate and improve. That's why for now it's a little bit passive: it's not going to wake you up in the middle of the night with alerts just yet. If a user wanted to do that, they could, but we're not going to do it out of the box anytime soon. Ultimately, though, that would be the nirvana. And there's a whole lot of other ML work we want to do, like using ML to solve alert fatigue, because that's another area of observability that's ripe for it - there's lots of low-hanging fruit. We started with anomaly detection, but the next big ML problem we want to tackle is using ML to solve alert fatigue. So yeah, good point. There will be lots more detail on how you might configure an alert based on the anomaly rates; for now you'd have to configure it as a custom alert, which isn't that hard, but it's not out of the box just yet. Cool - I've got one last little bit of demo. I'll do a quick one, because it's a nice one, I think. Let me go back, turn everything back on, last 15 minutes. On these nodes I have a little app running. I'm going to kill these VMs afterwards, so I don't mind anybody seeing the IP - that was one thing I was going to do: if people want to connect to this, they can; I'll stick it in the chat. Maybe a third demo is for people to come in, look at the Netdata dashboard, and we can see how that plays out. First I'll connect to the dashboard and take the last 30 minutes. We have a little app running in a container on this node - a Python app, basically a Dash app.
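Before the demo continues: to make that "threshold plus anomaly rate" alert idea concrete, here is a hedged sketch of how you might gate a static CPU alert on the anomaly rate from outside the agent, by polling its data API. The agent exposes the anomaly bits through the same API as the raw data (via the anomaly-bit option shown later in this session); the response parsing below is approximate and version-dependent, so treat the whole thing as an illustration rather than a recipe:

```python
# Hedged sketch: alert only when CPU is high AND the pattern looks abnormal.
import requests

BASE = "http://localhost:19999/api/v1/data"  # a local Netdata agent

def fetch(chart, options=""):
    """Fetch the last minute of per-second rows for a chart."""
    r = requests.get(BASE, params={"chart": chart, "after": -60,
                                   "format": "json", "options": options})
    r.raise_for_status()
    payload = r.json()
    result = payload.get("result", payload)     # nesting varies by version
    return [row[1:] for row in result["data"]]  # drop the timestamp column

cpu_rows = fetch("system.cpu")                         # raw utilization
bit_rows = fetch("system.cpu", options="anomaly-bit")  # anomaly bits instead

avg_cpu = sum(sum(v or 0 for v in row) for row in cpu_rows) / len(cpu_rows)
flagged = sum(1 for row in bit_rows for v in row if v)   # nonzero = anomalous
anomaly_rate = 100 * flagged / sum(len(row) for row in bit_rows)

if avg_cpu > 80 and anomaly_rate > 50:
    print(f"ALERT: cpu={avg_cpu:.0f}%, anomaly rate={anomaly_rate:.0f}%")
```

The point of the gate is exactly the one made above: a node pinned at 90% by design stays quiet, while a change in pattern is what actually fires.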
That Dash app is what we use to do some proof-of-concept stuff internally. What I'm going to do is come in and give it this URL, and the app will query the local agent, pull the data for all the metrics, do some clustering, and give me back a clustered heat map - which is also something I'd love to add into Netdata soon. So the idea is: this is all the raw metrics, a big long heat map, and the ordering of everything is based on clustering. You can actually see a good example with Gremlin: when I did the chaos attack, all the Gremlin stuff turned on together, and cron apparently got throttled because of it. So you can quickly scan and see the fingerprint of which metrics behave together as groups. If I look back at how that played out in the Anomaly Advisor - I have this red line, which was CNCF Live 1, the one we used - I'll filter for it, and if I highlight the area, same thing: I see a spike, I highlight the area, and we see the results. There's a small delay sometimes, because when we do this highlight there's an aggregation that needs to happen, so I usually give it 20 or 30 seconds. That's an optimization behind the scenes, where we aggregate all the anomaly rates onto one sort of virtual chart, and that's what powers this search so efficiently. So we can see a few things here: the netdata user, and the Grafana agent for some reason - kind of interesting, let me see. Sometimes it's not quite clear, and this is a good example: I should zoom in tighter on the particular window I'm interested in. Yeah - and now I can see what I was after: my netdata-ml-app container came to life here, and you can see it started doing some CPU usage, and some network traffic as it was displaying the heat map on screen - or that's probably actually the agent itself. So this is a case where something happened in a container, and you can see straight away that this is the container of interest - you see it at the high-level system metrics, but you can also then see the individual container-level metrics. That's basically the main idea: to complement the traditional approach of a big dashboard with charts, where you have some idea in your head and you click around - "maybe I'll check network; okay, maybe I'll check memory" - the Anomaly Advisor basically uses machine learning as the UX. And that's the big theme here, the bigger picture of all this work: observability has lots of areas where we can use machine learning as the UX and go beyond dashboards.
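For the curious, the proof-of-concept app's core idea fits in a few lines. Here is a rough sketch (my own, with assumed chart names and response parsing - the real app surely differs): pull recent data for some charts from a local agent, normalize each metric, and order the heat map's rows by hierarchical clustering so metrics that move together sit together:

```python
# Sketch: cluster-ordered heat map rows from a local Netdata agent's data API.
import numpy as np
import requests
from scipy.cluster.hierarchy import linkage, leaves_list

BASE = "http://localhost:19999/api/v1/data"
charts = ["system.cpu", "system.ram", "system.net"]  # assumed chart ids

series = {}
for chart in charts:
    payload = requests.get(BASE, params={"chart": chart, "after": -1800,
                                         "format": "json"}).json()
    result = payload.get("result", payload)  # nesting varies by version
    for col, name in enumerate(result["labels"][1:], start=1):
        series[f"{chart}|{name}"] = [row[col] or 0.0 for row in result["data"]]

X = np.array(list(series.values()), dtype=float)
# Normalize per metric so clustering compares shape rather than scale.
X = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-9)

row_order = leaves_list(linkage(X, method="ward"))  # co-moving rows adjacent
for i in row_order:
    print(list(series)[i])  # the heat map's row order, grouped by behavior
```

That row ordering is what produces the "fingerprint" effect described above, where a whole block of Gremlin metrics lights up together.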
So that's the idea: you find a time of interest, and you highlight the area. The next step, if you do find it useful, is the feedback here, and the ultimate goal would be to build models on top of that - okay, it looks strange, but is it something you actually care about? It would be nice if you could give a thumbs up or thumbs down; then you could build a model that layers on top which types of anomalies you actually care about and which you don't. At the moment, this is just saying "the model thinks something looks strange" - not that it's something you need to take action on. That's where the human in the loop still has to make the decision, as the ultimate expert. So that's most of the demo. Let me switch back over to my slides. I'll do a quick bit about what's under the hood, because this last part was basically about what's going on. If I go back to this agent, we can have a look. On the agent, take system.net, say: for every chart, every second, we have the raw metrics. At the same time, the agent is also producing the ones and zeros. If I query with options=anomaly-bit - and it's called the anomaly bit because we've implemented this in a really efficient way such that there's actually no storage overhead at all. In the internal representation Netdata uses, one of our really clever C engineers figured out we could repurpose a spare bit and flip it when there's an anomaly, so there isn't any storage overhead to store all these ones and zeros - we get them for free. What this is saying is: at this timestamp, for whatever this is - it's traffic sent - Netdata considered the recent observations anomalous. And we don't just take the most recent raw value: we take a differenced, smoothed, lagged window of the most recent five or six values, to try and capture the pattern. We go into a lot more detail in the deck on exactly what pre-processing we do and what model we use, and there's also a lot of detail in the readme, plus a Python notebook. One of our big philosophies is that this machine learning should be as open as possible - there's nothing magic about it. We want to educate our users, not dress it up as something super fancy; we want you, as the user, to understand when it might work and when it might not, and to be the one who decides: do I trust this or not? To that end, there's a Python notebook you can open in Colab that pulls data from one of our demo servers and walks through a Python version of how this all works. Of course, in the agent the implementation is much more efficient - it's in C, and a little bit different - but the general approach is all in there.
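In that spirit, here is a compressed Python sketch of the approach as just described (a paraphrase of the deck and notebook, not the agent's code - the agent does this in C/C++ with dlib's k-means): difference, smooth, and lag the series into short feature vectors, train k-means on a window of recent data, and flip the anomaly bit when a new vector sits unusually far from every centroid:

```python
# Sketch of the anomaly-bit idea: k-means over preprocessed feature vectors.
import numpy as np
from sklearn.cluster import KMeans

def featurize(x, smooth=3, lags=5):
    """Difference, smooth, then lag into rows of the last `lags`+1 values."""
    x = np.diff(x)                                         # de-trend
    x = np.convolve(x, np.ones(smooth) / smooth, "valid")  # light smoothing
    return np.stack([x[i - lags:i + 1] for i in range(lags, len(x))])

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 4 * 3600)  # ~4 hours of "normal" per-second data
model = KMeans(n_clusters=2, n_init=10).fit(featurize(train))

def score(x):
    """Anomaly score = distance to the nearest learned centroid."""
    return model.transform(featurize(x)).min(axis=1)

threshold = np.percentile(score(train), 99)  # calibrated on training data

# A fresh window whose second half shifts to a pattern never seen in training.
fresh = np.concatenate([rng.normal(0, 1, 60), rng.normal(8, 1, 60)])
bits = (score(fresh) > threshold).astype(int)  # per-step anomaly bits
print(f"anomaly rate on fresh window: {100 * bits.mean():.0f}%")
```

The thresholded distance is the "one" or "zero" discussed above; everything else in the feature, from the anomaly-rate charts to the sparkline table, is arithmetic on those bits.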
And I'm always keen to get feedback, so for any feedback and discussion, feel free to hop onto the launch post in our forums and we can start chatting. Yeah, and there's actually a question from the audience. Jimmy asks: hi, could you finish Keploy for TypeScript? TypeScript - what was the question? Let me see if I can read it. What was the question again? It's the second question in the chat, but if it's unclear, Jimmy, you may want to elaborate a bit. Oh, I see. Yeah, I'm not sure what that question is, actually - maybe it's about some specific collector or something. Well, Jimmy, if you can elaborate a bit more, we'll get to your answer - but thank you so much for your question. And there have been a lot of comments from everyone saying "good work" and "great stuff, Andy", so everyone seems very excited. Great - well, it didn't crash, so far so good. There are just two last slides here, which are the main call-outs. Netdata is free - it's open source, and there will always be a free tier. So anyone who's interested, feel free to just install the agent and enable this ML. There are two steps: you need to make a one-line config change on the agent - it's not on by default just yet, but the plan is that maybe in the next release or two, once it's battle-tested, it will be on by default - and then there's a small bit to enable it in Netdata Cloud once you've claimed your node. Any feedback would be great; I'd love people to jump onto the forum post and give some. We also have a Discord, and we use GitHub Discussions, so wherever suits, please feel free to reach out - I've got a dedicated ML channel in the Discord where I'm always trying to get people to talk to me. The last shout-out is a big thanks: we're an open-source agent, and we use the dlib machine learning library, which is itself an open-source project - we're building on the shoulders of dlib for the actual hardcore ML algorithms; we always want to make sure we call that out. And then the team itself: as with ML in general, it's very cross-functional. We have really clever C engineers who work on the agent, product people who bring it all together to make sense, lots of front-end work for all those nice charts, and lots of UX work as well - UX is sometimes the hardest part of all this; producing the ones and zeros is one problem, and the UX is as hard, if not harder, in some ways. And of course there's plenty of backend work within Netdata Cloud. I think we've covered the whole range of roles on this project, so I just want to give a shout-out to the team, because it's definitely a team sport. And yeah, feel free to try it out and reach out. Perfect - amazing. So there's a question: Ro says, that's just what I needed - is it free? Unbelievable. Yep, it's free - free forever. Our founder Costa has a catchphrase he likes to say, which is "the value is free". The open-source agent is free, and Netdata Cloud is free.
And the whole idea of Netdata Cloud is that the value you get out of it is free. There will always be a free tier, and eventually we'll add commercial offerings for things like authentication and typical enterprise-y stuff, but the value itself will always be free. That's one of the central tenets of Netdata; I found it inspiring, I liked it, and I like that we try to live by it. Perfect - Ro seems excited about it as well. And I think now is a perfect time for everyone in the audience to ask their questions. Do you have any other finishing words for the presentation? I can of course kick off by asking a few questions, but please, everyone, jump on this chance to pick Andy's brain a bit more and learn more about the topic. So: could you talk a bit more about the current state of machine learning in general? Yeah - machine learning in general, I think, is now officially becoming just another tool, like anything else. If you're a software engineer, it's easier and easier to reach for machine learning as a tool like any other. That wasn't the case even five years ago, but now getting machine learning into production is super easy. Even myself, recently - I was playing around with BigQuery and Vertex AI, and it was dangerously easy: within about 60 minutes I was able to have an ML endpoint up that would give you an alert CTR prediction. That's one approach to potentially solving alert fatigue: think of alerts almost like an advertising problem and build a click-through-rate model. We have all of our alert click data in BigQuery, so it was really easy for me, on my own, to train an AutoML model in BigQuery, deploy the endpoint, and essentially hand it over to the backend team and say: here's the endpoint that gives you the CTR probability for these inputs. So it really is getting easier to get machine learning into production. But I also find that the observability space, as an industry, is still very early in the journey - machine learning is still this fancy new feature, as opposed to industries like finance, insurance, or marketing, where machine learning is core to what they do: risk models, CTR models, recommendation engines. Observability is a little bit behind; we're only starting on the journey of machine learning becoming just another part of the furniture in the observability landscape. That's why I think there's lots of low-hanging fruit, where we can have a big impact with relatively little work - and there's lots to do, and lots of data as well, which is great. Yeah, great - so clearly there's a lot happening on the Netdata front, and we just saw a really great demo of it. So, there's a question from the audience: how do you handle a metric that is just really spiky or erratic all the time? Yeah, that's a good one. If it's just really spiky or erratic all the time, and that's normal, then that's going to be okay.
This touches a little on what we do under the hood, which is unsupervised clustering. Think of a metric that oscillates between, say, a spiky behavior and a less spiky behavior: what actually happens under the hood is that we train two cluster centroids that try to capture those normal behaviors. So if it's just normally spiky, and that's considered normal, that's okay - you might have a spiky raw metric, but the anomaly rate for that metric will just bounce around at 1% or 2% every now and then. Sometimes I think of it like this: if every chart, every line in the dashboard, had a toggle that converted it to an anomaly rate, then what you'd really want to see is just flat lines everywhere, meaning everything is normal. You don't really care about the behavior itself; you just want to know: is this normal, yes or no? So the idea is that the machine learning model should learn these normal behaviors - and "normal" here depends on how long it's trained on. By default that's the last four hours, but it can be extended to 12 or 24 hours, and we're looking at ways to extend it more or less indefinitely, in a cheap and efficient way. So for a naturally spiky metric, the spikes will just be learned as normal. It always depends on the particular metric, though, to be honest. One of the next big things we want to do is make it so that, from anywhere within Netdata, you can look at the anomaly rate for a particular metric and decide for yourself whether you agree with it, whether you trust it or not. Say I see this CPU user metric, and I can see it's indeed spiky because I'm kicking off all these cron jobs: on the same chart, I should have an anomaly rate line - on a second axis, maybe - a sort of flat dotted line that bounces around 5 or 10% but never really goes up to 50 or 80%. It would only do that if, say, these spikes all of a sudden flattened out into a flat line. That would be an anomaly, and a real sign that the workload isn't happening any more like it used to. When it flattens out and goes smooth, that's when you want your anomaly rate to really jump up and show you: oh, something's different here. So it's about what the normal patterns are and what the model has learned. For some metrics it won't work quite as well, and for others it will - it's always a trade-off. Yeah, makes sense. Great, thank you so much for that question - keep them coming. So, before we run out of time: there's a lot that Netdata is doing in this space, but what is next for Netdata? What's the future - what does the product roadmap look like? Yeah - the next immediate thing is basically anomaly rate on every chart. Because we've got all these Lego blocks done - the anomaly bit is now a core building block of the Netdata agent - we are now starting to build features on top of it.
And so the first feature was the high-level, top-down anomaly detection - the anomalies tab, the Anomaly Advisor. But there's also a bottom-up approach: while you're doing your traditional flow of looking around - maybe it's RAM - you'll end up at some point looking at a line, you see a drop in this red line, and you want to know: is this normal or not? At the moment you don't really know, because you need context; you'd have to scroll out, look, and conclude, okay, actually it's not that big a deal. The idea is that if you had the anomaly rate available at that same moment, it would give you that extra bit of context at the click of a button, without having to think too much. That's one way to enable bottom-up anomaly detection: throughout people's normal troubleshooting journey, as they go about troubleshooting things, they can also just see the anomaly rate behind these lines. And that's easy - it's just front-end work; we have the anomaly rates. We just need to do it in a way - this gets to the UX being the hard bit - where it's not an extra anomaly-rate line for every line, which would get crazy and confusing. So that's the hard bit: being mindful of how we make it seamless and easy for users. That's the short term. The next big problem is alert fatigue. This Anomaly Advisor is now live; it's the first version. Eventually I want to get to fancier things - deep learning, autoencoders and such - but we're not ready for that yet, so we started with something middle of the road; K-means is a good workhorse model. I do eventually want to build up to more complex models, but doing that at the edge is the tricky bit, so we need to figure some of that out. The one next big problem I really do think we can solve, or help solve, with ML is alert fatigue. Netdata comes out of the box with lots of alerts - I don't have any firing at the moment - and these alerts might not be configured exactly how you want them, depending on your specific workload. So what we really want to do is help solve alert fatigue using ML, and this is something I don't think has been done elsewhere just yet, for some reason. After we show you 50 alerts or so, if you give us thumbs up or thumbs down on some of them - and even if you don't: if we show you an alert and then we don't see a troubleshooting session within 20 minutes after it, we can infer soft labels from that. You can infer a lot from alerts - did somebody click the alert, did they even open the email? There's lots of signal we already have that could be used to build alert-ranking models that could say: okay, I've got 50 alerts right now.
If I can tell you with accuracy that the click-through rate on each of these alerts is less than 1%, we can then automate the routing of those alerts. So there's loads of room, I think, to bring ML to alert fatigue, because it's a general problem that we all have, and there's definitely some low-hanging fruit there. That's the next big challenge, while we also keep iterating on the anomaly detection now that it's live, you know? Perfect - so there are a lot of benefits here, for sure. What do you think is the major benefit for companies adopting this - is it monitoring, or troubleshooting, or how does it go? Yeah, I think the major benefit is just helping with the search problem. There are probably two main approaches. I like to come in and read the news, basically, in our production room: if this were my production room and I wanted to check what happened in the last six hours - well, straight away here I can see, whoa, something happened around five o'clock, what's going on here? And I can zoom in. That's where I read the news of my infrastructure. But there are other approaches as well: Costa, our CEO, tends to use this much more in real time. He's already got a hunch there's a problem on some system, maybe from some alert, and then he flicks over to using this in a real-time way, to see which things are the most anomalous at this particular moment. So there are two modes: real-time troubleshooting in the moment, where it might help, and the more passive approach, where you come in and check your monitoring. Instead of starting with 300 charts and 2,000 metrics, where it's up to you to decide where to click, we can show you, ready-made: here are the most interesting things, the ones that changed or looked the most strange in the last 24 hours - is this something you missed, yes or no? So it's trying to solve the search problem a little bit. Great - I think it's time for a final call for questions; we're getting to the end of the stream. So this is the final call: ask your questions now. And I assume that if people realize later on that they would have liked to ask something, they can reach out to you on socials or the forum you mentioned, and so forth? Yeah, wherever suits - the CNCF machine learning channel in Slack as well, I'll be in there - or hop onto the community posts or Discord; it's all in the deck. And for anyone who's really, really curious, there's also the other deck, which goes into much more detail about how it actually works. If you're curious about the machine learning side of things, that deck could be worth checking out: it shows how we actually featurize the data and how it all hangs together to get these ones and zeros out the back end. So feel free to have a look there as well if you're curious - it's linked in the deck. But yeah, feel free to try it out; I'd love to hear from people. Perfect - yeah, that's really great.
And there were a lot of comments here already from people who were very excited about it, and hopefully they're going to try it out as well. The discussion can also continue in the Cloud Native Live Slack channel if anyone has anything to add. But yes - the stream is starting to wrap up now as we near the top of the hour. Any final words or finishing sentences from you, Andy? No - I'm just glad that we got through the live demo and nothing broke. It was broken yesterday, we got it fixed, and we got through it, so I'm going to take it, have a coffee after this, and have a rest. Perfect - no demo effect this time, so you can take a breather. And thank you, everyone, for tuning in; as mentioned, it was a great demo. As always, thanks everyone for joining the latest episode of Cloud Native Live. It was really great to have a session about how to power up your machine learning. I really loved the interaction today, and the questions from the audience - a lot of possibilities in this space, really nice to see. As always, we run the latest Cloud Native Live every Wednesday, so tune in next week as well, when we will have a session on certificate management with Linkerd. Thanks for joining us today, and see you next week. Thank you, bye.