So I think we are all set now, and we can start our webinar this morning. Hello everyone, I'm Sasha, and I would like to thank you for joining us at the "Next-Gen Observability Using Open-Source Monitoring" webinar. Today's speakers are Scott Fulton, CEO at OpsCruise, and Aloke Guha, CTO at OpsCruise. We are also extremely delighted to have a guest speaker with us: Karl Gouverneur, former chief information officer of Northwestern Mutual. Before we start, just a few housekeeping rules for this webinar. The talk will be about half an hour long, followed by a demo, and then a Q&A session at the end. You can submit your questions during the webinar in the Q&A section below, and we will try to answer all of them at the end of the talk. The webinar is also being recorded, so you will have a chance to review it once we send you a link later on. Thanks again for joining, and I'll let Scott Fulton, CEO of OpsCruise, take it from here. Scott, the stage is yours.

Well, thank you, Sasha, and welcome everybody, this morning or this afternoon depending on where you're dialing in from. I know there are a lot of distractions in the world today, and everybody's moved to virtual, so getting any time with folks like you is precious. Thanks for making the time this morning. What we want to talk about is a real shift that's going on in the industry. Obviously, many of you are on this journey to cloud-native apps. You can see the changes: from the '90s, when we had distributed applications, to the early 2010s, when we started taking those existing apps and lifting and shifting them to the cloud, to the decade we're just embarking on, where we're really trying to take advantage of the cloud's architecture and benefits, refactoring those apps into microservices and containerizing them.
That shift is in full swing. Gartner estimates, as you can see in the quote below, that a basket of workloads will be containerized in the next three to four years. That agenda, as you probably know, is driven in most organizations by the developers and the business units, and the folks on operations or site reliability engineering teams are in many cases struggling to keep pace with that kind of change, because there's a heck of a lot of new challenges. You have an order of magnitude more components to manage that make up these modern applications. You're releasing a heck of a lot more frequently. You have many more applications than you had a decade ago, because so much of the business is driven by digital transformation. And then the dependencies out there are pretty significant. You used to write an app, and all the different components and services that made up that app were under your control. These days, if you're an e-commerce company doing logistics and tracking, that's a third-party service you're calling out to. If you're calculating sales tax, that's probably another third-party service you're taking advantage of. And if you're using geolocation, that's probably a third. So the dependencies are pretty significant in a modern application. And when you think about monitoring those apps: modern apps need a modern approach to observability. The kinds of tools that were popular and worked in the '90s aren't going to be the same tools that are architected to support the current set of environments.
What we see is that the tools of the '90s and 2000s, which served the apps of that time, were fairly proprietary. You'd go in, set thresholds, triage things, and do resolution by jumping between different screens of different tools. You tended to have one tool for logs, another for traces, another for metrics. The net effect was that they were pretty expensive to buy and run, and that's just not plausible for the next generation of apps. So we see the industry moving, independent of OpsCruise, to an open-source foundation for monitoring. A vast percentage of the R&D spend of those companies on the left (I worked at several of them over my career) went to the sheer work of making sure you had agents that supported all the platforms, making sure you could aggregate all the metrics and logs in a central place, and making sure you had good visualization technology for dashboards. That's where a big amount of the spend was. Today that foundation layer is much more attainable through the open-source toolsets proliferating in the market. And on top of that, if you can automate a lot of those things and put analytics on top, you can drive pretty significant cost savings and more predictability around your apps. So what's driving this change? Why is open-source monitoring becoming more popular? We have a few key theories. One is that the infrastructure tech stack itself is increasingly open source. Think of the databases you use to build modern apps: probably a MongoDB, maybe Elasticsearch or MySQL.
Those are open source. Twenty years ago, most apps were built on Oracle. Same thing with messaging, same thing with analytics. The tech stack is open source, so it's a natural extension that the monitoring will be open source as well. There are also a lot of new instrumentation standards out there. You have something like cAdvisor for containers: it doesn't matter what VM that container runs on, it has the same set of instrumentation that you can depend and rely on, and the community is driving that. I think there are also changing priorities for these modern apps. It used to be critically important to understand what was happening in the code of a single VM or a single set of binaries. Now, with microservices, it's more important to understand what's happening between all those services. What's at the networking layer, the latency and response time, is as important, if not more important, than what's happening in a small segment of the code. And one of the last drivers is that instrumentation can be used for a lot of different purposes. You used to aggregate all this stuff and ship it off to your monitoring tool, then aggregate the same stuff, or something very similar, and ship it off to your security tool, and then do it again for your capacity planning. Now, with these standards, there's the possibility of collecting this stuff once, owning it in these tools, and then shipping it off for different business purposes, as opposed to every proprietary tool aggregating and collecting it on its own. Those are some of the key things driving the change that we see, not to mention the cost of the proprietary tools that are out there. So what are some of the popular projects?
There are many. The ones that we most closely follow are around the CNCF, the Cloud Native Computing Foundation, which has Kubernetes as its anchor and foundation project. Kubernetes is really becoming the operating system for cloud applications, and it provides a rich set of telemetry, from a configuration perspective, a metrics perspective, and a networking perspective, that can be used and analyzed in higher-level tools. On top of that we have four key areas. Metrics come through Prometheus. Prometheus was one of the first projects to graduate the CNCF after Kubernetes. It's a time-series metrics database, very simple to use: you just have a metric and some key-value pairs. It has a very powerful query engine, PromQL, behind it, and hundreds of different exporters for every database and middleware component you can imagine. It was authored by Julius Volz, who's one of our board advisors at OpsCruise. He came out of Google and invented it at SoundCloud, and he's very active in that community, but Julius is complemented by some 500 other committers out there in the community. A close cousin to Prometheus is Loki, which is all about aggregating logs in a multi-tenant system, with some of the same attributes as Prometheus: all your searching, indexing, and so forth of logs goes through Loki. For traces, there's the Jaeger project.
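To make that metric model concrete: a Prometheus sample is just a metric name, some key-value label pairs, and a value. Here's a minimal, stdlib-only Python sketch that parses the text exposition format a Prometheus server scrapes from its targets; the metric and label names below are invented for illustration, not taken from any real exporter.

```python
import re

# A hand-written snippet in Prometheus's text exposition format.
SCRAPE = """\
# HELP http_requests_total Total HTTP requests handled
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/cart"} 42
http_requests_total{method="POST",path="/checkout"} 7
"""

def parse_exposition(text):
    """Parse Prometheus text format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = re.match(r'^(\w+)\{([^}]*)\}\s+(\S+)$', line)
        if not m:
            continue
        name, labelstr, value = m.groups()
        # Each distinct label combination is its own time series.
        labels = dict(re.findall(r'(\w+)="([^"]*)"', labelstr))
        samples.append((name, labels, float(value)))
    return samples

samples = parse_exposition(SCRAPE)
```

Every exporter Scott mentions ultimately emits text in this shape, which is a big part of why the ecosystem interoperates so easily.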
Jaeger came out of Uber, and it really sits at the intersection between observability and networking, helping you understand the trace path, latency, and errors between all the microservices and containers that make up a modern application. With its sampling, its primary role in life is offline analysis, to do debugging of the code; it's really a favorite of the developer community itself. So net-net, all four of these areas have a powerful set of features, a lot of contributors, and really rapid adoption. And the great thing about these projects, from what we've seen, is that they haven't really been co-opted by the commercial vendors. We see many vendors simply offering exporters to the ecosystem for their area of technology, and that's great to see. Just to give you a sense of how fast it's moving: those of you with a little gray hair like myself will be familiar with Nagios. Nagios was probably the most popular open-source monitoring tool in the late '90s and early 2000s, and it's still active, with a very active community. But in the scheme of things, relative to all the proprietary monitoring foundations that were out there, it had relatively small market share. Compare that to Prometheus, which is only four to five years old: it has 50 times the adoption if you look at metrics like GitHub stars, it has 10 times more contributors, and if you do search analysis on Google, it's off the charts.
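Coming back to Jaeger for a moment: its query service exposes collected traces over an HTTP API (for example, `GET /api/traces?service=<name>`), where each trace carries spans with operation names and durations in microseconds. Below is a hedged sketch of the kind of offline latency analysis described above, run over a hand-written response that mimics Jaeger's JSON shape; the service and operation names are made up, and a real client would fetch this JSON over HTTP rather than inline it.

```python
# Response shaped like Jaeger's /api/traces output (illustrative values).
SAMPLE_RESPONSE = {
    "data": [
        {"spans": [
            {"operationName": "cart.get", "duration": 1200},   # microseconds
            {"operationName": "db.query", "duration": 48000},
        ]},
        {"spans": [
            {"operationName": "cart.get", "duration": 900},
        ]},
    ]
}

def slowest_spans(response, top=3):
    """Return (operationName, duration_us) pairs, longest first."""
    spans = [
        (s["operationName"], s["duration"])
        for trace in response.get("data", [])
        for s in trace.get("spans", [])
    ]
    return sorted(spans, key=lambda s: s[1], reverse=True)[:top]

worst = slowest_spans(SAMPLE_RESPONSE)
```

This is exactly the debugging workflow developers use Jaeger for: sort the spans, find the slow hop between services, and dig in there.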
So in a short span, much greater adoption than we saw in the prior generation of tools, for many of the reasons I mentioned on the prior slide. To get a broader perspective on this, we wanted to bring in Karl, who has been a Fortune 500 CIO for much of his career and has seen a lot of these trends. So I'll turn it over to you, Karl.

Okay, Scott, sound check, everything good there? You can hear me? All good. Well, thank you, and it's great to be here with this crew. Great to be talking to infrastructure and operations professionals, because we all know the development community tends to innovate, innovate, innovate, but then they kind of forget about the management, the operations, the monitoring, and all these things that have to happen. As a senior technology executive, and one that has seen multiple generations of this, when you look at value, you look at it through three different lenses: addition to the top line, reduction of cost, and of course risk. So let's talk a little bit about each one of those. From a revenue perspective, poor monitoring can impact your net promoter score. It can impact your customer stickiness or abandonment, depending on how you measure it. It also lowers your agility, and frankly it can create pure reputation issues, which leads to some of the risks you may run. In terms of cost, you know how frustrating it is to have an outage in your modern environment while you're chasing your tail trying to figure out where it is. Is it the VM? Is it the Kubernetes
pod? Where is this thing failing? Like Scott said, the complexity has gone up, the number of components continues to go up, and so it's difficult to figure out where the outage could come from. In the case of OpsCruise, what they do is predict some of these things, because they have a machine learning model that is actually looking at the behavior of the application. I would say that there's a fourth element. Go ahead, Scott.

No, I was just curious: of those three, and we have a lot of people on the phone that are probably trying to justify these kinds of things to bosses like you, which of the three is the hardest to measure, and any tips?

Well, we'll start with the easiest. We all know what our cost structure is, so the easiest one is cost. Sometimes the hardest one is revenue, because nobody looks at operations as a source of revenue. But certainly, if the system goes down and you cannot manufacture what you make, or you cannot process what you issue, depending on the business that you're in, or if the system is down and you happen to be in healthcare and you can't take care of patients, then revenue is an issue, and reputation is an issue on the risk side. So those two tend to be harder; cost is always easier. But I'll offer a fourth one, which is not on this page: experience. This fourth one could be a wrapper for all three of these. It's very hard to measure, but net promoter score could be something you measure. And when you have your crown jewels in the digital world inside of these modern environments, the impact is significantly magnified. When your app is down, and that is the channel your consumers are using to get to you, it will not be very hard for them to just swipe left
and remove that app, whatever you're using, and go on to the competitor's app. That's why this is so important.

Okay, good, good.

So given all that, the question is: why are we using yesterday's generation of systems monitoring and applying it to today's modern workloads? It's a question you have to ask yourself. You have a couple, three choices, really. Use something from the past that has been retrofitted, and deal with intrusive agents. Worse yet, ask application developers to instrument their code and put telemetry in their code so that you can monitor it. We all know, after 30 years of trying that trick, it doesn't work. These legacy systems-monitoring technologies are also very noisy, with lots of data, and require manual intervention. They tend to be siloed: focused on the mainframe operating system, or focused on the distributed database side, with Oracle or whatever you may be using. They're also proprietary, so you're locked in. So trying to take legacy systems-management technology that has been extended to, quote unquote, manage Kubernetes is not the answer. They're intrusive and they're very expensive. Which leads to my next point, Scott, on the next slide.

Can it go the other way? Can the modern tools manage the legacy apps, or is that too much?

Not likely. It is not likely that the modern tools, say for example in the case of you all with OpsCruise, you're likely not going to invest in managing CICS on the mainframe, right? You're going to leave that to IBM; they did it 30 years ago. Could you have a potential integration with logs or messaging? Maybe. But the pipe dream of the single console, the single intergalactic console that gives you everything you need from all of your systems: we tried that with HP OpenView in the '90s, and we all know where that led, right?

Right. Yeah.

So this is additive. And why wouldn't you? You have modern workloads;
you manage them with modern systems-management technology: observability and prediction of the behavior of your application.

Got it. Got it. Okay. Good.

Which leads me to my next point: why would you use a proprietary system? On the next slide, Scott. Right there. Why would you use proprietary, closed-source systems, which typically lag in innovation in the operations and infrastructure space, instead of using the power of the community, the hundreds of developers, and it's going to continue to grow, behind Loki and Prometheus and all the frameworks Scott talked about? They're going to continue to advance them. They're going to use the power of the innovation that comes, typically, out of Silicon Valley. And you should expect nothing but continued innovation over the proprietary, closed-source type of systems from your traditional vendors out there. Now the question is how you'll integrate these and how you make that work, and that's something we're going to learn more about. And then finally, think of what OpsCruise is doing as the commercialization of the CNCF, the Cloud Native Computing Foundation. Think of them commercializing what is delivered through open source, so that you don't have to roll your own, which we have done for past technologies, right? Everybody remembers trying to roll your own distribution of Linux, downloading those distributions and then figuring out how to make them work. Well, Red Hat made that easy. Same thing with a vendor like Cloudera.
They made Hadoop easy, as opposed to downloading five things and then making version 7.6.2 work with version 3.4.7. That work is being done for you with this commercialization of the CNCF systems-management, observability, and other families of software within what Scott just talked about. That's what they do, and so you can try it yourself. And with that, Scott, I'm going to turn it over to Alok.

Great. Thanks, Karl. Alok, you want to drive?

Thank you. I'm just trying to get my screen back up here. I'm going to take myself off video so I can share my screen. So, picking up where Karl and Scott left off, let's talk about what the OpsCruise solution looks like. Once we have this open ecosystem, with the microservices, with Kubernetes, Prometheus, and all the collectors, what do we need to do? We have a very dynamic environment, and in order to stay ahead of it, as Karl mentioned, we need to be predictive rather than reactive.
So what OpsCruise does is build on top of that open-source monitoring framework that comes with these Kubernetes environments, and essentially understand the application, which we'll talk about in a minute. Once we understand the application, as it's changing, we detect issues, because we build a behavioral model of every component that comprises the application. That allows us to go into fault isolation, because we understand the distributed structure, and then causal analysis, because at the end of the day that's what you want to get to. The whole idea, of course, is to reduce the possibility of outages and stay ahead of the curve. Now, something contrarian that we do, as we have mentioned, is we don't think ops or SRE teams should be guessing which metrics to look at. There are so many components; you can have hundreds or thousands of containers and services, each with tens of metrics, and trying to guess all of them and then figure out the thresholds is not feasible at all. So one of the advantages is building an ML-driven behavior model contextually, and then understanding when those behaviors have changed or there's a problem. We want to do this, of course, without instrumenting the code or touching the application. We also avoid having to maintain the open-source tool framework, because that's already there. Why are we doing this? As we said, if we can be predictive, that reduces outages. That means your top line is up; as Karl was mentioning, your net promoter score is up, your revenue is up, your agility is up. Your apps and ops teams are more productive, there are fewer issues to handle, and of course monitoring costs come down. So then, let's talk about how this works.
If you think about the left-hand side as all those open-source frameworks we've already talked about, we essentially, to use a term one of our customers coined, take the digital exhaust: we pull that data in and use it to understand and build out the intelligence of the application, both its behavior and its structure. Then, in the background, for every component or service that comprises the application, we build a predictive model. Once we have that, we can find out when there are deviations. Again, as we said, you don't have to look for metrics or search for what to watch; the change of behavior across the whole Kubernetes estate will tell us when there's a problem. Because we know the structure, we can then use it to do fault isolation. And if you can isolate the fault, and we have enough detailed information from our ML model, which can also provide explanations of where the model shifted and why those shifts happened to cause a potential problem, that allows us to narrow down and fence off the area we need to focus on, and then find the corrective actions, because we've narrowed it down to a granular level. Now, if you can automate that, as we can in specific Kubernetes cases, you can close the loop. The reason the cycle repeats is that applications change: infrastructure changes, scale-out and scale-in happen, code changes are made. So this has to happen on a continuous basis. Essentially, what OpsCruise wants to do is help your ops and SRE teams stay on top of it continuously, running in the background to help them stay ahead. It's important to understand, given all the different metrics and telemetry we're using, how we do this. One of the primary things that lets us be proactive, and lets ops teams handle it, is starting with the real-time metrics.
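OpsCruise's actual per-component models aren't spelled out in this talk, but the baseline-and-deviation idea can be illustrated with a toy stand-in: learn a metric's normal range from its history and flag samples that fall far outside it. This is a deliberately simplified sketch (a z-score rule), not the product's real ML; the baseline numbers are invented.

```python
from statistics import mean, stdev

def deviates(history, observed, threshold=3.0):
    """Flag a sample that falls more than `threshold` standard deviations
    from its learned baseline (a toy stand-in for a per-component
    behavior model)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > threshold

# Illustrative baseline: requests/sec observed for one service.
baseline = [101, 99, 100, 102, 98, 100, 101, 99]

normal_sample = deviates(baseline, 101)  # within normal behavior
anomaly = deviates(baseline, 180)        # a clear behavior change
```

The point of the contrarian stance above is that no one has to pick the metric or the threshold by hand; a model like this (however much more sophisticated in practice) is fitted per component, continuously.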
It's like watching your self-driving car, observing what's happening to see where the shifts are, as opposed to the traditional approach, which is looking at events and logs of failures and then working backwards. That's like CSI for your applications. So, using the metrics we have, and you'll notice there's one other entity we've added: we also look at the cloud infrastructure. Using these open-source frameworks and the cloud infrastructure, as we said, we understand the application structure, we build the behavior models, and at runtime, when deviations occur, we detect problems. Any number of them can happen, and we detect them all in parallel. Once that's done, we use the global structure and the dependencies we built from the application structure to narrow down, based on those dependencies, which one is the likely cause. And if you have specific events, say a failure in an application component, we can then go to the logs to confirm, as opposed to starting from the logs and working backwards. That's how logs tie into this sequence. Then, as you see at the bottom of the screen: many times failures and problems happen because a change was introduced. The change could be an application code change, something in the infrastructure, or something that happens in Kubernetes. Because we also continuously capture the structure and the application topology, using something we call time travel that takes snapshots, we can tie those changes back in and close the loop. Finally, how does tracing come in? As opposed to trying to do full tail sampling,
once we narrow down the focus of where the problem is, we can do directed trace analysis, such as with Jaeger, and look at the span itself. Now, this whole sequence of how we do it, and how we integrate all of this holistically, is key to being proactive and predictive. Remember the big advantage we have in the Kubernetes ecosystem: we don't have to worry about having multiple specific tools, whether for logs, metrics, or tracing. It all comes in that one Kubernetes ecosystem, so we leverage all of it in one place. This is the beauty of working with this open Kubernetes framework. How do we deploy? As we said, we are not in the application space, so we don't have to be invasive. We just sit in the monitoring plane and collect data from those environments. As you can see, we're collecting information on the cloud infrastructure, the Kubernetes configurations, and the real-time metrics from Prometheus. We collect flows, at both layer 4 and layer 7, between every pair of services that comprise the application. And finally, we collect logs into the framework so we can confirm. All of these collectors are deployed as open pods in your monitoring plane. We collect, compress, and send the data to our SaaS controller, where we do the processing, and then feed it back directly to operations for real-time viewing, alerting, and tying into their existing incident-management and ticketing systems. So again: a full open framework, without being intrusive, while building and leveraging intelligence on top of that framework. It's a good time now to switch into demo mode. Let me do a little setup here. What you're seeing is a sample simulated e-commerce application. Look at these gray boxes; there are six of them.
These are standard Kubernetes services. However, two of these, the load balancer at the ingress as well as the database, are also leveraging SaaS offerings, which is typical, because not everything in your Kubernetes estate and your application deployment will be container services; they could be SaaS, serverless, or API calls. As an example, we're using an Elastic Load Balancer as part of this, as well as RDS for the database service at the back here. The reason this matters is that when we go to the demo, you'll see how we've auto-discovered and auto-built the structure: from the load balancer to the web service and cache, to cart management, to the database on the bottom. The blue ones are your typical pods; there are eight of them, and this is all running on a five-node Kubernetes cluster. Using those collectors, we can essentially build out and start providing that capability. So we want to do two parts here. One, we'll talk about how Prometheus enables open-source monitoring. We'll spend a few minutes there, because, as we said, Prometheus allows us to collect metrics from pretty much everything, in this case the Kubernetes services, and of course there's the open-source dashboarding capability, Grafana, so we'll leverage that and show you how it's done. Then the question is: how do we add observability?
First, understanding the application and its structure as well as the topology, which we call visibility; that's what we'll show as the first step using OpsCruise. Second, what we call our behavioral analytics, which leverages our ML; we'll give an example of how we proactively detect a problem that leads to fault isolation. And third, tying back with time travel to see what changes may have caused it. Now, just to make this interesting, we will inject a failure mode into that application, on something that is, as we call it, under the radar. The application has a cache element; we will change it and reduce the cache hit ratio to see what the implications are, and whether our system detects it. So without much further ado, let me switch screens into the live demo. While we're waiting, it's a good reminder for people who have questions to submit them through the Q&A button. We welcome those questions and can start parsing through them. So what I just did, folks, is move into the Prometheus monitoring using Grafana. What you're seeing here, and I can actually update this, it's been running for a while, is, at the top, if I scroll up, the CPU utilization of all the services. If I scroll down, you can see the different ones: the database server, the cache, the NGINX. As we go down, we are continuously getting the CPU usage on a per-service basis. There are about six of those eight services shown; some were outside the scope of this. So: cart cache, cart server, DB server, NGINX, payment. Those are the six services we talked about.
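The time-travel step mentioned above amounts to comparing the discovered topology at two points in time. Here's a minimal sketch, under the assumption that each snapshot can be represented as a map from a service to the set of its downstream dependencies; the service names mirror the demo app but the snapshots themselves are invented.

```python
def diff_topology(before, after):
    """Compare two topology snapshots (service -> set of downstream
    dependencies) and report edges added and removed between them."""
    added = {s: after[s] - before.get(s, set()) for s in after}
    removed = {s: before[s] - after.get(s, set()) for s in before}
    return (
        {s: d for s, d in added.items() if d},
        {s: d for s, d in removed.items() if d},
    )

snap_t0 = {"nginx": {"cart"}, "cart": {"cache", "db"}}
snap_t1 = {"nginx": {"cart"}, "cart": {"db"}}  # cache edge disappeared

added, removed = diff_topology(snap_t0, snap_t1)
```

A diff like this is what lets a change (a dropped dependency, a new service version) be lined up against the moment a behavior deviation first appeared.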
We can get the CPU consumption at any time, and of course we have the full history. Similarly, we can look at memory usage; you can see the memory usage of each service. And finally, if you go down here, we can also look at the writes, primarily the writes happening on the database server. When I hover on that, you can see the number of writes happening for each of those. So, out of the box, once you've deployed Kubernetes like this, you can essentially pull up all the metrics. Now, what's the challenge? We know the metrics. What we don't know, if you remember the structure of the application, is how they relate to each other. This is what we need to build. If you were the developer, you would know that, but ops and SRE teams don't. So this is where we come in, and what we want to do is show you what it would look like with OpsCruise in place. What you're seeing here, for example, is exactly that same structure we showed you before, auto-discovered and built: the application itself. If I hover on this, you can see, coming from the ingress side, it goes into the ELB. If I click the connection, you can see we've captured it automatically from Kubernetes.
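The per-service numbers shown in these Grafana panels can also be pulled straight from Prometheus's HTTP API (`GET /api/v1/query?query=...`). Below is a sketch that parses the API's instant-vector response shape; the PromQL query and pod names are illustrative, and the response is hand-written here rather than fetched over the network.

```python
# A PromQL query of the kind behind a per-pod CPU panel (illustrative).
QUERY = 'sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))'

# Response shaped like Prometheus's /api/v1/query instant-vector output.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "db-server"}, "value": [1700000000, "0.42"]},
            {"metric": {"pod": "nginx"},     "value": [1700000000, "0.07"]},
        ],
    },
}

def cpu_by_pod(response):
    """Map pod name -> CPU cores used, from an instant-vector result.

    Prometheus returns sample values as strings, so convert to float."""
    return {
        r["metric"]["pod"]: float(r["value"][1])
        for r in response["data"]["result"]
    }

usage = cpu_by_pod(SAMPLE_RESPONSE)
```

This is the same API Grafana itself queries, which is why anything visible on a dashboard is equally available to scripts and higher-level tools.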
We know this is the ELB from the cloud vendor. It goes into the NGINX service, into the corresponding container, into the one inside the application. And when I hover on that, you can actually see the average response time between the NGINX service and its container: 14 microseconds on my port. When I go here, I can also see the connectivity as well as the response time. We have essentially discovered everything, from the ingress, to the web server, to the two containers, to the cache element, to the cart server that takes the data out of the cache, to the database. And the nice thing about this is that the ops team didn't have to enter any of these names, any of this structure, or any of these dependencies. The main reason is you couldn't even keep up with it, right? So that gives you the immediate structure. But structure is only one part of it. If I look at any of these Kubernetes services, of course, we have all the aspects of it. What about the container it's sitting on? Remember, the container is not just backing this service; it's actually sitting on Kubernetes nodes, on top of the infrastructure. So we provide you that dependency as well. Here's an example: the application view tells you the primary metrics being used, without you guessing, what's coming in and what's going out, and in which containers; then, for this service, what it shares on that Kubernetes node; and then, of course, where that Kubernetes node sits on AWS, including its storage.
So essentially you have the full service dependency map as well as the underlying dependency on Kubernetes and the cloud vendor. Let's go into that briefly. As you can see here, this gives you visibility of all five Kubernetes nodes; if I move this out here and scroll down, it's a little tricky, but you can count them: one, two, three, four, five. And if you look at the IP address, it tells you exactly which Kubernetes node it is, all the allocation, and how it's allocated across the containers sitting on it.

In fact, we want to give you a real-time view of that. So this is an example where you're trying to understand how much has been consumed in real time across all the key services; as you can see for the database and the cache, how much CPU and how much memory. Why is this important? Kubernetes can make eviction decisions if you did not set the right limits: if one of them is not set properly, an existing guaranteed pod might take more of the resources and others can be evicted. This data allows us to be proactive, making sure pods can be moved around for balancing, and it also tells you what the sizing should be.

Now let's go back to the application view. Of course, we can understand the metrics: if you look at any of them, remember, all of that is coming directly from Kubernetes, and you know exactly what each component is consuming over whatever time frame you choose. We don't have to guess, because we know which metrics are important and which are not; for example, there are no writes going to this component. This is what allows us to build the behavioral model. So let's switch to that piece: how do we understand problems?
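Before moving on, a note on the eviction behavior mentioned above: it is governed by Kubernetes QoS classes. Pods with no requests or limits (BestEffort) are evicted first under node pressure, and Guaranteed pods (requests equal to limits) last. A deliberately simplified single-pod sketch (the real classification examines every container and every resource in the pod spec):

```python
def qos_class(requests, limits):
    """Simplified Kubernetes QoS classification for one pod.
    requests/limits: dicts like {"cpu": 0.5, "memory": 256}; empty if unset."""
    if not requests and not limits:
        return "BestEffort"      # evicted first under node pressure
    if requests and limits and requests == limits:
        return "Guaranteed"      # evicted last
    return "Burstable"           # evicted before Guaranteed when over request

print(qos_class({}, {}))                                            # BestEffort
print(qos_class({"cpu": 1, "memory": 512}, {"cpu": 1, "memory": 512}))  # Guaranteed
print(qos_class({"cpu": 0.5}, {"cpu": 2}))                          # Burstable
```

This is why getting requests and limits right matters: they decide which pods get evicted when a node runs hot, which is exactly the risk the real-time allocation view is meant to surface.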
The way we understand problems, as we said, is that instead of guessing, we have built behind the scenes a predictive model, collecting data as it runs for every container and service that comprises the application. You'll notice there are two containers that look red here. What that means is that the system detected a behavior change. As I mentioned earlier, we deliberately reduced the cache hit ratio. The implication, as you and I know, is that the service has to pull more data from downstream because it's no longer in the cache, even though the total number of requests coming into the e-commerce service has not changed. Can we detect it? The answer is yes. If, for the same number of requests, the cache hit ratio is no longer the same, the system will say: let me tell you what I found. I found that I'm transmitting more data out, on what we call the supply side, relative to the total request count. This goes on continuously, and we can provide these insights even though the cache component hasn't actually failed.

And of course we can tie it to the application and the metrics history, which go back and show when the change happened. You will see exactly when that change occurred, how much traffic there was then, and also the amount of CPU it took to process. That gives you a proactive view, a way to start looking when things happen that might cause a problem. You'll notice there is also a problem here on the database side: it went red. If I click on that, it says: hey, I've got more CPU processing going on, probably because I have more data to process. If I go to the insights again, I can tie it back to the metrics history, and sure enough. So the question is: why did it go up? When someone on an ops team sees this, how do they decide which problem caused the other?
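The detection just described, same request volume but a drifting cache hit ratio, can be sketched in a few lines. This is a deliberately naive fixed-tolerance version; the model in the demo learns the baseline and its normal operating regions rather than using a hand-set threshold:

```python
def hit_ratio_alert(requests, hits, baseline_ratio, tolerance=0.05):
    """Flag a behavior change when, for the same request volume,
    the cache hit ratio drifts from its learned baseline."""
    ratio = hits / requests
    return abs(ratio - baseline_ratio) > tolerance, ratio

# Learned baseline: about 90% of requests served from the cache.
ok_change, _ = hit_ratio_alert(1000, 905, 0.90)    # normal operation
bad_change, _ = hit_ratio_alert(1000, 640, 0.90)   # misconfigured cache
print(ok_change, bad_change)  # False True
```

Note that in the second case nothing has failed: every request still succeeds. The alert fires on changed behavior, which is what makes the approach proactive.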
So one of the things we do, which we call ops tracing, is to provide something for the ops team in real time, as it's happening and before any failure occurs, to pull up those services. Because we know the structure, we know all the paths, and we can see what the dependencies are. So one of the things we can do is look at the metrics across those paths. I'm going to increase the window to about an hour, and because it's going to be kind of small, I'll change the screen and go to a detail view. It's hard to see the detail here, so let me show you the detail view of how they are tied together. This is essentially taking the path from the cache through the cache manager to the database. Let me switch around, show a little more detail, and zoom in on the parts that are useful to look at, so it can help you understand what's going on. Give me a minute here to switch screens.

Coming back here, what you will see, and hopefully you can all see this: if I look at the database server itself, you'll notice that the total amount of data going in, both from the cache and from the cache manager, starts increasing at about the time when that change happened. Not only that, you can see the CPU utilization went up, which is what happened on the database, and finally the total amount of writes went up. Because we know this relationship, we know what led to what: which is the cause and which is the effect, in real time, by looking at the metrics and understanding the behavior and the structure. This allows ops teams to get ahead of problems, to say, hey, if the cache is invalidated, can I act on it? It allows us to do fault isolation without having to do detailed offline tracing.

Finally, how does this tie into what change happened? One of the things we can do is monitor changes, because we do something called taking snapshots.
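One way to read the cause-versus-effect reasoning above: given the discovered dependency path, the metric whose deviation from baseline starts earliest is the likely cause, and later deviations downstream are its effects. A toy sketch of that onset-ordering idea, with invented per-minute samples (not the demo's actual data or algorithm):

```python
def first_change(series, baseline, threshold=0.2):
    """Index of the first sample deviating more than `threshold`
    (as a fraction of baseline), or None if it never deviates."""
    for i, value in enumerate(series):
        if abs(value - baseline) / baseline > threshold:
            return i
    return None

# Hypothetical per-minute metrics along the discovered path cache -> db.
metrics = {
    "cache_hit_ratio": ([0.9, 0.9, 0.6, 0.6, 0.6], 0.9),
    "db_bytes_in":     ([50, 50, 52, 80, 82], 50),
    "db_cpu_percent":  ([20, 20, 21, 24, 35], 20),
}
onsets = {name: first_change(s, b) for name, (s, b) in metrics.items()}
# The earliest onset sits upstream: the cache change is the likely cause.
print(min(onsets, key=lambda k: onsets[k]))  # cache_hit_ratio
```

Because the structure tells us these metrics sit on the same path, time-ordering their onsets separates cause from effect, which is the fault isolation being described.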
We can look at when these changes happened at any point in time. As an example, if I look at this, I took a snapshot about an hour ago, and you can see these containers are blue: there were no changes. A little bit later, about 26 minutes later I think, is when we introduced that change in the cache. If you click back, you can see they went through it again and nothing else failed. The question is what caused this change, and what we can provide, as this is happening, is to say: go look for the differences, because, as we know, changes cause problems. If I click on this, we know exactly at what time those changes happened and when the database started increasing CPU. At this point, if that change had been a code change, I could roll back and essentially restore the application. In this case we can't, because the configuration on the cache was wrong, and we have to decide whether to increase it. So hopefully that gives you an idea of what we are talking about: how to do proactive observability on the application using this open framework.
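The snapshot comparison just shown boils down to diffing two point-in-time configuration captures. A minimal sketch (the component names and config keys here are hypothetical, echoing the cache misconfiguration in the demo):

```python
def snapshot_diff(before, after):
    """Compare two config snapshots (component -> config dict) and
    return only the components that changed, with old and new values."""
    changes = {}
    for comp in before.keys() | after.keys():
        if before.get(comp) != after.get(comp):
            changes[comp] = (before.get(comp), after.get(comp))
    return changes

snap_t0 = {"cache": {"max_memory_mb": 512}, "db": {"max_conns": 100}}
snap_t1 = {"cache": {"max_memory_mb": 64},  "db": {"max_conns": 100}}
print(snapshot_diff(snap_t0, snap_t1))
# {'cache': ({'max_memory_mb': 512}, {'max_memory_mb': 64})}
```

The unchanged database config drops out of the diff, so attention goes straight to the component whose configuration moved at the time the symptoms began.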
So let me quickly conclude. If I go back to the slide and just summarize what we just did: what we have essentially done is give you a view of how we went about it. Leveraging what we call a frictionless, non-invasive, existing monitoring framework that comes with Kubernetes, we use it to build a model of the application: understand the structure, understand the behavior, and all the dependencies, so that we know what's going on and have a holistic inside view of the application. Without adding any code instrumentation, you can see those changes, those flows, and those dependencies live, and without adding any additional heavyweight infrastructure. And we use that intelligence in our runtime mode, with predictive capability, to detect and isolate faults. Remember, there is no need to start guessing and deciding what to look for or what thresholds to set, and no trying to figure things out by correlation, which does not imply causation. That means fewer false positives, fewer false negatives, less work, and lower MTTR; lower MTTR means higher availability. With that, I'm going to pass this over to Scott, and thanks for taking the time.

Thanks a lot. So we'll switch to Q&A now. We've got some questions that came in via the chat window and some others through the actual Q&A, so let's check those out; I've answered some of them already. This one's probably for you, Alok: application and infrastructure behavior itself changes quite frequently from release to release. For example, introducing a layer between two services, like Apigee, will cause increased delay and possibly also increased CPU and memory usage. Another example would be reduced or no caching due to the introduction of personalization.
How long does it take for OpsCruise to learn that this is the new normal?

Great question, and very relevant. One of the things we do is, whenever we see a change, detected from Kubernetes events or other events (I didn't show this, but a namespace change or an infrastructure change, for example), we pull that up from the cloud layer or from Kubernetes. When that happens, we switch from a pure predictive mode to a learning mode and start collecting data for any component that changed or any new service that's been introduced. Technically you could start getting an understanding of the application's behavior right away, but really what you want is a wide range of operational behavior, because what our ML model does is understand the correct regions of operation for different demands and services. So typically, in the default mode, we get a full understanding of an application within about 24 hours and can start making predictions. More importantly, because we cannot see the full range of demands or requests up front, every time we see a new change or a new demand whose data we have not seen, we go back into learning mode to collect that data and make incremental updates. So if a change happened because you introduced a new service or made an infrastructure change, we would detect that and kick that process off.

Okay, good. Alok, and then I see a second one here: are you able to see the exact data in the data flow, given the lack of encryption?
And maybe either you or Shridhar can take it.

Yeah, I think Shridhar is our resident expert on this. Right now we are collecting the data at the host level; the question assumes the app is not encrypting it, so we can collect the data there, because that's what permits us to pull it out. If it's coming out in the clear, we can see it, and that's the way the industry is going. But I'm going to let Shridhar comment on this, because he's tracking it much more closely. Shridhar, do you want to add to that? Shridhar might be on mute. You know, if you want specific answers, we can follow up.

Yeah, sorry, sorry, I had my headset on mute. I trust you can hear me properly. Regarding that question: it depends on the environment we are set up in. In certain cases we get visibility into the clear text, in which case we can look at details like the HTTP headers and do some level of packet-inspection-style analysis. If we are operating in a service mesh environment, we have specific configurations for Istio that give us what we need. And in cases where we absolutely don't have access to the clear text, we operate at layer 4, but we still discover all the connections and all the interactions, who's talking to whom, and we are able to wire all of that into the rest of the information we get from Kubernetes and the cloud. I hope that answers the question.

That reminds me, Shridhar: one of the things I did not show in the demo, given the time constraint, is that we capture not only the bytes and packets at layer 4 but also layer 7, with URL-level request counts and response times. I didn't show that in detail.

Okay, good. Thanks, Alok and Shridhar. Another one in via chat: other monitoring vendors have started to integrate with these same CNCF tools. Is that any different from what OpsCruise
is doing?

I can take that one. Yeah, it's quite different. From the ground up, we've architected on top of these tools; we're embedding them, and we're providing support and distributions for them. Most of the traditional legacy monitoring vendors out there have just taken the data from these tools and pushed it up to their cloud or their central platform; they haven't fundamentally re-architected their stack to sit on top of this. So it's quite a different model: with them, you'll be on your own to support, maintain, and upgrade those tools. And further, in the case of our deployment, the central repository for this kind of data is those tools themselves, so you can use it, as we talked about earlier in the presentation, for more than just monitoring; you can use that same data for capacity planning, security, and so on. The legacy vendors tend to suck up all that data and store it in their cloud or central platforms, and you pay for that. Ultimately, we only pull the data we need for our analytics, and the central store remains those CNCF tools. So it's quite a different approach. Obviously we had the advantage of being a much younger company, starting in a new generation.

And Scott, aren't you, Alok, Shridhar, and the rest of the crew all veterans from the usual-suspect companies?

Yeah, yeah.

So you took some of the lessons of the past and you're applying them. I mean, these were conversations we had: if we were to do this all over again, instead of having, you know, BMC or HP or Dynatrace, pick your legacy package, you had an opportunity to do it over. And so here we are, with something that you all have created for the modern world, the next generation.

Exactly, exactly.
Yeah. And this non-invasive approach reduces the impact on the application; that's huge. That's the part I love for the application development crew, because, as a senior leader in IT with a purview over the voices from both infrastructure and operations, which can sometimes conflict with application development, you've got to listen to both, and you've got to say: folks, we all have to get along. That's the reason I love DevOps, and this is very appropriate for that change in philosophy: instead of the application developer throwing it over the fence, it gets integrated, and OpsCruise brings that integration and helps with that culture change for DevOps.

Yep, yep, absolutely. Good. One other one in chat, Carl, that might be a fit for you. Let's see: when you have embraced these types of open-source tools in your organizations, what's been the primary driver? Is it a cost dimension? Is it innovation?

Innovation, innovation, innovation. I mean, certainly there's a cost element to it, but nothing is free. Nothing is free. If you think you can just download the open source, do the free thing, and ride on top of it, then you become the integrator. So it does make sense to bring in some commercialization, which OpsCruise does for this CNCF set of frameworks. But at the same time, in this day and age, it's innovation; it's leveraging the power of the community. There's a cost issue, and there are also legal and security ramifications. You've got to make sure your lawyers are on board; you've got to make sure your CISO is on board. Just downloading open source and running it can bite you; I know, it happened to me maybe five or six years ago, when we got a lawsuit from a company that was a patent troll, because the developers had downloaded a little itty-bitty piece of software and embedded it in their stuff.
So yeah, as always, you've got to involve all the parties: your attorneys, your compliance people, your vendor management people, in addition to the IT professionals.

Got it, got it. Good insights, Carl. Okay, and then there's another one here in chat, maybe for you, Alok or Shridhar: how scalable are these tools like Prometheus and Loki? I've seen them be pretty pervasive in startups, but are they really ready for large enterprise environments?

Sure, I'll take a first shot at that. The good news is that in the last two years, scalable options have actually emerged for Prometheus, and they're also CNCF projects. The two dominant ones are Thanos, which gives you horizontal scaling as well as the ability to collect more data, and Cortex. And we have actually had customers, Scott, if you remember, that were using Thanos to scale. There was one customer on a call that had about 29, no, 39 sites they were collecting data from and federating, and they were using Thanos; and we've come across more. Now, Loki is a little newer, but Loki has a similar philosophy in its architecture and in how it scales out: the backend data store scales out through sharding, and the collectors also scale out horizontally.

Okay. Carl, do you want to add to that, or is there something on your mind?

No, no, just watching the time.

Yeah, so I'd just add that those tools were invented, or incubated, in some of the largest IT infrastructures in the world, like Google and Uber. And you'd be amazed: the largest brick-and-mortar retailer in the world is using these platforms; the largest telecom in the world is using these platforms. So you'd be surprised at the level of scale they support.

So Scott, as we end here, I've got a question for you.
How do I use this stuff? How do I get it? How do I get my hands on it?

Isn't it obvious on the slide, Carl? That's like a setup question. So, yeah, you can go to opscruise.com and just sign up and register on our portal. If you have an existing Prometheus or Loki environment, there are a few adapters you'll install; or you can take the full package and install that in your environment. You'll start getting visibility in about 30 minutes, and the analytics will start producing results in about a day. So feel free to jump in.

The last and most important thing: we've got a short drawing here, which is why all of you are really here, for this Oculus VR headset. I've scratched all the names onto little pieces of paper, and I'm going to draw one out of my secret mug. So, I've got Matt. I don't know what organization Matt is from, but he's the lucky one.

Do you have to be on the call to win?

I think he is on the call, because I checked at the beginning and took a screenshot of the names. So Matt is our winner this morning. Congratulations, Matt. I'd like to thank everybody for attending. We will post the webinar online on our YouTube channel over the next couple of days, and if you have any other questions, feel free to drop us a note. Thanks again, Carl, for making the time this morning.

Thank you all, everybody. Enjoy the rest of your day. Thanks so much.

Thanks. Thanks, everyone. Thanks for joining. Bye-bye.