Excellent. So hello everyone, and thanks for joining. My name is Cesar Quintana. I'm a Principal Solutions Architect here at OpsCruise, and I'm joined today by Matt Sarabian, an engineering manager at Avis Budget Group's Zipcar. We're here today to talk about Avis's digital transformation and how they're leveraging OpsCruise as part of their observability stack, alongside open source tools, many of which are CNCF projects under the Linux Foundation.

I'll let Matt introduce himself as well, but just a little bit of my background: as I mentioned, I'm a Principal Solutions Architect. I've been in the observability space for probably going on 10 years at this point, and in the IT field for over 15, so I've been around a while and I've seen a lot of things change. Open source is some of the coolest stuff I get to work with day in, day out, along with a lot of Kubernetes, so I'm really excited to be here with you all. And with that, I'll let Matt introduce himself.

Hello, my name is Matt. I'm an Engineering Manager at Zipcar, which is part of Avis Budget Group. I've waffled between building products and working as a consultant helping other people build products for a while now. I've worked on some pretty large-scale web properties like TMZ.com, I've done traditional IT administration before all the cool cloud stuff came around, and I've been a full stack engineer as well.

Thanks, Matt. So, as mentioned, we'll jump right into it. Avis is in the middle of a big transformation project leveraging Kubernetes and open source — Matt will talk more about that. They started leveraging the open source tools but still saw a gap for things like a smart layer and data unification across all these observability tools. As with any transformation project, there are issues you get away from as you move to a new architecture, but as you leave some problems behind, you open up new ones. Thankfully, we were able to partner with Avis to help them solve some of the new observability gaps that were created as the migration progressed.

So first we'll let Matt talk about that project, how they're leveraging open source, how they came to OpsCruise, and some of the things they do with OpsCruise. Then I'll give you a brief tour of what OpsCruise is, we'll do a demo, and then jump into Q&A. As was mentioned earlier, there's a chat box for Q&A, so please ask questions away; if it's something we can answer on the spot, we'll do that, otherwise we'll leave it for the Q&A portion of the session. With that, Matt, I'll hand it over to you.

Yeah. So as you may or may not know, Zipcar is owned by Avis, which is a pretty large multinational company; Zipcar also operates in a couple of countries. We had a pretty large technical footprint. Avis has been around a while — they've got mainframes that will probably never go away, and maybe some of you can relate to that. Zipcar, being a little newer of a company, was first into the cloud.
We adopted containers really early, and we had a hybrid cloud environment when we were acquired by Avis, with a path to bring everything into the cloud, which is pretty much where we are today. Our mantra in building out that platform at Zipcar — and it continued as we started thinking about how to bring Avis to the cloud — was: how do we find the parts of the platform that are most commoditized, in either open source or the marketplace? We wanted to stop doing those things ourselves when the stuff out there was doing it better, because as platform engineers it's sometimes hard to know when you're adding value, and you're never adding value to an organization by competing with Amazon or with a large open source project that's obviously winning in the market.

So we wanted to find things like EKS that let us leverage a great container runtime, and then bring teams into that new cloud native way of working. Teams that were traditionally on-prem in virtual machines — where you stand up the infrastructure, you have a provisioning step, and there are various levels of automation involved — we wanted to move to full GitOps: people making pull requests, artifacts built in dev and pushed across all the environments. There's no such thing as a pre-prod build or a prod build anymore; there's a container that's built in dev, and we push it out in a controlled manner. The idea is that if you do that, you incentivize teams toward small, controlled releases instead of big, giant, formal releases where release managers may or may not have context for all the technical changes. You want rapid iteration and the ability to release 24/7. Whether a team decides to release at 4pm on a Friday, maybe we'll leave that to them to decide — but they could, and that's what's important about delivering a reliable platform.

At this point we asked: how do we do this in a way that's affordable? Our early decision, based on what we had learned at Zipcar, was to keep as many things as lean as possible. We didn't want a big, steep curve here, because cloud native is new for a lot of these folks.

Why do I say mainframes may never go away? Generally because legacy applications pay our checks, and it's very hard to move off of them — but we're trying it at Avis and making good progress.

The big part is that if you're used to building those kinds of things, you want the people who know their applications to worry about writing application code, and not have to think about the platform — or necessarily about monitoring — on day one. The 90% use case we find is that developers want to know: is my app up? Is its networking working as it's supposed to? Is routing working? What's its resource consumption — how much CPU, how much memory? Those are the things that generally get people through the majority of early development. Then later they say, oh, I'd really like this custom metric, or there's this specific piece of business logic we might test.
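To make the "built once in dev, promoted everywhere" idea concrete: one common way to implement it — a hypothetical sketch, not necessarily Avis's actual pipeline — is a Kustomize overlay per environment that pins the exact image digest dev produced, so promotion is just a pull request copying the digest forward. All names and the registry below are illustrative:

```yaml
# overlays/prod/kustomization.yaml -- illustrative names and registry.
# The same digest moves from dev -> staging -> prod; nothing is rebuilt.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: connected-car-api   # image reference used in the base manifests
    newName: 123456789012.dkr.ecr.us-east-1.amazonaws.com/connected-car-api
    digest: sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
```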
We wanted to be able to add those things on as teams moved through that maturity model, and we wanted to skip the awkward stage some companies go through when they move to the cloud, where everybody's doing it a little differently. Do you really have a containerized microservice, or do you have a VM that's been stuffed into a Dockerfile, treated as golden, with secrets baked in? How do you go all the way cloud native?

At Avis there was a team that was really ripe for this and presented a great organizational inflection point for us: the connected car team. They were definitely feeling the pain of slow iteration cycles. They were hosted on a VM on-prem, they had dependencies on the mainframe, but they themselves were not yet tightly coupled to it, and they already had experience with some degree of open source monitoring. They were excited to modernize. So we worked together and we all learned — our platform team paired with their app development team and we learned a lot of the cloud native stuff together, with executive support to get it done. In the end, we had a full CI/CD environment: developers could spin up ephemeral namespaces with copies of their services that were totally isolated to test with, move through each of the environments with tests, monitoring, and alerts — all the major use cases people coming from VMs were used to being able to debug and look into.

Everything we had at that point was completely open source; we hadn't coupled with anybody yet. We really wanted to stay focused, so we didn't do any POCs. We weren't looking for additional things because people weren't ready for them yet. This was great — it allowed us to really control costs, and we were able to show a lot of success and a lot of wins.

For example, we adopted Loki for logs really early on at Zipcar. We had been using the ELK stack, and ELK is cool, but I have a personal pet peeve: I think most developers don't actually want full-text search. They don't want stop words removed and periods removed; they just want to grep their logs. But we call it full-text search, and for a lot of people that sticks. The downside of moving away from ELK — and there's really only one, since we saved tons and tons of money; a 40-node Elasticsearch cluster for logs is pretty wild, and Zipcar was generating about two terabytes of logs a month — is that with ELK we didn't really have to worry about how our developers used logs. They would search for stuff, and if their thing wasn't full-text searchable because it had punctuation in it or words that were going to be removed, they didn't always get great results, but they figured it out. By and large it just worked; it was expensive, but it worked. With Loki we had to spend some time really learning how our developers wanted to query logs so we could tune it right. In the end, that pain point is certainly worth what we're saving: our logs today cost basically nothing compared to the hundreds of thousands of dollars they cost us when we were running the ELK stack.
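For the "developers just want to grep" point: Loki's LogQL works exactly that way — select a stream by labels, then filter the raw lines, with no tokenization or stop-word removal involved. A small sketch with illustrative label names:

```logql
# substring filters, chained like grep pipelines
{app="connected-car-api", env="prod"} |= "timeout" != "healthcheck"

# or a regex over the raw line
{app="connected-car-api"} |~ "order_id=[0-9]+"
```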
Prometheus — I would be shocked if anybody on this call hadn't heard of Prometheus. It's really cool. There's lots of open source support for it, built-in exporters, and we can get metrics from tons of sources, so that was kind of a no-brainer for us. As I mentioned, the connected car team — our initial team going cloud native and moving everything in — already had some experience with it, so it was a great use case for us, and it's performed really well.

Grafana is kind of incendiary: some people love it, some people hate it. Some people say it's really hard to build dashboards; some say it's really easy. Now there are open source dashboards you can load by ID, so it's a mixed bag. Generally it worked really well for us. There were open source dashboards we used, and as teams got a little further and started building on Prometheus metrics, they made custom dashboards. It could still feel like you're going around to different places in Grafana, and if you don't know where to look, it can be a little hard to find information sometimes.

Istio and Kiali are great open source projects as well. In the beginning, all we really needed were Envoy features. A lot of people were excited to deploy Istio, but it wasn't until we really needed true service mesh — service-to-service communication governance — that we had to actually deploy those pieces, and they're working great.

There's a question that's just come through that I'd like to answer live: it's about infrastructure-as-code for Grafana dashboards. That's a great question. While we were doing this, that was the one con of Grafana — we were like, oh man, how are we going to do this? In Kubernetes we had config maps, and we had everybody commit their source: they'd build the dashboard, export the JSON, and commit it back into the repo, and when Kubernetes deployed Grafana, our automation would see that and generate the config maps. And this kind of was awful — it worked, don't get me wrong, and it got people used to committing dashboards to version control, which is awesome, and it was definitely able to load everything into Grafana that way — but it was cumbersome. If people were new to it they didn't always know where to go, and if they forgot to export from Grafana, you missed the changes. In the end, what we did with Grafana is hook it into RDS as a backing store. Now that we have more people on our platform, we just say: hey, we encourage you to commit these into your repos — it's good to have a copy of the JSON — but we're backing it with RDS, and people just edit as they like. So we kind of lost that battle. We still commit ours to source control; the problem is that when you do that, you can't edit it in the Grafana app, so there's a bit of a learning curve there.
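For context on the dashboards-as-code approach Matt describes, here's roughly what that pattern looks like as a sketch — using the `grafana_dashboard` label convention that the Grafana Helm chart's dashboard sidecar watches for. Names are illustrative, and this isn't necessarily the exact automation Zipcar ran:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: connected-car-dashboard
  labels:
    grafana_dashboard: "1"   # the sidecar loads ConfigMaps carrying this label
data:
  # the dashboard JSON, exported from Grafana and committed to git
  connected-car.json: |
    {"title": "Connected Car", "schemaVersion": 39, "panels": []}
```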
And this was cool: in the end, other teams wanted this. They were like, this is amazing. We had a lot of success, a lot of open source wins, and other teams said, I want to be able to iterate like that, I want all of those things to move my team forward. That was great for us. The problem is, we've been talking about that 90% use case, and now this is the 10%: with connected car we were completely embedded, and now there were lots of new teams. Maybe they were brand new to this; some of them were brand new to Git. To move to fully cloud native, you have to fetch your secrets at runtime, everything's containerized — it's Kubernetes, it's Dockerfiles, it's everything you're not used to if you've been building applications on a mainframe, or FTPing and SCPing JARs and WAR files to VMs. And people didn't always know what to monitor: sure, we had OpenTelemetry, but what are they going to monitor? A lot of people had never seen these tools before. CNCF and the Linux Foundation have great trainings, and that's an awesome resource, but if you're selling people a platform of rapid iteration and they get there and find a cliff they have to climb, it's a bit tough.

Then we had teams that wanted more advanced features, like tracing. What we've found is that if a team doesn't have a truly microservices architecture, it can be hard to use tracing. A lot of teams left to their own devices — and this is not a critique — tend to build distributed monoliths, especially if they're pulling from something monolithic. You think you're breaking it up, and then when you're actually done and look at what you built, a bunch of things need to be versioned together all the time. It's not necessarily as cloud-native microservices as it could be. We'll talk a little about how OpsCruise solves this high-level need for dependency maps, getting people to see where those tight couplings may exist.

We also had this issue: when connected car came, they already had a plan — something they wanted to break up, and everybody agreed on the architecture; we were involved at that point. But now there are too many teams for us to be involved with every single one. How do we make sure that what the architecture committee agreed to, what was rubber-stamped, is actually what was built? Our saying is: if your architecture diagram isn't live, it's probably out of date — somebody may have changed things. And this is where OpsCruise came in for us.

We had this high-level issue of: now that we have all these different things running, who's going to tell us if our monitoring stack is down? OpsCruise is a great partner for that because they're an external entity. If our telemetry stops flowing to them, we get an alert that says something's wrong with your Prometheus — which has happened; we've had Prometheus storage fill up on one of our replicas, and we were able to see, oh okay, there's a problem with our Prometheus. Some people will tell you that in Kubernetes you can't monitor the cluster you're trying to observe if the monitoring stack runs on that same cluster, and that you need a second monitoring cluster. The problem is, then you have two clusters to manage, and who watches that cluster? So in our case, having a partner like OpsCruise has always been a goal of ours: something watching our stuff, so that when we're not getting any alerts, we know whether everything is fine or whether Alertmanager is down.
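A common way to implement that "who watches the watcher" heartbeat — kube-prometheus ships a variant of this as its Watchdog alert — is an always-firing alert routed to an external receiver (the role OpsCruise plays here), which pages you when the signal stops arriving. A minimal sketch as a PrometheusRule:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: watchdog
spec:
  groups:
    - name: meta-monitoring
      rules:
        - alert: Watchdog
          expr: vector(1)   # always true, so this alert fires continuously
          labels:
            severity: none
          annotations:
            summary: "Heartbeat; if this alert STOPS arriving, the monitoring stack itself is down."
```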
So, the other thing is open source support. We're bringing in a lot of new tools and new teams, and the learning process, as we were just asked about in the Q&A, is something we're still working through. What we did is identify about eight key points that were necessary. Some of them are cloud native things: why database migration tools are good, why you have to fetch container secrets at runtime and shouldn't put them in deployments, general Git workflow stuff. Others are based on specific tools we picked, like how to write a Mapping to do routing in Ambassador for Envoy. So we have this eight-point bullet list that we ask teams to self-select against: if a team is going to start deploying stuff in containers, and they have somebody — or many people — on the team who knows most of the things on that list, they're going to be able to move fast. If they don't and they need to learn all of those things, you set the expectation with leadership that they will move fast eventually, but don't set aggressive deadlines, because everything is new. That learning process is definitely a challenge.

We haven't integrated Jaeger yet, because a lot of the teams are still not at the point of doing full distributed tracing. What OpsCruise lets us do is look at how our services are talking to each other at a high level, which is just as good for a lot of our cases — it's that dependency map. We've been running it for over a year now, and we're actually thinking it's time to turn on Jaeger; some of these teams are ready for it, and it's going to be great. You'll see in the demo these great live, real-time inventories of the stuff in AWS and the stuff in Kubernetes. You can see: yes, these things are built — and I'll discuss in a moment some real-world cases where that's helped us debug issues.

One thing that has always impressed me about OpsCruise is that they're pretty honest about what machine learning can bring. I hear a lot about how machine learning is going to make our jobs easier, and with connected car we had a really good use case for machine learning that doesn't produce those useless alerts that say, hey, your CPU spiked 300% — and it went from 0.01 to 0.03. I don't care about that; that sounds like a service doing its work. I have plenty of CPU; that's not an alert. I get it, it's an anomaly — who cares? Connected car had a great use case: we're ingesting lots of telemetry from different providers, and I don't mean OpenTelemetry or Prometheus metrics — I mean vehicle information, vehicle telemetry. It's really easy to tell when a provider is down — hey, you're not getting any data from Ford, you're not getting any data from GM — because the data coming in has gone to zero. It's a lot more difficult to spot: hey, your data seems about 20% off of what it usually is, where "usually" takes into account the kind of sinusoidal pattern of our telemetry throughout the day as people rent vehicles, return them, drive them, shut them off. Having machine learning plugged into that, able to tell us "this is slightly off," is way easier than having to look at a graph and decide — which is just super great, because that's literally what we were doing back in the VM days: looking at a graph and going, okay, when I zoom out, the sine wave is off here and I don't know why.
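For a sense of what the ML is replacing, the hand-rolled approximation of that "off from the usual daily pattern" check is a seasonal-baseline comparison, something like this PromQL sketch. The metric name is made up for illustration; a learned model handles the seasonality and the threshold for you instead of a fixed `0.8`:

```promql
# Alert if the last half hour of vehicle-telemetry ingest is more than 20%
# below the same half hour one week earlier.
  sum(rate(vehicle_telemetry_events_total[30m]))
/
  sum(rate(vehicle_telemetry_events_total[30m] offset 1w))
< 0.8
```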
Quickly, before the demo, since we're running out of time: a few of the features I really like, with some real-world examples. We talked about tracing, whether teams are ready for it, and how the OpsCruise app map has served for us in lieu of fully instrumented distributed tracing. If you have two services you expect to communicate, you can see the latency between them. We had something migrate from on-prem that used to have its database right next to it, and the team was really concerned: hey, when we move away and the database isn't right next to us, and we have to call out to RDS, a managed service — what if that latency is too high? We were able to look in there and say: well, here's your latency to RDS, it's milliseconds or less, so you're going to be fine. They could see real-world proof that it was fine, not just have somebody say, yeah, it'll be fast enough, don't worry about it. It's really cool to be able to prove that kind of thing.

Or take workloads that are burdened by some resource constraint that's non-obvious — Kafka partitions, for example. If a container comes up and connects to a Kafka topic with 10 partitions and wants to use eight of them, that's fine. But then somebody says, hey, that container's under a lot of load, I'm going to scale it out, I'm going to add two or three more containers — they don't realize they're bound by the number of Kafka partitions, and they won't know that at least some of those containers aren't doing any work; there are no more partitions for them to attach to. In the app map we can see that: oh yeah, you've got a bunch of these things running, and those connections aren't happening. Maybe you could argue the pod's readiness or liveness probes should catch that, but depending on how it's coded, they might not. Being able to see which connections aren't happening is really awesome for anchoring discussions in what is happening, not what we think might be happening.

For Kubernetes issues it's great: there's a node map that Cesar will show you, and the machine learning there is really reliable — it lets us know when nodes have problems. We have some workloads on spot instances, so it's cool to see disruptions in there, and to get a general accounting of all of our stuff in Kubernetes. And for people who aren't used to Kubernetes — folks on my team, or people embedded in the dev teams, are used to running kubectl commands, but a lot of people aren't — OpsCruise has SSO. Once that was enabled, we could send links to people, draw up app maps, and say: here's what's going on, check these links. They didn't need to navigate through Grafana or know kubectl commands to see what was going on, which is awesome.
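On that Kafka-partition ceiling Matt described: a quick way to confirm it from the command line, using the consumer-group tool that ships with Apache Kafka. The broker address and group name are illustrative:

```bash
# Describe the consumer group to see member-to-partition assignments.
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group connected-car-ingest --describe --members --verbose
# Members with an empty assignment list are running but idle: with a
# 10-partition topic, consumers 11+ in the group get no partitions,
# so scaling replicas past the partition count adds no throughput.
```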
Time travel is a feature I really wanted to put on the slide as "it works," because when I first saw it — I don't think I'm a negative person, but I was like, there's no way that works. And it works amazingly. We've had containers where we were seeing evictions: a container with really spiky workloads, and we knew nodes were having eviction events, but we could not pin down the original culprit — what was the thing using all the memory — because whenever we looked, the eviction had already happened and everything had been rescheduled. Where did it first start? Sometimes the first app that's evicted is not the guilty app; maybe nothing went over its limit, something just used too much on the node. We were able to look back in time travel and say: here's the moment it spiked, and here's when all the eviction events happened. We isolated that workload onto its own class of nodes, and those problems went away completely. Are there other ways to accomplish that and get the same information? Probably. But the time travel is super cool.

There's another quick one on the slide: an ECR outage. The downside of ECR outages is that you can't pull new containers. Whatever is running is fine, but if anything restarts, or is redeployed and gets rescheduled somewhere else, it won't be able to pull its container. We had that happening to a couple of workloads during the outage, and we used time travel to see where they used to be scheduled and just pinned them onto those nodes, so we could take advantage of the node image cache and avoid an extended downtime.

I really like that we can look at the platform at a high level to answer generic questions like: how are things spread across these different availability zones? That's really useful for us. And with OpsCruise we've really found a partner that understands all these open source tools. From the very first parts of pairing with them, we could tell they were into this as much as we were. Some other partners know where their tool touches open source, but as for how Prometheus works, or what the optimal Prometheus configuration is — they may not know, or it depends on who you talk to. Everybody at OpsCruise we've paired with has been super knowledgeable about the underlying open source foundations. That's been awesome for us; it solves that kind of day-2 "what are we doing with all these open source tools" question.

Before I hand it over to Cesar, there's one more question I'll answer, about the number of teams involved in developing this and the total number of people. My DevOps group is relatively small. When we built this platform with connected car — not including the connected car development team — I think we had five or six DevOps people. Today we have about 12. And there are lots of application developers — hundreds of application developers on these platforms now, spanning at least a dozen teams at Avis, and growing, across the different groups deploying into this platform today.
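The ECR-outage workaround Matt describes boils down to pinning the pod to a node that still has the image cached and letting the kubelet use that cache. A hypothetical Deployment pod-template fragment — hostname, image, and names are illustrative:

```yaml
# Pin to the node that last ran the workload (found via time travel)
# and let the kubelet reuse its local image cache.
spec:
  nodeSelector:
    kubernetes.io/hostname: ip-10-0-42-17.us-west-2.compute.internal  # illustrative
  containers:
    - name: connected-car-api
      image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/connected-car-api:1.4.2
      imagePullPolicy: IfNotPresent   # use the cached image instead of pulling from ECR
```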
Awesome. Well, thank you very much, Matt — thanks for sharing all of that; it's quite a transformation you guys are going through. And thanks for the kind words about OpsCruise; we're really glad to hear that, and it's been very awesome getting to work with Avis as well. It's probably my ignorance, but when I first started working with them I didn't understand the level of complexity and technology really involved at Avis, and it's been great working with a team that really knows the technology, was already leveraging open source, and spoke our language around Kubernetes and modern platforms. It's a great relationship.

Now, I'm going to share my screen and we'll continue. A couple of questions have come in; I think we'll leave those for the tail end, the Q&A session, so we can first show you what we're talking about in an actual demo. Let me share my screen here.

Okay, I want to give you a tour of what OpsCruise actually is — what we do and how we do it. We are essentially a modern observability platform, built on the idea that you don't need to come in, recreate a bunch of agents, and figure out how to pull data — especially for modern workloads like Kubernetes and cloud. The cloud providers give access to their data through their APIs, and for modern platforms like Kubernetes — which, in my opinion, has practically won the orchestration war at this point — the whole question of "how do I pull data" has really been solved. There's no need for companies to find new ways to get metrics from containers — cAdvisor is doing that. You don't need to develop a new way to get data from nodes; you can scrape all the data from the exporters, have it in a single place, and leverage it. So we didn't set out to reinvent data collection. What we're really doing is two things: unifying that data — configurations, metrics, and network data — in a single place, and then building models around those entities, learning what normal operating behavior looks like for each of them, and telling people when something is wrong with their environments and their applications.

I do want to call out some of the new challenges. In modern environments, as companies move to cloud — or cloud-like setups; even in a data center, if you're moving to modern development and release practices, this still very much applies — you have complexity. You have a lot of dependencies on pieces that are out of your control: third-party applications, or simply microservices owned by somebody else's team, where you don't really have any say beyond "hey, this is maybe not working, please take a look at it." And then there's the dynamism: all the changes constantly going in. You have many people deploying, even many times a day, into production. That used to be unheard of; now it's becoming common, but it brings its own challenges — what changed? Which release broke my application?
And then the big challenge, as was mentioned earlier, is the disjointedness among tools. You have all this data: some platforms taking in your metrics, some taking in your logs, others taking your traces; configuration data you might not be doing anything with; and yet another tool for network data and understanding what's talking to what. All that data is out there, and many of the companies we talk to have it, but it's disjointed. One of my favorite words is context: when you have an issue, when you're troubleshooting a problem, context is king. If you're looking at a bunch of dashboards that are somehow related, you as an expert — or the experts you manage — need to keep that context in your head across the different dashboards, and that by itself is a challenge. That's something we're also hoping to solve by bringing all the data into a single place, with context, so you don't need everybody to be an expert who remembers everything while troubleshooting.

So what kind of data do you need for this? You need application structure: you need to understand what dependencies exist between the different entities — not only service to service, but you have a container, what orchestration node is it running on, and is that node running in AWS or in GCP? Where is that piece running? How does the application level relate down to the infrastructure level? On top of that, you need curated knowledge — we call that concept "an SME in a box": being able to tell you when something is wrong and where it's wrong. As I'll show you in a second, we don't require teams to go in, select metrics, and set thresholds; that's where the curated knowledge comes in, because we as a platform provide it without you needing experts on absolutely everything. After that, you need application state: how are those applications behaving, are they up, are they down, what kind of traffic are they receiving? We take all of that into account in our platform. And finally, you need app understanding, which builds on the app state: once you know your traffic levels — am I up, am I down, what am I talking to — that's where the ML piece comes in, and we learn which of those behaviors is normal. What does your application typically do? How does it behave at 50 requests? How does it behave at 1,000 requests? What is the CPU doing, the memory, the file system, at these different levels and types of requests? All of those pieces together are what you need to build a smart layer.

So that's what we set out to do — but how do we actually get that data? This screen shows the different tools we can leverage — there are others, but these are the core tools we leverage for bringing in this data.
You'll see, number one, that we support the OpenTelemetry standard — we're building out increasing support for OpenTelemetry as its collection pieces become more GA. But we're also taking in things from the CNCF, as Matt mentioned earlier: Prometheus data for metrics, Loki for logs, and for traces a mix around the OpenTelemetry standard — we use Jaeger as the backend, but really anything OpenTelemetry-compliant on the front end works. It doesn't matter if you use a mixture of Jaeger, Zipkin, and OpenTelemetry libraries for instrumentation; that's all supported. For flows, we take data from Istio, but also from eBPF: we leverage eBPF to look not only at which entities are talking to each other — container-to-container calls — but also to get URL-level data, again still without tracing; I'll show you that in a second. Really, once you deploy OpsCruise, you can see your whole app map in a single place without the need for tracing — though, again, we do support tracing. We take data from Kubernetes — the config data, which I'll show in a second as well. We take changes from different CI/CD platforms — Jenkins, Spinnaker, and others. And finally, data from the cloud: we go to the cloud and bring back the different entities powering your environment. You might have an EKS cluster in AWS run by 20, 30, 50 nodes — maybe more — auto-scaling in and out all the time, much like Avis's environment, and we support all of that.

Here's a quick view of our architecture. As mentioned, we take data from Kubernetes, from the cloud, and from containers; we also support virtual machines, and serverless, which is kind of bundled into "cloud" here and abstracted away. We take data from all of those, and we leverage the open source platforms I mentioned: Loki for logs, Prometheus for metrics, and the Jaeger backend for grabbing traces, with the containers and virtual machines feeding those platforms. On top of that, Kubernetes itself and the cloud feed data to our gateways.

We really have five types of gateways — I know this is getting a little into the weeds, so I'll speak to it briefly; if you have questions, we can talk more in the Q&A. There's a metrics gateway, which takes data directly from Prometheus; the trace gateway, the middleman between the Jaeger platform and the OpsCruise cloud; the Kubernetes gateway, the middleman between all that Kubernetes data — object discovery and so on — and our cloud; the cloud gateway, which does the discovery of cloud entities and cloud configuration and sends that data up to the OpsCruise SaaS; and finally, the log gateway, which in this example interacts with Loki. These gateways are super lightweight containers — a couple hundred megs of memory and maybe a quarter of a CPU, super lightweight pods — and it's one per telemetry type, mixed and matched as you like. If you say, you know what, I'm not using tracing — that's fine, we don't deploy the trace gateway.
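For scale, the footprint Cesar describes would look something like this as Kubernetes resource settings — purely illustrative numbers, not taken from OpsCruise's actual deployment manifests:

```yaml
# Illustrative per-gateway pod resources matching the described footprint.
resources:
  requests:
    cpu: 250m      # ~a quarter of a CPU
    memory: 256Mi  # ~a couple hundred megs
  limits:
    cpu: 500m
    memory: 512Mi
```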
The point is that these are super lightweight, secure containers that just take data from your existing platforms and send it off to OpsCruise. Any questions about the architecture, we can discuss as part of the Q&A.

There are a couple more slides, but in the interest of time I'm going to skip over them. I do want to call out some of the types of problems we resolve: application slowdowns, app crashes, and misconfiguration of different entities — think improper or misconfigured load balancing, or connection pool saturation for databases. A lot of observability platforms are now starting to add Kubernetes support, but we were built from the ground up in the Kubernetes world, so we do a lot of Kubernetes-layer problem detection: things like missing config maps causing containers and pods not to come up, or node evictions — say you have high disk utilization and your node starts kicking out your pods; we'll detect that and alert you on it. For cloud infrastructure: availability of nodes and volumes, and again, load balancing performance applies here too. And, as mentioned earlier, serverless: abnormal performance, cold start delays, startup issues, and so on. There are a lot of different types of problems, so let me jump into the demo so you can get a real taste of what OpsCruise actually does.

This is our OpsCruise demo environment. We do have support for dark mode — actually, I think this view lends itself a little better to dark mode. By default, by the way, I have some grouping going on here. We collect a bunch of rich data about these entities — not only the monitoring and observability data, but also the configuration. A quick tour: these little stacked squares are pods; if I double-click on any one of these entities, I can see the containers running inside those pods. As mentioned earlier, we grab cloud entities too — this is an Elastic Load Balancer running in the cloud. These are the Kubernetes services. We also show calls out to third parties — in this case, OpsCruise itself is a third party this environment is sending data to, as mentioned in the architecture — and things like calls coming in directly from the internet. We also have an RDS instance — here we go. If I click on that RDS instance, I'm in the context of that instance: I can see its configuration — the allocated storage, the port being exposed, the subnet it's part of, the maintenance details — a whole bunch of data related to that RDS instance.

Now it changes when I look at, for example, this Redis pod. If I click on the pod itself, I'm looking at metadata for the pod: for this Redis pod I can see the labels attached to it. We pick those up automatically, and we leverage them for the grouping you saw a little earlier.
Here are the IP addresses, and again the ports being exposed, and we can also look at metrics and logs and drill down. Actually, let me find something a little richer — let's say node exporter; I'll pick on node exporter because I know it has logs and such. Same thing: I'm picking up labels and different pieces of metadata, and I'm looking directly at the entire manifest. If I click on detail view, I'm now looking at the whole manifest for this particular Kubernetes workload. As Matt was mentioning earlier, teams that don't know how to use kubectl don't need to run a `kubectl get pods -o yaml` and all of that; now they can just say, I'm looking for node exporter — okay, here it is; what's the running config? There it is. And from here you can also drill down into metrics: now I can see the CPU utilization and memory utilization for this particular set of pods.

One of the cool things, when you're trying to find out what you're talking to: I'm going to find the Prometheus pod, because I know it's talking to a lot of different exporters. Here's our Prometheus pod. If I click on Prometheus, I'm looking at Prometheus in that context — I can also see the config maps attached to it, by the way; this is actually the scraping rules and so on. And I can click on Connections — by the way, I'm switching back to the light theme — I can click on Connections for Prometheus, and now I see everything Prometheus is talking to. Prometheus is going out and scraping these different targets, and I can see which ports are being scraped: I can see it scraping node exporter over port 9100. I can look at all this data — the layer 4 data: bytes in, bytes out, packets in, packets out — and, where supported, layer 7 data. Let me look at averages: I can see cAdvisor taking about one second on average.

All of this is available without tracing. Again, we do support tracing, but this particular view is built without it; we're leveraging eBPF. About a minute or two after you deploy OpsCruise, this whole map — which you can play around with, including how it's organized — is built automatically and updated in real time. As you deploy different workloads into your clusters and environments, you can see which services are interacting with each other, over which ports, with what latency, and that feeds into the rest of our anomaly detection, which I'll show here in a second.
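Behind that connections view is ordinary Prometheus scraping. A minimal sketch of the relationship being drawn — static targets for brevity; real clusters typically use `kubernetes_sd_configs` for discovery:

```yaml
# prometheus.yml fragment
scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets: ["node-exporter:9100"]   # the port seen in the connections view
```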
Now, there are different pieces — for example, our cloud view. This is an EKS cluster inside AWS, by the way, for this particular demo. If I click on a pod — actually on a container — I get what we call the three-layer view, which shows the application levels of the container: the container is running on top of this Kubernetes node, and I can see the node name it's running on and some of its neighbors on that node; in turn, this node is running on this EC2 instance, so I can see there's an instance called lab2-node4, along with the EC2 type and the region. And if I click "Go to Infra Map," that takes us to what Matt was mentioning earlier. Now I'm in the context of that particular node — I'll push this panel away for a second — and this is what Matt was talking about: where does your data live? Are your entities living in the proper subnets and regions? We map it out: you're running inside AWS, in the us-west-2 region and availability zone; you have a VPC here with a couple of different subnets, and all of these are EC2 instances. This is constantly updated to reflect the latest state of all your workloads deployed inside AWS. I can click on an EC2 instance and look at all the metadata attached to it, just like you saw for the pods — the root device name, the status, whether it's running — and then I can drill into metrics just like we did earlier. All of this data is available, and again, constantly in real time.

Let me take a quick look at the questions here. "Do you have a desire to monitor more detailed packet contents that can't be obtained by OSS monitoring tools — for example, the actual network between switches and routers?" That is not the core of our application; we're much more at the application level and the services interacting with each other. If there's interest in that, we'd love to talk about it and can take it offline, but currently we're not going that deep into packet inspection at this point. Let me see — "Which framework or technology is the OpsCruise web app built on?" We can take that offline, and I'll have to talk to the team; I know we use a mixture of different technologies — I know there's some Angular in there — but that's about as much as I know about what the actual front end is built on. "Can I link through OpsCruise directly to the service interface UIs for some of those services, such as Prometheus?" Sorry, I'm not sure I fully understand that question; if you want to clarify it a little more, I'll be happy to answer. And "What is the pricing model for OpsCruise?" I'm not the guy who talks about pricing — and that's not just me dodging the question, I really am not — but if you go to opscruise.com there's a chat box, or you can email info@opscruise.com, and we'll happily answer that. We do charge by the node or by container, and it's all-in-one pricing: it's not like different features are priced differently; you get it full-featured for the same price. There's also a good question about how long our logs are stored — I'll answer that in just a second, because I know we're coming up on time and I want to show one last piece first: one of the cool things we do with the ML.
I want to show this particular problem. This is our problems screen, by the way — click on Alerts here. I've filtered it, but there are lots of different types of alerts you can see: response time breaches, ML alerts that our machine learning has detected, some deployment issues. For example, this payment server is not coming up — we can see that of its replicas, only two are available — and we can tell you why it's broken if you click this Analyze tab. This really cool fishbone feature, our fishbone RCA, tells you what's actually broken. In this particular case there are some runtime failures and some startup failures, particularly around pod scheduling, and I can see it right away: insufficient CPU. This workload is trying to be scheduled, it's requesting 2000 millicores, and there's not enough CPU for that. That's why it isn't coming up.

But what I really wanted to show is this one: we have an SLO breach on this NGINX service. There's a bunch of data here as to why this triggered — especially for any ML people out there, all the details are here — but the fun view is under Analyze. What the Analyze view does is show us a slice of the app map. Nobody built this map, and nobody came in and set any thresholds: it automatically detects that there are violations here, and then we look upstream and downstream, up the stack and down the stack, to find related issues. This is the root of our anomaly: this particular NGINX failure, at 15 seconds. OpsCruise has automatically detected issues downstream — in red — that are causing problems, and I can click on them and see what's going on. The cart cache has this spike, in red, as you can see here: all of a sudden these different metric types have violated — response time increasing drastically — and we call out the values: what's normal, what's anomalous, the percentage increases. And response errors, it looks like, went from zero to 20 across the transactions. So we call these pieces out — but is it the cart cache, or is something downstream causing the issue? The cart server: that would make sense as the reason we're suddenly getting a bunch of errors. And finally, if I go to this last leg, it looks like we don't have a pod able to serve, because we have an invalid image: somebody's using an empty image with a default tag, which is never going to load. We did all of that automatically — I know that was rushed, but nobody came in here setting thresholds or defining rules; this is just something the OpsCruise platform does out of the box, with the combination of all the data we talked about earlier plus the machine learning that's in place.
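For reference, here's how those two demo failure modes look at the pod-spec level — the names and values are illustrative, not the actual demo manifests:

```yaml
# Fragment of a pod spec.
containers:
  - name: payment-server
    # A bad image reference (e.g. an empty or unresolvable name with the
    # default :latest tag) leaves the pod stuck in ImagePullBackOff.
    image: payment-server:latest
    resources:
      requests:
        cpu: "2"   # 2000m; if no node has that much CPU free, the pod stays
                   # Pending with a FailedScheduling "Insufficient cpu" event
```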
With that, I'll hand it back to the question that was asked: how long do we keep the logs? The reality is that most data is stored by default for about 15 days. We're not meant to be the long-term store of your data — that data is already living, in this example, in Loki, and you have full control of your logs. We're not going to siphon off your logs and then charge you 10x in cloud storage costs. That's something Matt was alluding to earlier: you own your own data. You can attach as many gigs of storage as you want to Loki and go back 50 years of logs if you want. OpsCruise is really leveraging that data for real-time analysis of your problems and issues, so we can tell you: hey, this is what's wrong, this is what's broken — and you can drill down contextually into the logs right when you're looking at a problem; in the context of a problem, you can click on logs, configs, events, and so on. After about 15 days, we no longer hold those logs, but you still have them fully available inside the Loki platform — and likewise inside Prometheus for metrics. Hopefully that answers the question.

That's definitely one of my favorite things about working with you guys: as mentioned earlier, you really embrace the open source side of it. We can do the kind of log storage we may need for compliance or whatever, and keep what we need, without having to pay double for it, which is just really awesome — we keep control of those things. And the same with custom metrics: I don't have to think, oh no, I made extra custom metrics and that's going to change how much I'm being charged. No — it's just Prometheus metrics. You're one of the few monitoring vendors I've found in my experience that doesn't try to lock you in, which I think is cool.

Yeah, and that's exactly the philosophy: the philosophy of open source and of not being locked in is also the philosophy OpsCruise continues. Right now OpsCruise might be the best solution for you; if at some point you need to pivot away — we hope not — you have that capability. We're not locking you in and saying, nope, now that you've moved away from OpsCruise, you lose all your telemetry data and have to find a different way to monitor and collect. That's not what we're about. I know we can continue answering questions — and you should know we do have a free version of OpsCruise that supports up to five Kubernetes nodes; if you want to sign up, just head to opscruise.com.

Another question that came in: which alerting and ticketing integrations are available — ServiceNow / Microsoft Teams, basically? Yes — all of those and more. We support basically any modern notification system: you can do SMS, we can send out to ServiceNow, Teams, and so on — really any modern platform with an API, and email as well, if you're so inclined. Email alerts are just so wrong. — The world runs on email alerts! I thought that's what makes the world turn, maybe.

Any other questions? Here's one: "If I set one year for log storage, can I search and analyze that data through the OpsCruise UI?" In the OpsCruise UI we hold that data as a temporary copy, and again, after about 15 days it's gone — but the underlying data in Loki, running in your cluster, is still all there.
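On that one-year example: in Loki, that's just a retention setting you control. A minimal sketch using Loki's compactor-based retention — these are real Loki config keys, but the period is arbitrary; set it to whatever compliance requires:

```yaml
# Loki config fragment.
compactor:
  retention_enabled: true
limits_config:
  retention_period: 8760h   # keep one year of logs
```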
By the way, if you aren't leveraging any of these tools and you have a brand new, greenfield environment, we'll deploy these underlying tools for you — we'll deploy Prometheus, we'll deploy Loki, and so on — and those endpoints are accessible in your cluster: you can go directly to the Loki endpoint, or use it through Grafana.

Or in Grafana as well, yes. We usually do long-term log analysis and pull things just through the Grafana-Loki integration. We really view OpsCruise as our real-time anomaly detection — the first place you go to debug an issue — as opposed to, as Cesar said, the source of truth for long-term storage of these things; we take that on ourselves, based on our own compliance requirements.

Okay, thank you so much to Matt and Cesar for their time today, and thank you to all the participants who joined us. As a reminder, this recording will be on the Linux Foundation YouTube page later today. We hope you're able to join us for future webinars. Have a wonderful day. Thank you so very much.