Hello and welcome again to the OpenShift Commons briefings. We're really pleased to have Apurva Davé from Sysdig here to give us a talk on best practices for monitoring your OpenShift PaaS. I think actually PaaS is a dead word now; I think we call it a container application platform or a cloud application platform, but yes, PaaS, we're PaaS today, and we're really happy to have you here. It's going to be a demo-rich presentation today, I can tell, so don't worry, you don't have to go and change it right away. That's perfect, but I'm going to let- I don't know how I did that, but that's how we do this. There we go, live, a CI/CD workflow, live. All right. Anyways, I'm going to let Apurva introduce himself and take it away, and you can ask questions in the chat. One of his colleagues, Knox, is on the chat, and we'll try and save most of the Q&A for after the presentation. But if you have a pressing question, please raise your hand and we'll interrupt him. Take it away.

Good morning, everybody. Or good afternoon or good evening, wherever you are. Thanks so much for joining us today. My name is Apurva Davé, and I'm really excited to have you along on this ride on best practices for monitoring your OpenShift container application platform. As Diane noted, I think talk is cheap, PowerPoint slides are cheap, so we want to do as much demo as we can here. We're going to be demoing a real OpenShift environment. We're going to be showing you Sysdig alongside of it, and really what I want you to walk away with here, regardless of whether you're interested in Sysdig or not, is a few key best practices that you can take and apply to any OpenShift environment to help you understand how to really operate it in production, and how to get the intelligence and visibility you need to really understand what's going on.

So a couple of things here. First of all, I love to talk, but I love to answer your questions more than anything else. Please, please keep firing those questions in. That'll make this much more entertaining. We'll try to answer them in line if we can, and if not, we'll take them out of band and answer them offline. A couple other things here. We'll be doing a bunch of demos, and they're all live environments, so fingers crossed. I'm hoping the demo gods are with me this morning.

Now, with that, let's dig in, and I want to start by pointing out that what I'm going to talk about today is actually not one big challenge, but the convergence of two challenges. Operating containers in production is a huge challenge in itself, and as we'll talk about in a moment, the technical underpinnings of containers require you to change much of the tooling around your deployment and operations process. But there's a second challenge, which may not be obvious as you start fooling with containers: if you are looking to build a true container platform, then the process by which all of your platform consumers monitor and troubleshoot their environments will be really different from what you've seen in the past. So you bring those two changes together, and it's a little bit of a science experiment. You put the two right chemicals into your fake volcano, and you watch it overflow, right? All right, so before we go deep into technical details: I have a usual gimmick. I'm not doing my usual gimmick today.
My usual gimmick is that if you can stump me with a question, then I'm gonna send you a cool T-shirt. But today, I'm just so excited to be here with the OpenShift community that I wanna send everybody a T-shirt. So if you wanna drop me your name and email in chat, we'll make sure you get one of these T-shirts. If you give me the mailing address, great. If not, I'll just email you directly, we'll get your mailing address, and we'll send you one of these shirts. Sweet. So again, find the chat window, send me your name and email. This is kind of my way of also getting you to use chat first, so that you don't feel so shy asking me questions. Right on, here we go.

All right, so let's talk about the first challenge first and the second one second. Do you remember the first one? It was simply about monitoring containers. Why is it different? Fundamentally, microservices and containers, you will find, break your legacy monitoring and analytics tools. And this is the origin of Sysdig: we're the first company that can natively monitor any infrastructure, including container-based ones. I wanna go into another level of depth here about why containers break these legacy tools. In order to do that, you have to understand a little bit about Sysdig's history. The creator of Sysdig was one of the co-creators of Wireshark. For those of you who aren't familiar with Wireshark, it is the premier network analysis and sniffing tool. It has about 20 million users in its open source community. And Loris, the co-creator of Wireshark, realized, as containers were coming about, that there was an entirely new problem here to solve in terms of not only how you understood what's happening at the network layer, but what's happening at the application layer as well. And so he launched Sysdig's open source project in 2013. You can go to sysdig.org to see our command line Linux visibility and troubleshooting suite. On top of that, we launched a commercial product that helps customers visualize, monitor, and troubleshoot distributed environments. And that's what we sell. And not surprisingly, part of the reason we're here: we have deep integration with OpenShift. We're in use at major enterprises and we're OpenShift Primed. Hooray for OpenShift.

All right, so now let's go deeper. Why did Loris, when he created Sysdig, see a challenge with containers and insight into containers? Here we have to talk a little bit about what containers are and why they work the way they do. I like to think of containers fundamentally as a black box. And a black box can be really good or not so good. Black boxes are really convenient for development. This abstraction layer, this ability to wrap up individual processes inside a lightweight form of virtualization, is really great for portability. It's easy to take the environment that you're running on your laptop, bring it to your cloud environment or your container platform, and get the exact same experience running your container. Awesome for developers who are trying to deliver reliable code into production. But that black box comes at a cost. How do you see inside that form of virtualization? The second big challenge is that as you move containers into a production environment, you typically have some form of an orchestrator that's moving and scaling these containers without intervention. And this is the key part: without intervention. So you don't have a human who's sitting there all the time saying, well, looks like we're getting a lot of MySQL queries, maybe I need to triple the number of containers that we have. You don't typically have that kind of experience; you're looking to automate it further.
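To make "without intervention" concrete: in OpenShift terms, that automation can literally be one line. A minimal sketch, assuming a hypothetical mysql deployment config:

```
# scale the (hypothetical) mysql DeploymentConfig between 1 and 10 replicas
# whenever average CPU crosses 80 percent, with no human in the loop
oc autoscale dc/mysql --min=1 --max=10 --cpu-percent=80
```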
The next step, which I kind of alluded to, is that you then have services that are built on many distributed containers. So if you're trying to understand how your MySQL service is doing, that means a lot more than simply understanding what one container is doing. Now you have to understand what all three or all five of your containers are doing, which, by the way, may be on separate nodes or may even be in separate physical data centers. And you may not even know where they actually are. If you're, say, the developer who's just consuming a container platform, you simply wouldn't know. So these challenges make it very, very difficult for your average last-generation monitoring tool to instantly give you the information you need.

So while the black box is great for developers who are trying to bring new code online, when you think about the operations aspects that those developers need to deal with, or that your DevOps team needs to deal with, you need actual visibility. You need to see what's running inside the container. You need to be able to aggregate across containers in order to see what information is there. And you need to be able to react to not only what the containers are doing, but what your apps are doing.

So let's talk about how we address this challenge. There are a couple of ways to address it. One way, which I'll talk about a little bit later, is what I call sidecar containers. That's the idea of taking every single pod that you deploy in OpenShift and putting a sidecar container in that pod to share resources, share fate, and do some internal networking to try and collect information. It's pretty complex. And again, this is where one of Loris' insights when he founded Sysdig came in, which was that there's a cleaner way to do this: by providing a new form of system instrumentation that we call container vision. When you deploy a Sysdig container on a host, we can automatically tap into the system calls of that host, and we can see every single system call that every container executes. That means we can understand what the application is and what it's doing, and we can understand the underlying host resources as well as network resources. So from a single point of instrumentation, you can see app, container, host, and network. That's pretty powerful.

So I'm gonna go down one more layer here. This is our hypothetical host in operation, with the Linux kernel down at the bottom. Well, you can tell this is hypothetical because it seems pretty unlikely that you're gonna take a host and run a couple of non-containerized apps, a Docker container, a rkt container, and an LXC container side by side. But humor me, humor me for a second here. Okay. So now we see that obviously all of these containers would be interacting with the kernel in order to execute commands, whether those are accessing underlying resources or guaranteeing their own resources. And it's there where Sysdig inserts a thin layer of instrumentation so we can see these system calls. And from these system calls, we can then derive a significant amount of information: host, network, and container metrics.
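If you want to poke at that same system-call stream yourself, the open-source sysdig CLI from sysdig.org works from identical instrumentation. A minimal sketch; the container name here is illustrative:

```
# stream every system call made inside one container; -pc adds
# container context (name and id) to each event
sudo sysdig -pc container.name=mysql

# rank running containers by CPU, derived from the same syscall stream
sudo sysdig -c topcontainers_cpu
```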
We can also see custom metrics: metrics that have to do with how the application is performing, as well as things your developers have instrumented, like users abandoning a shopping cart, or a flag when a particular process or function is hit within your application. Based on all of these system calls, we can also automatically determine what's running inside the container. So just looking at the system calls, without any human intervention, we can say, oh, well, look, this is the custom Node application, this is the API server, this is the MySQL database, and so on and so forth. Once we know that, we can tap into the MySQL database to collect the metrics it naturally exposes. And so we auto-discover these application metrics without you having to intervene. And you can imagine the power here when you're operating in a true container platform mode, where you may not even know exactly what all is being deployed on the platform, but you're responsible for guaranteeing visibility.

So that application auto-discovery process is critical to instrumentation, because it's not enough in today's world to simply monitor container shares and host CPU. Instead, what I need to know is: how well are my endpoints performing on my API server? How slow or fast are the queries to my database server? And what's going on with garbage collection in my Cassandra database? These are the types of metrics that really help you monitor and troubleshoot your environment. So here's a quick glance at all of the auto-discovered application components we have in place. This is all customizable. So if you have a different app, if you create a custom application, you can customize Sysdig in order to collect that information as well.

All right, so let's keep going here. So now we've instrumented the kernel. We're seeing these system calls. We turn them into metrics. What do we do with them? That's where the Sysdig container in user space actually comes into play. We pipe all that data to this container on your host, where we can aggregate the information and send it up to our software. We're running that software either as a cloud service, or you can take our software on premise and deploy it in your environment. And that can be your cloud environment or your private cloud or your data center; doesn't matter to us. Either way, you get the same experience of our monitoring and troubleshooting application, which of course I'll be showing you in a moment. But the key here is we can aggregate this information across thousands and thousands of hosts, tens of thousands of containers, and deliver it to you in whatever form you want, whether that's us operating the service or you operating the software.

All right, so it's time for our first demo. We're gonna do a really quick instrumentation demo, because our instrumentation is so powerful that I love showing it off. All right, so here's what we've got. Let me make sure my terminal is operating correctly. Diane, quick check, just making sure you can still see my terminal window. I can, and that font is just barely big enough. Let's step it up. There you go. There we go, okay. All right, so while I'm doing this demo and right after this demo, I'm gonna stop for questions. So if you do have burning questions, let's use this as an opportunity for you to get them into chat right now. Okay, so here is my blank environment. This is a very sad environment. There is no data here to visualize. So let's fix this.
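For reference, the raw install that follows boils down to a single docker run command. This sketch follows the standard agent install of the era; the access key and tag values are placeholders:

```
# run the Sysdig agent on this host; privileged mode plus the host mounts
# let it load its kernel module and observe every container's syscalls
docker run -d --name sysdig-agent --privileged \
  --net host --pid host \
  -e ACCESS_KEY=<your-access-key> \
  -e TAGS=env:demo \
  -v /var/run/docker.sock:/host/var/run/docker.sock \
  -v /dev:/host/dev \
  -v /proc:/host/proc:ro \
  -v /boot:/host/boot:ro \
  -v /lib/modules:/host/lib/modules:ro \
  -v /usr:/host/usr:ro \
  sysdig/agent
```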
We come over here to agent installation, and you see here we've got some pretty standard installs: Docker, Linux, Kubernetes. Kubernetes and OpenShift work very similarly; here's our OpenShift install, and I'll link you there. For this first one, we're going to do a raw install, okay? So all I need to do is copy this docker run command and come back to my terminal. You can see here I've got some containers running; I've got a small WordPress application backed by a MySQL database. Great, and now what's happening here is Docker is running the Sysdig agent, and the dash d flag is basically saying run it in the background. Now, here is one of the key trade-offs you need to make in order to get the great visibility that I described, which is that we run in privileged mode. That gives us access to install our kernel module and see all of the data we need to see. As I'll show you a little bit later, we can isolate this Sysdig container through OpenShift to restrict and limit its exposure to your overall team and container platform.

All right, so now let's head back over here, and we'll head over to Explore. And wouldn't you know it? Here we go: there's my IP address, and I'm starting to see CPU, memory, and network data from this host. That's nice. Honestly though, that's pretty standard, right? This is kind of what you see in any monitoring environment. Now, our Explore tab is kind of like htop for your distributed environment, if you're familiar with htop, a really powerful tool. But we can drill down here and not only see a host, we can see the individual containers. So I've been talking a lot about MySQL this morning; why don't we just continue that? Let me click on MySQL here, and you will see that. Let's start with an overview by process here. And you can see we can drill down in this individual container and see memory, CPU, and so on. But what's more interesting here is this is MySQL, right? So let's click on this MySQL view, and let me expand this for you. Now we're looking at very MySQL-specific information. You'll see here top queries by number of requests, top tables by number of requests, slowest queries, slowest tables, and even request types: select and set and so on. So what you're seeing here is, of course, that with one single instrumentation command, I'm now able to tell you what the slow queries running within MySQL are. This is a really powerful approach that our instrumentation enables. We're able to see your container, look inside your container, understand what's running, grab the relevant data, and combine that with underlying host and network data as well.

All right, so that is the end of demo number one. I assume if you were in the room with me right now, you would be applauding; either that, or you would be so stunned your jaw would be open and you wouldn't know what to say, because I'm willing to bet you go to very few demonstrations where people actually show you the grungy process of instrumenting a host and instrumenting containers. Diane, let's take this moment to stop and see if we have any questions.

There's one question here, actually there's two. Dale, I think, is asking: I know you can detect and display Kubernetes events. Are you going to be adding the ability to alert on specific Kubernetes events? Oh, a roadmap question already, I love it. That's fantastic. Yes, in fact, I can confidently say that our design and engineering team are working on the ability to alert on events as we speak.
I'm sure my product manager would kill me if I gave you a date, so I will avoid that, and I will instead say it's coming very soon.

All right, and then there's another one from Steve, asking: Sysdig is based on a kernel module. When we patch the kernel, do we need to reinstall the Sysdig agents? Ooh, good question. So we use a facility called DKMS, which allows dynamic compilation. So in general, you do not have to reinstall; we'll take care of any updates that we need to. I say in general because if there is a huge breaking change, there may be some modification we need or some manual intervention. That is very, very rare.

All right, and there is one more. Oh no, there's two more. Okay, they're coming in. Do you need credentials to get the MySQL-specific stats? Ah, good question. You do need credentials to get MySQL-specific stats if you've set up your MySQL database that way. And a relatively new feature we've added is basically a set of environment variables that people can include as they deploy their software, which allow us to access the credentialed container appropriately. That's really perfect for platform as a service or container application platforms, where the operator may not have access to the container itself or the credentials, and needs the developer or owner to provide them.

All right, and there's one that Knox did answer in the chat, but I'm gonna read it out just so everybody else can hear it too. Is there one Sysdig container per app container in the same pod, or are you deploying only one Sysdig container per project? That's a super question, and in fact, it's a perfect lead-in for my instrumentation best practices here. So let's talk through this. The question raised is actually my second point, which is that the best practice you should have in mind is a single container agent per host, not per pod. The per-pod model has been espoused by a lot of people because, frankly, it's a lot simpler for the people who are creating monitoring software, but it is a big negative, in my opinion, for the people who are operating environments. To be clear, for those of you who aren't familiar with the per-pod model: when you deploy or define a pod in OpenShift or Kubernetes, you can define what I call a sidecar container that would be deployed with every pod, and that sidecar container shares fate with your application in that pod. That means if one of those containers dies, they all die. So with this per-pod model, you have higher resource usage, because every pod needs a sidecar container. You have higher complexity, because you've got increased networking within the pod. You have an increased chance of breakage: if one dies, they all die. And you have an increased attack surface; there are more containers that someone can try and hack into. When you flip that around and say, look, I only want a single agent per host, you reduce all of these issues. Absolutely critical, and we believe this is a fundamental best practice for instrumenting your environment.

The second key is that platform operators should instrument as much as possible. There shouldn't be manual effort on the part of your container app platform consumer or your developer to do this instrumentation themselves. And what we mean by that is when containers are dropped into an environment, you should be able to automatically see the host and network metrics.
You should automatically be able to tag with relevant OpenShift and Kubernetes metadata. And you should have this idea of non-intrusive collection of application metrics. I'm gonna push this a little bit further in our next demo, when I actually show you an OpenShift environment. But the key is: if in that last demo I was the developer, and I was just coming in and spinning up those WordPress and MySQL containers, I should be confident that the underlying platform is automatically instrumented.

And here's another thing. I didn't show this to you in the last environment, but I believe I can show it to you in the next one: developers should be able to easily add custom instrumentation without regard to container placement. So again, if you've got complex per-pod models, if you've got complex requirements around how things need to be set up, then developers have to do things just so to ensure that their custom StatsD or JMX metrics will exit their container and be consumed by the correct endpoint. We think developers shouldn't have to think about that at all. All they should have to do is emit the metric, and the underlying system should capture it. And not only should it capture those metrics, it should auto-tag them based on container, host, application, and microservice for later analysis. So that's me on my soapbox for a moment, but you came for best practices, so here you go, you're gonna get them. And this is the way we think about instrumenting systems.

All right, so let's keep going here. Let's talk about visualization and troubleshooting. So, monitoring... go ahead. We've got one more question on the prior topic before we head on down. He is asking: say we have a scenario where we deploy a MySQL container with an environment password. End users may reset the MySQL password; we've noticed users tend to do this, and we have many situations where the OpenShift health probes fail due to password changes. What approach should we take here? A different environment with a dedicated, persistent user, or what would you suggest? Yeah, that's a very good question. So I believe this is where our environment variables for credentials come into play. As long as the user keeps those credentials updated, you won't have that problem. Now, if they don't keep them updated, then you will have that issue, but it's relatively easy to keep those up to date. So that's our best step one. Step two is you can start alerting on conditions which are associated with those metrics going away. For example, if MySQL connection counts drop, there could be two reasons: one, my MySQL environment is failing; two, we're not able to see those connections because somebody changed something in the underlying system. So you could set up that alert and have it fired back to the owner of the container, who can then say, oh, what's going on here? This looks bad, let me check it. Oh, it looks like my credentials have failed and I need to reset them.

Okay, and one more question snuck in. Yes, I love it. You wanted this, so I'm gonna keep shoving them at you. How much overhead, in percentage, does Sysdig approximately produce on a host? Excellent question, let's see here. So we'll do two things. One, if I remember when I hop into the next environment, I'll just show you, because Sysdig can monitor Sysdig. But in general, it's less than 5% CPU and about 512 meg of memory. The CPU is tunable, so you could go into the agent and set a lower, harder cap.
And what happens is, if we're consuming up to our limit in underlying resources, we will start sub-sampling data so that we don't overstep our bounds, because we realize it's more important to give those resources to your application than to our agent.

Okay, and there's one more. They're just gonna keep coming. How do developers control what information is emitted to the Sysdig system? His concern is that Sysdig should not be capturing sensitive customer data, like credit cards or Social Security numbers, et cetera. Can developers scrub data that is being emitted? Certainly developers can scrub data that is being emitted. They can always mask it in their own application, and we encourage that for highly sensitive data. Let me see if I have another answer to controlling that data. We do have some ways within our application to turn off particular forms of data collection, so that for customers who are very, very sensitive, we can avoid collecting particular forms of data. I probably don't have time to go into more detail there, but if that's a real concern, I'd say step one is making sure Sysdig works for your environment, and step two would be tuning data collection. I actually have a third answer, which we will address in the demo, which is that Sysdig can allow you to isolate sets of data for particular users. So you can imagine that the developers who work on, say, your billing system will probably need to access a lot more data than the people who work on your API server. So can you isolate those two teams and give them access to only the right data? That's another best practice that I'll get into in a moment. With that, Diane, any more questions? Let's put them on hold for a bit so that we can keep rolling here. Keep going. All right, cool.

All right, so you have this interesting situation when you get into a container application platform, which is that you've kind of got two tiers of users. You have ops or DevOps, who need to see everything: the entire platform, the performance of all systems at once. And you have teams of developers, who need to see their services or applications, not just their containers. So how do you give developers the right level of access? And more importantly, how do you give developers the ability to collect deep troubleshooting data when you don't want to give them access to the underlying platform, and resource allocation is elastic and dynamic? The developer does not actually know where their container will be running, yet they're responsible for understanding why their application is not performing the way it should. This is a really difficult challenge, and we've needed to layer on multiple ways of thinking about data in order to solve this problem. And it, of course, all comes back to that instrumentation I was describing before. With previous instrumentation, typically you would see CPU, memory, disk, and a few other metrics, and that was good enough to relate to the performance of your application. After all, in the static, monolithic age, you had an app, it was running on a host, and the host and the app were intricately entwined. That's not the case anymore. And so what we needed to do is create better visibility, and it comes out in three forms. First, microservice performance management: aggregate information across the containers that make up a microservice and tell me how that service at large is performing.
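The spirit of that aggregation, in stock Kubernetes terms, is rolling metrics up over a label selector instead of looking per container. A sketch with heapster-era tooling and an illustrative label:

```
# show CPU/memory for every pod carrying the service's label,
# wherever the scheduler happened to place them
kubectl top pods -l app=service2
```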
Number two is this idea of service-based access for teams. If I'm the development team concerned with the API service, I probably don't care quite as much about these other services, and I don't need to be burdened with having to look through that data unless I'm troubleshooting a complex system-wide problem. And then finally, getting down to troubleshooting: how do you go deep within this container application platform to allow a developer to see what's actually happening under the surface, without just giving them the keys to the platform and letting them run amok? So let's go deeper here. We'll walk through each of these.

All right, so you've got this environment. Here's our container application platform. So can the dev team who's covering service two, the green service, actually understand how the service is performing? Can you answer the question: what percent of my CPU shares am I using across service two? It's actually a pretty hard question, right? And imagine this not being on eight containers, but 800 or 8,000 containers. That's pretty challenging. And so one of the things we allow you to do on the fly within the application is dynamically aggregate information across containers using the metadata of your service. So you can group those together on the fly and say, okay, how is service two performing? What's the response time of my MySQL service that's currently distributed over three data centers, or my Cassandra service, or my Apache service? What are the slowest queries or the slowest endpoints? What are the key metrics here? And how can I view them in aggregate? The way we do this is through a deep, real-time understanding of orchestration metadata. As I'll show you in a moment, as that data flows in, we tap into your orchestrator, your OpenShift environment. We capture all the relevant tags for your containers and hosts and dynamically aggregate your metrics on that metadata, enabling you to answer these questions. We talked a bunch about this, but in order to do that effectively, you have to auto-discover applications at the same time. I'm not gonna spend a whole lot of time there.

Instead, I'd like to talk about this idea of service-based access control. Sysdig will not only allow you to aggregate information based on that metadata, it will allow you to isolate information on that metadata. So you can isolate your prod environment. You can isolate the sensitive billing application to a particular team or even a particular person, or only provide access to it temporarily, when you know some troubleshooting needs to be done. This is really new functionality in Sysdig, and it's highly differentiated. I think that in the old world, people were comfortable with role-based access control: hey, I'm the director of the team, therefore I have a higher level of access. Service-based access control, however, is relatively new, and is kind of a characteristic of dynamic environments like OpenShift. I'm really excited to demo this one for you.

And finally, trace-driven troubleshooting. I think plenty of monitoring tools can give you a graph that goes up and to the right, or down and to the right, when something's going bad. Plenty of tools will also give you dashboards, and it's nice to have full-stack dashboards. But there are a couple of interesting things that Sysdig can do on top of that, which make it possible to troubleshoot faster and more effectively, even in environments where you don't have complete access. First of all, we can correlate events with your systems. In addition to collecting metrics, we can capture events from Kubernetes, Docker, and any custom event that you'd like to send to our API, for you to overlay on your metrics and understand what events may have impacted the performance of your service.
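What sending such a custom event might look like, as a sketch only: the endpoint, token handling, and field names here are my assumptions about the API of that era, so check the current docs before relying on them:

```
# post a "deploy" event so it can be overlaid on the service's charts
# (URL, token, and payload fields are illustrative)
curl -X POST https://app.sysdigcloud.com/api/events \
  -H "Authorization: Bearer $SYSDIG_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"event":{"name":"deploy","description":"java-app v42 rolled out"}}'
```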
And then finally, we enable something very interesting with alerting, which is capturing a trace file. A trace file essentially gives you an entire system capture of every single system call that was happening when an alert fired. That means you can troubleshoot those containers remotely, and you can even troubleshoot them after the containers are long gone, because your orchestrator may kill off a poorly performing container, but you still need to troubleshoot it. How do you do that? Here's an example: replay the state of my system last night, when the auth service alert fired five minutes before we saw the whole app go down, and show me all the system calls from all the containers that had been destroyed. Imagine doing that with one of your legacy monitoring tools. It just wouldn't happen.

Okay, it's demo time. Let's do this. Let's pop over here. This all looks good. Make sure everything's running properly. Okay, I think we're in good shape here. Close your eyes for a moment, everybody, I need to switch dashboards. Okay. As always, keep those questions firing in. For those of you who joined late, remember, drop your name and email in chat; I'm sending a free t-shirt to everyone, because I'm excited about showing off this new demo environment.

So here we are in OpenShift. This environment is real; it's our MMO environment here. A couple of things I wanna note. You'll notice your projects here: we've got a Java app and a WordPress app, and we're gonna be digging into each of those in some detail. In the middle here, you have an interesting project, which is the Sysdig project. These are all the monitoring agents for the container application platform. Things to note here: the way this is set up, you set this up as a separate project. You can then use OpenShift's administrative controls to decide who gets access to the underlying instrumentation. This is super clean, super easy. It means that none of your developers need to worry about instrumentation; your ops team can take care of it. It's also deployed using a Kubernetes DaemonSet, so as you expand and grow your environment, it's easy to pin a Sysdig agent to every single physical host. We can drop into each of these environments, and for those of you who are familiar with OpenShift, you'll see here we've got a bunch of containers running: Cassandra, Redis, Mongo, Java. This is like database heaven, basically. We also have a client running here, which is giving us a synthetic load across this application. So that's cool. Of course, you know you can deploy some health checks, and health checks give you really basic information about what's going on here.

All right, time to hop over into Sysdig. Let me do this here, let me separate that out. Okay, time to hop into Sysdig and start examining this data. First of all, we're dropped into what's called a physical topology map. This is a dynamically created map. Because we understand all the network connections in addition to your host performance, we can automatically determine who's talking to whom. And you'll see here that we've got our various hosts, all talking to each other in various forms.
Now, we have a functionality that we call Google Maps for your infrastructure, where you can actually zoom in and see what's running on an individual host. So this is pretty cool. I can get down to the individual container, and I can drill down, frankly, all the way into the process. Now, that's great, but this is a mess. Could you imagine troubleshooting this? Troubleshooting these network connections? Wow, that's crazy. And there's no real relationship here to your application itself. So instead, let me apply this idea of microservice performance management and use metadata from our OpenShift orchestrator to reorganize the containers. Now you see the Java app and you see WordPress. And let's zoom into Java. Ah, now things start to get much more interesting. Here, you can essentially see the Java app deployment, which is an abstract concept from Kubernetes, and you can see the actual replica sets that make up that Java application. And you can see that they're all talking to Cassandra, Mongo, and Redis. So very quickly and very easily, we were able to reorganize this data, and now I can troubleshoot across a service. Now I can troubleshoot across a replica set. Now I can understand this at the logical layer of my application, and not the physical layer of my underlying infrastructure.

In a moment, I'm gonna get deeper into how we created this view and how we're able to take this service-oriented view. Before we do that, let's take an overview of our Java project here. This gives you an example of a dashboard that aggregates information across an entire service, regardless of how many containers are in it. You can see here we've broken up a request count by service. We can correlate this with events, so we can see right here we were getting an alert event, and that may help us troubleshoot very effectively. You can see some of the data we're collecting here: things like CPU shares, top namespaces, the most important resources consumed across our service, and related information like file system and memory. So you're seeing everything from the high level of how the application is performing and what type of requests we're getting, down to underlying resource utilization. These are our dashboard environments. You can, of course, create any kind of dashboards you want, entirely custom, or you can use our wizards to create simpler environments as you please.

Now let's go into the Explore view. As I was describing before, the Explore view is kind of like htop for your entire environment. Up here, we see the metadata we're using to help understand our environment. So here we're grouping things by host and then container ID. You can see a host here, and I can drill down on container ID. I can click on any row of this table, and I get that dynamic aggregation of information across whatever group I've selected. So now I'm getting an overview by process for this IP address. Very, very cool, super helpful. Great if you can drill down here. But again, what if instead I'd like to see this in something that's more relevant to my particular application? So instead, let's do this Kubernetes-oriented view here. And what's happening right now is we're dynamically aggregating this information, and you'll notice these namespaces are akin to the projects we had running in our OpenShift environment. So this is cool.
So now I can understand this in exactly the same way that OpenShift describes my logical application environment. And you can see here we have three replica sets for this Java application. I can click on this, and now I'm getting the aggregated information for this Java application. I'm gonna show you a couple more things here, and then we're gonna stop for questions before we go into the final component, which is alerting. What I wanna show you here is, first of all, we're getting an overview by process. That's interesting, but this is Java, so let's take a look at JVM metrics. You can see here that for the JVM we're collecting heap usage, garbage collection, and thread count. Actually, this environment's pretty rudimentary, so we're not really doing any meaningful garbage collection here, but you can see interesting stuff like the thread count dropping off. We're kind of setting up alerts as we go, and I'll talk about alerts in a moment.

Let's do one other thing here. Let's hop over to a metrics view. Since we're looking at the Java app, we may find some JMX metrics here. So let's take a look. These are all Cassandra-based, so let's do this: let's hop over to Cassandra metrics. And now we can see every single JMX metric that is exposed by the system. The important part here is that the developer didn't have to point our agent at an endpoint for these metrics, and they didn't have to include a local poller in their pod. We automatically see these metrics running through the system calls, and we can capture them. This makes instrumenting Java-based environments insanely easy. We do a similar thing for StatsD metrics, although I don't believe I have any StatsD metrics running in this environment. But this ability to not only get dashboards... ooh, check this out, I do, cool. This is a StatsD metric that we're showing in the same way. The key here is that Sysdig becomes this one place where you can aggregate all of your system metrics and all of your custom metrics, and everyone can access the right data at the right time.

All right, that was a whole lot of information. I'm assuming there are some questions. There always are questions, there are questions. And they're good questions. So let's see, let's do it LIFO. Are you able to follow transactions running through the JVM line by line, as in via code injection? Correct, okay, I got the question, I understand it. So no, we do not provide line-by-line transaction tracing. That is the realm of the New Relics and AppDynamics of the world, and fundamentally, we're a pretty different product. Now, the question probably came up because many of the metrics we provide start to bleed into that territory. But we do something different here. We do not instrument inside your application; we're instrumenting from outside your application. So we can provide you, effectively, an aggregate view of transactions as they take place between containers or between hosts, but we do not see into the individual line-by-line code that you have. What we find is that in many cases we provide enough APM that customers are happy using us. And because of how we're designed, designed to scale across your environment, you could deploy this everywhere in dev and prod, which is something you probably would never do with an APM product. Next question, please.
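Two quick footnotes on that "no pointing, no pollers" claim, sketched with made-up names: a JVM exposing JMX with nothing but the stock flags, and a StatsD counter, which is literally one UDP datagram. Per the capture approach described above, the agent observes the sendto() system call, so nothing even needs to be listening on the port:

```
# a JVM started with stock JMX flags; no Sysdig-specific endpoint config
java -Dcom.sun.management.jmxremote \
     -Dcom.sun.management.jmxremote.port=9010 \
     -Dcom.sun.management.jmxremote.authenticate=false \
     -Dcom.sun.management.jmxremote.ssl=false \
     -jar app.jar

# a StatsD counter emitted from inside any container; metric name is made up
echo "shop.cart.abandoned:1|c" | nc -u -w0 127.0.0.1 8125
```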
All right, in an OpenShift environment, could we just use the node selector to deploy a Sysdig container per node, or is there a better best-practice recommendation? Yeah, we do have best-practice documentation that we've put up on our support site. And in fact, if I hop back here for a moment, this is a slide I skipped over. The URL is down here; it's probably not super readable, but if you search for "Sysdig install OpenShift" on Google, you'll find it. And basically it's a four-step process: you have to modify the security context constraints, install the agent as a Kubernetes DaemonSet (both sketched below), and, well, step four is you're done. So it's pretty easy, it's pretty straightforward.

All right, another one just popped in, and there are a bunch, and we are good. I'm just gonna say right now, I'm fine with hanging out if we run over time. Sometimes the OpenShift and Kubernetes APIs diverge. When replica sets were added to Kubernetes, Sysdig temporarily started ignoring OpenShift replication controllers. How are you managing this going forward? We're trying to be better. Better, more, faster all the time is our goal. Yeah, we're a young company, and we have a big challenge, which is that we do wanna maintain functional parity between OpenShift and Kubernetes. I will also say that OpenShift has become more and more important to us as a company as we've grown. We have great customers who are using OpenShift, and that gives us higher priority to continue to invest in OpenShift. So I apologize for the temporary divergence, and we're gonna do our best to make sure that doesn't happen again.

All right, there's a couple more here. He's saying: nice to see you have a project overview. Does it allow you to get an overall cluster view? For example, a dashboard showing an overall cluster view and an overall nodes view. Yeah, that would be really easy to create with our custom dashboarding capability. We also do have a bunch of overviews built in, and honestly, today you would end up using the Kubernetes pre-built dashboards if you were interested, and customizing from there.

Another one, and you may have answered this: is it possible to use a deployment config instead of a deployment? Christian is asking that. I'm stumped, you got me. Whoever that is got me, nicely done. We'll have to get back to you.

All right: in the topology view, are there plans to allow for repositioning of objects to enhance visibility when zooming in? Ah, awesome question. Yeah, we've got a bunch of plans for improving the topology view. In all honesty, we're doing a bunch of work on this view you see right here, the Explore view. That's our next big project, and the topology improvements will probably come after that.

Okay, there was one: does Sysdig find the bottlenecks in the app, pinpointed down to the line of code? And I think that's the same as the transaction one. Yeah, same answer as before: we don't do that. And the other one, which I think someone answered for him: what was the Kubernetes tool that allows you to make sure it deployed on all the compute node hosts? And somebody answered: a DaemonSet. That's great, cool. All right, move along. Yeah, let's move along. We're close to the end here. I think we'll be roughly on time, and then I can stay late for Q&A. I love answering questions. But we talked about this idea of data security and isolating data so that the right teams get to look at the right information.
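Picking up the sketch promised in that install answer; the project, service account, and file names are illustrative rather than copied from the official doc:

```
# step 1: let the agent's service account run privileged containers
# (project and service-account names are illustrative)
oc adm policy add-scc-to-user privileged -z sysdig-agent -n sysdig

# step 2: deploy the agent as a DaemonSet so one copy lands on every node
oc create -f sysdig-agent-daemonset.yaml -n sysdig
```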
There are actually two benefits to that isolation. One, sure, there's a security benefit: if you've got a sensitive app, you don't want everybody to have access to it, and you may not want everybody to have access to your prod environment. But there's also a troubleshooting benefit. If you get dropped into a huge, complex monitoring tool, and there's a lot of data, and you're trying to firefight and solve a problem... man, it's messy, right? You don't understand what's going on. You don't really understand what the makeup of somebody else's app is and how you see it. You really want to be able to focus down on your data and the relevant data. And that's where we built a functionality that we call teams. Teams gives you service-based access control. Using the same metadata we use to group information into logical components, you can isolate data along those lines as well. So you can see up here, we're in the default team. The default team is essentially the platform view, meaning I can see everything, but I can switch teams. And we've got a couple of teams here; we've created a Java team and a WordPress team.

So let's do this. Let's go into the... actually, before I do this, let's re-establish the baseline. I'm gonna shrink this a little bit. Okay, so you can see that we've got this Java namespace, the Sysdig namespace, and this WordPress demo namespace, okay? And so now let's go into the teams here and switch to the Java team. Give this a moment. The system is ensuring that I can access this team, determining what metadata is relevant to this team, and what's going to happen is we're going to shrink down this environment. Come on, you can do it. We can shrink this down to just the Java team. Ah, you're killing me. The demo gods are telling me my time is up. Oh no. Had to happen, right? We'll give it a quick refresh here and see if it's my wifi connection. There we go, try this again. Okay, so now let's try this again; we want to view this as the Java team. Okay, we'll try it one more time. If we don't have any luck, we'll call it a day... but it looks like it's happening. Great. So now let's do this. We're going to do our regrouping again. We're going to apply this, and we're missing the group information here. What's really going to happen (it's a brand new feature, by the way; it actually goes live later this week), aha, finally, is that now the only thing we can see is the Java application demo. And now, as a Java team user, I only have access to this information. This includes all relevant dashboards and all relevant alerts. So I'm able to scope down your environment based on the same metadata that OpenShift and Kubernetes use, in order to give control and insight to the right people at the right time. This is service-based access control, and it is a critical part of a monitoring product that works properly with your container application platform.

All right, one last thing we want to do here, which is to set alerts. That's about the only thing we haven't shown. You can set alerts from anywhere in the application, whether it's Explore or the dashboards. In fact, we'll hop back to the dashboards here, go back to the default team. Actually, you know what? We'll stick to Explore in the interest of time. We'll do one thing here, which is set up a service-based alert. Honestly, if you had to set alerts on individual containers, you'd probably lose your mind. Instead, the way to think about this is: how do I alert on an aggregate of containers?
So here you can see... I won't go into the whole alerting process, but you can aggregate across any tag or metadata in your system. Here I'm saying: give me an alert across the Java app demo, where the replication controller is Cassandra One. Let me kill this. Now I have an alert across my entire Java app demo environment: one alert that can service that whole environment. I could also segment that alert. So I could create this one alert and have one per replication controller or replica set, one per deployment, one per namespace, and so on and so forth. You can send those notifications anywhere you want. And finally, you can create Sysdig captures. We didn't really have time to go into this, but as I mentioned earlier on, we have the ability to not only capture information about your standard metrics, but also give you the ability to download, to your laptop or to another machine, a system capture file. And these files can be very, very large. This one is a massive file that runs across an entire Kubernetes environment and captures data across an entire namespace. And then, inside the open source Sysdig tool, I get the ability to troubleshoot each and every process, down to its arguments, down to its connections, down to its payloads. This is something that no other monitoring tool can provide, and no other monitoring tool has the ability to give you this view into your container application platform. So you can see that you could spend a ton of time looking at this fantastic data; just in the interest of time, I won't go into that now. That's probably a whole other session for you. We'd love to have you come back and do another whole session again, so this is for you. Thank you.

So, very quickly, in terms of wrap-up. While at Sysdig we exist because we love tech, we also need to make money along the way, so we sell a couple of products, and I wanted to tell you about those. Sysdig Cloud is our hosted offering, where you can go from an individual host all the way up to thousands of hosts, monthly or annual. Sysdig on-premise is the same product, but you take our software and run it in your environment; it's annual only, with a minimum commitment. And Sysdig open source: man, we love troubleshooting Linux systems, and that's what Sysdig open source is all about. sysdig.org, check it out. And we also love free trials. So come to our website and we'll give you a free trial. You can see the core functionality I was talking about, from viewing microservices all the way down to deep container visibility. We've got the goodness for you. I would love to interact with you directly and get you running on a trial. Just come to sysdig.com and we'll get you started.

All right, there are a couple of questions. One, somebody might have answered himself, but let's see. If you have time, can you explain a little more about service-based access control versus RBAC? Yeah, sure, I would love to. And this is really new functionality for us, so I'm happy to clarify. Role-based access control typically happens based on a person, and service-based access control happens on the intersection of a person and an environment. So if I hop over here to settings, you can come over here to teams, and I can show you how you define a team. So if I click on the Java team, see here... okay, so within the Java team, we're doing a few things. We're determining whether the information is scoped based on host or container.
We're determining what that scope looks like, and the scope, this is the important part, is defined by metadata of your system. This is the service: in service-based access control, your service is defined by some scope of metadata. It could be something like dev or prod. It could be something like the Java deployment. It could be down to a replica set that you care about. That's for you to decide. And then finally, you can determine which users are actually allowed inside that team. So you create this team based on a service, and then decide which users have access to it. Hopefully that helped answer the question. I think so; you answered it for me.

We're running slowly out of time, and here's one that's sort of, I think, out of left field: can Sysdig install on and monitor a database such as MySQL running on a VM host on OpenStack? Whoa, that is out of left field. That is left field. It should be able to; that should be perfectly fine. I can't say that I've tried it myself, but there's nothing that would block us from doing that. We'll get a cloud up, get some RDO up there, and have you try it someday.

And there's one more coming in. What about monitoring of Ceph cluster nodes in an OpenShift environment? And secondly, is it possible to get metrics on bandwidth, i.e., which containers are generating the most traffic, et cetera? Sorry, he's just apologizing for what he asked. If you can... I think it's finished. So I missed the last part; let me start by answering the first two. So, what about Ceph? That's a great question. It seems like a softball to me, because we just published a three-part series on monitoring Ceph in OpenShift environments. So go to sysdig.com/blog and it'll be right at the top of the blog roll. We do a killer job monitoring Ceph. What was this... I forgot the second question on it. We'll talk a little bit about cluster nodes as well. So are you gonna write a cluster series? That sounds like a request. I like it. Sure, we will. Not exactly sure when, but we will. In terms of bandwidth, absolutely. I believe it's on our network overview, so you can see the actual consumption of bandwidth. We don't track what your bandwidth limits are, say, or anything like that, but that would be relatively easy to do. Fundamentally, we do have network data. And right now I'm showing this across the replica set here. I could drill down all the way into an individual container, and I could show you CPU... these are CPU usage metrics; let's go back into network here. This will be your network data for that individual container. So easy, easy.

All right, that was easy. There's one more. Are you monitoring all of the compute node host performance also? Yes. It would be great to see the performance of our OpenShift cluster as a whole, and what applications and services are taking the most resources. And I think that's our last question. Yeah, sure. So what I've been showing you have primarily been service-oriented views, but people also expect us to be able to monitor at the physical layer, and we can absolutely do that. So in fact, going all the way back to where we started, I can switch this view to a physical view. And this physical view now tells me the individual hosts that are running this environment and the aggregate utilization of file system, CPU, network, and memory, and you can drill down into individual containers.
So the beauty of Sysdig, and what we described early on, is that you have to be able to tag metrics in many, many different ways, so that as you come in to troubleshoot an environment, you can slice and dice it however you need to. This is what we do. This is a fundamentally different approach to monitoring than other systems provide you from an interface perspective, and it's part of what makes Sysdig so powerful.

Okay, and someone snuck one more in, asking: does Sysdig have an API endpoint? Is it possible to look at the data by accessing an API, getting raw data from the dashboard, for example? Yes, we do have APIs, and they're fully functional; pretty much anything you can do in the app you can do via API. But whenever I hear that, I say, why would you want to go anywhere else? Okay, okay, well, that's probably a good one. And I'm guessing you all will be at KubeCon in Berlin, or someone from Sysdig will be there. We will be, we absolutely will be at KubeCon in Berlin, and we'll also be at Red Hat Summit in May, so we're looking forward to both of those events. Perfect. Well, we're going to be hosting an OpenShift Commons gathering the day before; we always do that at KubeCon, and we would be thrilled to have you guys come. I put up a little free promo code for anyone who's attending this briefing to join us the day before. The beer will be good, and there'll be talks from project leads and folks from all the different upstream projects, from Google and Microsoft, and one of the SIG tables at lunchtime will definitely be monitoring, so we'd love to have you guys come and answer questions there face to face.

So let's see, Knox answered a question. And everybody's saying this is great, and applause, applause. And really, we'll have to rinse and repeat and do this again in another few months, because I'm sure you'll have more features to show off. And Knox put in something that I haven't heard about before, around a security offering that you have as well, called Falco. Yeah, that's a good point. So just a quick note on that, which is that we recently open sourced a new project called Sysdig Falco, and if you go to sysdig.org/falco you'll learn all about it. It's a security monitor for containerized environments. You can easily detect anomalous behavior, like a container spawning a shell or a container accessing sensitive files, and then alert on those behaviors, so that you can troubleshoot and solve problems before they become really big issues.

Awesome. Well, thank you both, Knox, who's been the silent partner in the Q&A, and Apurva. This has just been an awesome session, and we should have the video up for you on Monday. And if you could put up one last slide, could you put an address up for someone to send an email to fire additional questions? I actually didn't, but if you do have any additional questions, you can always hit us up on Twitter at @sysdig, or info at sysdig, and we'll be happy to answer. There, avoiding the spam. All right, well done. All right guys, I will capture all the chat and send it off to you, and we will hopefully be drinking a beer together in Berlin sometime very soon. All right, thanks so much for hosting. We really appreciate it. Take care, guys.
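A footnote on that Falco mention: the canonical early example rule, flagging a shell spawned inside a container, looked roughly like this. The file path and exact syntax follow the early Falco releases and should be treated as a sketch:

```
# append a local rule to Falco's rules file (path and syntax per early releases)
cat >> /etc/falco_rules.yaml <<'EOF'
- rule: shell_in_container
  desc: notice shell activity within a container
  condition: evt.type = execve and container.id != host and proc.name = bash
  output: "shell in a container (user=%user.name container_id=%container.id)"
  priority: WARNING
EOF
```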