From theCUBE Studios in Palo Alto and Boston, connecting with thought leaders all around the world, this is a CUBE Conversation. Hello everyone, welcome to this CUBE Conversation. I'm John Furrier, host of theCUBE. We're doing a content series called Leading with Observability, and this segment is Network Observability for Distributed Services, and we have CUBE alumni Mike Cohen, Head of Product Management for Network Monitoring at Splunk. Mike, great to see you, it's been a while. Going back to the OpenStack days, Red Hat Summit, now here talking about observability with Splunk. Great to see you. Thanks a lot for having me. So right now observability is at the center of all the conversations, from monitoring and investing in infrastructure on premises and in the cloud to cybersecurity. A lot of conversations, a lot of broad-reaching implications of observability. You're the Head of Product Management for Network Observability at Splunk. This is where the conversation is going, getting down into the network layer, getting down into how the packets move around. This is becoming important. Why is this the trend, what's the situation? Yeah, so we're seeing a couple of different trends that are really driving how people think about observability, right? One of them is this huge migration towards public cloud architecture, where you're running on infrastructure that you don't own yourself. The other one is around how people are rebuilding and refactoring applications around service-based architectures, scale-out models, cloud-native paradigms. And both of these things are really introducing a lot of new complexity into the applications and really increasing the surface area where problems can occur. And what this means is when you actually have gaps in visibility, or places where you have separate tools analyzing parts of your system, it makes it very hard to debug when things go wrong and figure out where problems occur.
And really what we've seen is that people need an integrated solution to observability, one that can really span from what your user is seeing all the way down to the deepest back-end services and the core infrastructure that you're operating, so that you can really figure out where problems occur. And network visibility is playing a critical role in filling in one of those critical gaps. You know, you think about the past decade we've been on this wave, and it feels like now more than ever it's an inflection point because of how powerful cloud native has become from a value standpoint: value creation, time to market, all the reasons people are investing in modern applications. But then as you build out your architecture and your infrastructure to make that happen, there's more things happening. Everything as a service creates new dependencies, new things to document. This is an opportunity on one hand, and on the other hand it's a technical challenge. So you're balancing technical debt against deploying new stuff, and you've got to monitor it all, right? Monitoring has turned into observability, which is just a code word for cloud-scale monitoring, I guess. I mean, is that how you see it? How do you talk about this? Because it's certainly a major shift happening right now and this transition is pretty obvious. Yeah, no, absolutely. And we've seen a lot of new interest in the network visibility, network monitoring space. And again, the driver of that is that network infrastructure is actually becoming increasingly opaque as you move towards a public cloud environment. And it's been sort of a fun meme to blame the network and say, look, oh, it's the network, we don't know what's going on. But it's not always the network. Sometimes it is, sometimes it isn't.
You actually need to understand where these problems are really occurring to have the right level of visibility into your systems. But the other way we've started talking to people about this is to think of the network as an empowering capability, an untapped resource from which you can actually get new data about your distributed systems. SREs are struggling to understand these complex environments, but with the capabilities we've started taking advantage of, things like eBPF and monitoring from the OS, we can actually get visibility into how processes and containers communicate, and that can give us insights into our systems. It's a new source of data that has not existed in the past and is now available to help us with the broader observability problem. You mentioned SREs, site reliability engineers as they're known. Google kind of pioneered this, and it's become kind of a standard persona in large-scale infrastructure and cloud environments, massive scale. Are you seeing that SRE role become more mainstream in enterprises? I mean, some enterprises might not have an SRE model or a cloud architecture. Can you help us tie that together? Because it is certainly happening. Is it becoming proliferated? For sure, absolutely. The title may vary across organizations, as you point out, and sometimes the exact layout of the organizational breakdown varies, but this role of someone who really cares about keeping the system up and caring for it and scaling it out and thinking about its architecture is now a really critical role, and sometimes that role sits alongside the developers who are writing the code. And this is really happening in almost every organization that we're dealing with today. And it is becoming mainstream. Yeah, it's interesting.
I'm going to ask you a question about what businesses are missing when they think about observability, but since you brought up that piece: it's almost as if Kubernetes created this kind of demarcation line between the top half and the bottom half of the stack, where you can do a lot of engineering underneath the stack up to, say, Kubernetes, and above that you could just be an infrastructure-as-code application developer. So it's almost kind of leveled out with nice lanes there. I'm oversimplifying it, but how do you react to that? Do you see that evolving too? Because it all seems cleaner now. It's like you're engineering below Kubernetes or above it. Well, absolutely. It's definitely one of the places you see the deepest engagement. As folks go towards Kubernetes, they start embracing containers, they start building microservices, and you'll see development teams really accelerate the pace of innovation that they have in their environment. And that's really the driver behind this. So we do see that rebuilding and refactoring as some of the biggest drivers behind these initiatives. What are businesses missing around observability? Because it seems to be, first of all, a very overfunded segment, with a lot of new startups coming in, a lot of security vendors in it, and network folks moving in. It's almost becoming a fabric, a feature piece of things. What does it mean to businesses? What are businesses missing or getting? How are people evaluating observability? How do you see that? Yeah, so for sure, I'll initially talk generically about it, and then I'll talk a little bit about the network area specifically, right? I think one of the things people are realizing they need in observability is an approach that's an integrated suite.
Having a disparate set of tools can make it very hard for SREs to actually take advantage of all those tools and use the data within them to solve meaningful problems. And I think what we're seeing, as we've been talking to more people in the industry, is that they really want something that can bring all that data together and turn it into an insight that can help them solve a problem more quickly, right? So that's the broader context of what's going on. And I think that's driving some of the work we're doing on the network side, because the network is this powerful new dataset that we can combine with other aspects of what people have already been doing in observability. What do you think about programmability? That's been a big topic, but you've started to get into that kind of mindset. You're almost making the software-defined aspect come in here heavily. How does that play in? What's your vision around making the network adaptable, programmable, measurable, fully surveilled? Yeah, so I think what we're focused on is the capabilities you can get by using the network as a means of visibility and observability for systems. Networks are becoming highly flexible. Once people get into a cloud environment, they have a very rich set of networking capabilities, but what they want to be able to do is use that as a way of getting visibility into the system. So to talk for a minute or two about some of the capabilities we're exposing in network observability: one of them is just being able to visualize and optimize a service architecture, really seeing what's connecting to what automatically. We've been using a technology called eBPF, the extended Berkeley Packet Filter, which is part of everyone's Linux operating system, right? If you're running Linux, you basically have this already.
And it gives you an interesting touch point to observe the behavior of every process and container automatically, and you can actually see with very low overhead what they're doing and correlate that with data from systems like Kubernetes to understand how distributed systems behave, to see how things connect to other things. We can use this to build a complete service map of the system in seconds, automatically, without developers having to do any additional work and without forcing anyone to change their code. They can get visibility across an entire system automatically. That's like the original value proposition of Splunk when it came out; it was just a great tool for Splunking the data from logs. Now, as data becomes more complex, you're still instrumenting, and these are critical services, and they're now microservices. There are trends at the top of the stack and at the network layer. The network layer has always been a hard nut to crack. I've got to ask you, why now? You mentioned earlier everyone used to blame the network, oh, it's not my problem. You really can't finger-point when you start getting into full instrumentation of the traffic patterns and the underlying processes. So there seems to be good magic going on here. What's the core issue? What's going on here? Why is the time now? Yeah, so, unreliable networks, slow networks, DNS problems: these have always been present in systems. The problem is they're actually becoming exacerbated because people have less visibility into them. But also, as you have these distributed systems, the failure modes are getting more complex. Some of the longest, most challenging troubleshooting problems are these network issues, which tend to be transient, which tend to bounce around the systems. They tend to cause other, unrelated alerts to happen inside your application stack, where multiple teams end up troubleshooting the wrong problems, problems that don't really exist.
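The service-map idea Mike describes can be sketched in a few lines. This is a hypothetical illustration, not Splunk's actual implementation: assume an eBPF-based agent has already emitted connection events tagged with Kubernetes service names (the event fields here are made up for the example), and aggregate them into edges of a service map.

```python
from collections import defaultdict

# Hypothetical connection events, as an eBPF-based agent might emit them
# after correlating socket activity with Kubernetes metadata.
# Field names are illustrative assumptions, not Splunk's real schema.
events = [
    {"src_service": "frontend", "dst_service": "cart", "bytes": 1200},
    {"src_service": "frontend", "dst_service": "cart", "bytes": 800},
    {"src_service": "cart", "dst_service": "redis", "bytes": 300},
]

def build_service_map(events):
    """Aggregate per-connection events into a service dependency map:
    edge (src, dst) -> {connection count, total bytes observed}."""
    edges = defaultdict(lambda: {"connections": 0, "bytes": 0})
    for e in events:
        edge = edges[(e["src_service"], e["dst_service"])]
        edge["connections"] += 1
        edge["bytes"] += e["bytes"]
    return dict(edges)

service_map = build_service_map(events)
print(service_map[("frontend", "cart")])  # {'connections': 2, 'bytes': 2000}
```

The point of the sketch is that no application code changes are needed: the map falls out of connection data the OS already sees, which matches the "no additional developer work" claim in the conversation.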
So the network has actually caused some of the most painful outages that teams see. And when these outages happen, what you really need to be able to know is: is it truly a network problem, or is it something in another part of my system? If I'm running a distributed service, which services are affected? Because that's the language my team thinks in now. As you mentioned, they're now in Kubernetes, so they're trying to figure out which Kubernetes services are actually affected by a potential network outage that I'm worried about. The other aspect is figuring out the scope of the impact. Are there a couple of instances in my cloud provider that aren't doing well? Is an entire availability zone having problems? Is a whole region of the world the issue? Understanding the scope of the problem will actually help me as an SRE decide what the right mitigation is. And by limiting it as much as possible, it can actually help me better hit my SLA, because I won't have to hit something with a huge hammer when a really small one might solve the problem. Yeah, this is one of the things that comes up, and just hearing you talk, I'm seeing how it could be complex for the customer, just documenting the dependencies. As services come online, it's going to be very dynamic, not just at the network level but at the application level. We mentioned Kubernetes, and you've got service meshes and microservices. You're going to start to see the need to be tracking all this stuff. And that's a big part of what's going on with your suite right now, the ability to help there. How are you guys helping people do that? Yeah, absolutely. So understanding the dependencies is one of the key aspects of these distributed systems. This began as a simple problem. You had a monolithic application that ran on one machine, and you understood its behavior.
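The scope question Mike raises, a few instances versus a whole availability zone versus a region, can be framed as a simple roll-up over failing endpoints tagged with their cloud placement. This is a minimal sketch under assumed field names (`az`, `region`), not any vendor's actual logic:

```python
def classify_scope(failing, fleet):
    """Given failing instances and the full fleet, each tagged with
    a region and availability zone, estimate the blast radius."""
    def count_by(key, items):
        counts = {}
        for i in items:
            counts[i[key]] = counts.get(i[key], 0) + 1
        return counts

    fail_az, all_az = count_by("az", failing), count_by("az", fleet)
    fail_region, all_region = count_by("region", failing), count_by("region", fleet)

    # If every instance in some region is failing, treat it as region-wide;
    # otherwise check for a fully failing AZ; otherwise it's instance-level.
    if any(fail_region.get(r, 0) == n for r, n in all_region.items()):
        return "region"
    if any(fail_az.get(z, 0) == n for z, n in all_az.items()):
        return "availability-zone"
    return "instance"

fleet = [
    {"id": "i-1", "az": "us-east-1a", "region": "us-east-1"},
    {"id": "i-2", "az": "us-east-1a", "region": "us-east-1"},
    {"id": "i-3", "az": "us-east-1b", "region": "us-east-1"},
]
failing = [fleet[0], fleet[1]]  # all of us-east-1a is down
print(classify_scope(failing, fleet))  # availability-zone
```

The payoff is the mitigation decision described in the interview: an "instance" verdict might mean replacing a node, while "availability-zone" or "region" suggests failing traffic over, so the smallest-scope answer lets you use the smallest hammer.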
Once you start moving towards microservices, it's very easy for that to change from, look, we have a handful of microservices, to hundreds, to a thousand. And they can be running across thousands or tens of thousands of machines as you get bigger. And understanding that environment can become a major challenge. Teams will often end up with a handwritten diagram that has the behavior of their services broken out, or they'll find out that there's an interaction they didn't expect to happen, and that may be the source of an issue. So one of the capabilities we have, using network monitoring out of the operating system with eBPF, is that we can actually automatically discover every connection that's made. If you're able to watch the sockets that are created in Linux, you can actually see how containers interact with each other, and you can use that to build an automatic service dependency diagram. So without the user having to change their code or change anything about their system, you can automatically discover those dependencies, and you'll find things you didn't expect, you'll find things that changed over time that weren't well documented. And that's the critical level of understanding you need to get to in this environment. Yeah, you know, it's interesting, you mention things you might have missed in the past. People had that kind of blind spot at the network, either because they weren't tracking it or they used a different network tool. I mean, just packet loss by itself is one signal and host health is another. And if you can track everything, then you've got to build it in. So I love this direction. My question really is more of, okay, how do you operationalize it? Okay, I'm an operator, am I getting alerts? Do I just auto-discover? How does this all work from a usability standpoint? What are the key features? What gets unlocked from that kind of instrumentation?
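The "things that changed over time that weren't well documented" point suggests a concrete follow-on use of the auto-discovered dependencies: diff two snapshots of the edge set to surface drift. A minimal sketch, assuming each snapshot is just a set of (source, destination) service pairs:

```python
def diff_dependencies(previous, current):
    """Compare two snapshots of discovered service dependencies
    (sets of (source, destination) edges) and report drift."""
    return {
        "added": sorted(current - previous),    # new, possibly undocumented edges
        "removed": sorted(previous - current),  # edges that disappeared
    }

last_week = {("frontend", "cart"), ("cart", "redis")}
today = {("frontend", "cart"), ("cart", "redis"), ("cart", "payments")}
print(diff_dependencies(last_week, today))
# {'added': [('cart', 'payments')], 'removed': []}
```

Run on a schedule, a diff like this is one simple way an operator could be alerted to an interaction nobody drew on the handwritten diagram.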
Yeah, well, again, when you do this instrumentation correctly, it can really be automatic, right? You can actually put an agent in one of your instances, collecting data based on the traffic and the interactions that occur, without you having to take any action. That's really the holy grail, and that's where some of the best value of these systems emerges: it just works out of the box, and then you can pull data from other systems, like your cloud provider and your Kubernetes environment, and use that to build a picture of what's going on. That's really where these systems get super valuable: they just work without you having to do a ton of work behind the scenes. So Mike, I've got to ask you a final question. Explain the distributed services aspect of observability. What should people walk away with from a main-concept standpoint, and how does it apply to their environment? What should they be thinking about? What is it, and what's the real story there? Yeah, so I think the way we're thinking about this is: how can you turn the network from a liability into a strength in these distributed environments? By observing data at the network level and out of the operating system, you can actually use it to automatically construct service maps, to learn about your system, and to improve the insight and understanding you have of your complex systems. You can identify network problems that are occurring. You can understand how you're utilizing aspects of the network. You can drive things like cost optimization in your environment. So you can actually get better insight, be able to troubleshoot problems better, and handle the blame game of: is the network really the problem I'm seeing, or is it occurring somewhere else in my application?
And that's really critical in these complex distributed environments. And critically, you can do it in a way that doesn't actually add overhead to your development team. You don't have to change the code. You don't have to take on a complex engineering task. You can actually deploy agents that will collect this data automatically. Awesome. Take that complexity away and automate how people get the job done. Great, great stuff. Mike, thanks for coming on theCUBE. Leading with Observability, I'm John Furrier with theCUBE. Thanks for watching. Yeah, thanks a lot.