So welcome to this breakout session. Was anyone here at the 2020 community day? Has anyone done an unconference before, or a DevOpsDays? Okay, and you've all done open spaces at a DevOpsDays. So that's the intention here, right? This is not me standing here and talking to you. This is everyone chatting with each other, using our lovely flip charts to note things, come to realizations, and discuss what's important. Now, this is also being recorded for posterity. So with that in mind, I have no clue how we're gonna do this, but we have microphones that people can use. I'm just gonna Johnny Appleseed these around, I guess, and people can maybe put that there-ish. Someone wanna be the mic minder for this side, and then someone over here? Yeah, that looks like volunteering to me. So does someone wanna volunteer to take notes and help facilitate? Cool, come on up. You've been appointed, self-appointed. Yeah, come on. Give it up for Libby. Thank you, Libby.

So I think a good way to start something like this: this is kind of two things, right? This is eBPF, this is auto-instrumentation. So maybe just help take some notes from people. Let's go around and brainstorm some topics, jot the topics down, and then people can rank those a little bit and discuss them. Again, it's like a DevOpsDays open space, if you've been to that, right? Yeah, so do our mic handlers wanna make sure they're on, and then people can start talking into the mic? Test. Yeah, okay. I'm gonna check on the other room.

So I'm interested in knowing under what circumstances we could use eBPF instrumentation to replace any language-specific implementations, I guess.
And you know, that might reveal that I don't know a damn thing about eBPF, how about that?

I was just gonna say, I'm behind when it comes to figuring out how eBPF fits into my work. I've read the basic introductions but I haven't had hands-on time. So it'd be kind of cool for somebody who's had hands-on experience to TL;DR, too long didn't read, what eBPF is, for anybody who doesn't know. Is anybody brave enough to say it? I could use a quick refresher on eBPF too.

Okay, enough people here, and you've written it down. Before we keep going, is there anybody who can do maybe two minutes on eBPF to sum it up for people? Henrik and somebody else. Okay. We could spend some time on that, but maybe not now; let's get to the topics.

So the TL;DR of eBPF: historically, this is literally two decades old. It comes initially from automating the extraction and analysis of network data. If you've heard of the Bro project or such from two decades ago, that's where the root of all of this comes from. These days it's the extended Berkeley Packet Filter, and it's still a packet filter, but it is now more generic. It gives you hooks in the Linux kernel, a little bit like callback points, where you can insert more or less arbitrary code which is executed when something happens. The thing this enables is fleet-wide aggregation: hey, how is my thing actually looking, and the thing does not need to be specified up front. So the good part is you can extract data across your whole fleet from whatever you're running, and it is not specific to your workload. That's also a downside. Of course, if you instrument your code by hand, you're really close to what you're actually doing, you have all this context, and you have it in code. So it's really easy to attach proper metadata to whatever you're emitting.
Of course you know what you're doing, where you're doing it, and why you're doing it. Whereas eBPF inherently looks at things more from the outside. Personally, I expect that eBPF will be more of use when you have large fleets, in particular hyperscalers and such. They gain a lot of knowledge just from looking at things from this high level, whereas if you have smaller workloads, arguably instrumenting directly is more valuable. Also, and then I'm going to stop, you have a problem similar to profiling: unless you have your symbols and everything in a format which allows you to walk back to what the code actually looks like, eBPF doesn't know what's happening. And also, you can't trace a specific request. I mean, it's possible, but it's super hard and super wasteful to do with eBPF.

It can, but on a different level; it's on the kernel level. The kernel knows what is being executed and what is being done, but it doesn't know why this path is being taken. The idea is that you define the event where you're gonna run some code, and the kernel will trigger your code, so your code should be very lightweight. Then you can do whatever you want; you can count the number of times a socket has been opened, for example. But keep in mind that you want to get that information out to user space, because you're at the kernel level and you cannot do much there. That's why there are eBPF maps: you take the data you have constructed, store it into an eBPF map, and that eBPF map can be consulted by your observability solution, which can store the information. So the way you build eBPF programs, they should be very lightweight C programs, and you need to know exactly which event is going to trigger what you're going to measure. You can do tons of things with a programming language as well, but again, it's very difficult.
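To make the event-plus-map pattern just described concrete: a minimal sketch in plain Python, purely as an analogy. Real eBPF programs are restricted C, verified and executed by the kernel; here the "probe" is just a lightweight callback, the "map" is a Counter, and the event names are made up for illustration.

```python
from collections import Counter

# Stands in for a BPF hash map keyed by process name (comm).
EVENT_COUNTS = Counter()

def on_socket_open(comm):
    # Probe body: do the bare minimum, just bump a per-process counter.
    # In real eBPF this would be a verified C program attached to a hook.
    EVENT_COUNTS[comm] += 1

def read_map_snapshot():
    # "User space" side: periodically read the map and hand the aggregate
    # to an observability pipeline, instead of emitting per-event data.
    return dict(EVENT_COUNTS)

# Simulated kernel events: three socket opens by two processes.
for comm in ["nginx", "nginx", "curl"]:
    on_socket_open(comm)

print(read_map_snapshot())  # {'nginx': 2, 'curl': 1}
```

The point of the shape is the one made above: the in-kernel side stays tiny and only aggregates, and user space pulls the aggregate out of the map on its own schedule.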
You don't manipulate the context; you just trigger code when an event happens, so you cannot add any annotations, any context, to the data you've just observed.

Dumb question: what does eBPF stand for? Extended Berkeley Packet Filter, or Enhanced? Anybody else have any other definitional additions to that? Well, thank you.

This is Linux kernel specific. This is Linux kernel specific? Are there any ports to any other operating systems? Windows does support it. Excellent. Oh, fully? So just for the recording, it was just stated that Windows also supports eBPF. Microsoft has an eBPF for Windows project, so it's moving into the Windows operating system.

I'm curious if anybody has practical experience where they've used it in production and gotten benefits from it? Pixie? Parca? Pixie Labs is relying on eBPF. Parca is also relying on eBPF. So we have no information about how it scales, but they are running in production for sure.

We're currently experimenting at AWS. I mean, me and some of my friends at AWS. We are able to collect profile data and then send that to CloudWatch and other services through the OpenTelemetry Collector. So we're currently experimenting with that.

No, no, no. eBPF was originally designed for networking; that's the origin of the name. And that's why Cilium, I don't know if you've heard about Cilium, is doing a great job for security and for networking roles, because it's designed for that. So I think eBPF could provide a lot of value for the future of service mesh, where you will be more lightweight, without any proxy sitting in between your containers, your pods.
So I think that's... but for profiling it could also be interesting. Again, it's very difficult to say, so far.

In service mesh, most of the meshes inject a container into your pods that does the proxy job. With eBPF, you don't need that anymore, so you remove one workload from your cluster.

So one thing that I think is an open question, and maybe the people that have been using Pixie or eBPF in prod can speak to it: how are we expected to share context between OTel context and an eBPF context, right? Because these are working at different layers of the stack, and OpenTelemetry itself kind of requires that context to be propagated and everything else. Is there a fundamental mismatch here, where we need to figure out some way to push things down, or, at the lower layers, be able to pull that context out? Is there context that needs to be propagated at the kernel level or the networking-stack level that Pixie, that something above it, can then read, to associate these telemetry types together? Because it seems to me like the full OTel context implementation would be pretty heavyweight to run on the kernel.

Yeah, it's actually not clear to me how you can propagate context in the kernel, because context is fundamentally an application concept, right? So the kernel wouldn't know which span is the child of a different span. But Pixie does read the HTTP headers and figure it out. It's actually like Istio: the Istio sidecar can generate the spans, and in this case eBPF hooks can generate the spans, but the application still needs to handle the context.

Do we want to drill in more on any one of these? Yeah, so following on to that. When you're in the context of code, you are the program you're executing; you're thinking about the moment of execution.
That's usually why we have to layer stuff into our code, right, to propagate the context. For eBPF, it's just about the network, right? And no, it's the kernel as well, right? It's hooks in the kernel. Yes, and it's not just about the network anymore. Initially it comes from the networking space. Yep. At what were horrendous speeds of one gigabit back then, you couldn't do deeper analysis without the combination of hardware network cards which did on-card ASIC analysis, and also hooks in the kernel. Several iterations later, it's extended to basically pretty much everything in the kernel.

Yeah. And I think one of the hard humps to get over, if we are talking about replacing language-specific implementations with eBPF, the hardest thing is dealing with the fact that there are a million ways to write threads, sub-threads, you know what I mean. You would have to be very specific about the process and threading model to be able to propagate any kind of context. And I'm sure you can extract the context that's coming in from layer seven and follow it to a certain degree, but it's really tricky inside: there are a million ways I could structure a program. Would you duplicate the context, would you fork it, across those processes and threads? So I think the easy sledgehammer right now is to expect people to write this stuff in their code, to propagate their own context. But is it impossible? I think we just need to work on it.
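To pin down the layer-seven piece of this: the context that is actually visible on the wire, and that an eBPF probe reading HTTP traffic could plausibly pick up, is the W3C Trace Context `traceparent` header, which OpenTelemetry propagates by default. A minimal parsing sketch in Python; the captured header value below is the example from the W3C spec, used here purely for illustration.

```python
import re

# traceparent = version "-" trace-id "-" parent-id "-" trace-flags
# e.g. "00-<32 hex>-<16 hex>-<2 hex>", per the W3C Trace Context spec.
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(value):
    """Return (trace_id, parent_id, sampled) or None if invalid."""
    m = TRACEPARENT_RE.match(value.strip())
    if not m:
        return None
    trace_id, parent_id = m.group("trace_id"), m.group("parent_id")
    # All-zero trace-id or parent-id is invalid per the spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    sampled = int(m.group("flags"), 16) & 0x01 == 0x01
    return trace_id, parent_id, sampled

# Hypothetical header captured off an HTTP request:
hdr = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(hdr))
```

This is the "certain degree" mentioned above: a probe at this layer can recover the trace and parent span IDs of an in-flight request, but it still cannot see how the application fans that context out across its own threads and processes.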
The only way I was thinking, and maybe I'm completely wrong, is you have to put a small agent close to your app. It will get the events and create the right stuff, because the agent knows the context: it knows which program is running. It doesn't necessarily know which functions, so you have to go through the profiling level, but that's the only way to get the context, because you're getting the event from there, and the agent sits close by, so it knows pretty much what it's supposed to do. That's what I thought initially, but maybe I'm wrong. So in my view you need an agent close to your app that will help you get this context.

And so I had a conversation with a guy named Ryan Perry over at Pyroscope, which is a cool little open source thing that does continuous profiling, and we were talking about the overhead, right? When you are actually tracing calls, not only do you need the symbols, as mentioned, but you also incur a certain amount of overhead just to sample that stuff. I don't think that's an intractable problem, because you can decide what to sample; you don't have to sample everything. Sampling every call in the call stack for everything is impossible, but I think that's the only way we get there, with an agent, to your point. The profile of what's being executed has to have that context, and once you have that context somewhere, as a smaller subset of the total data, then you can correlate it with the layer-seven stuff, I think. Does that make sense to anybody else?

So I haven't looked at this myself, but there was an experiment posted here that uses eBPF to instrument Go binaries at runtime. Has anyone had a chance to look at that? I don't know if they have any feedback on it, because I haven't been following it; someone just put a bug in my ear about it.
I think there's a project called KeyVal or something which is trying to do this Golang auto-instrumentation at the eBPF level, which is similar to what the Prometheus Java libraries allow you to do. In the JVM, you auto-instrument the code; of course, there you have a place where you can do this, and this basically applies the same mechanism. And I think this is super interesting, because in the end, most developers, or at least their managers, won't care too deeply about how this is instrumented. They want the benefits, not the actual work being put in. So having a baseline of auto-instrumentation would be super interesting. I don't know how this could be done specifically. I can imagine that if, for example, you have certain wrapper functions, you can actually determine from the kernel level what is being called in user space, and then, again, you have a symbols file which tells you which is which, and so you can trace back what is happening. In particular at larger scales this is super interesting, because you don't have to extract all this data in all the programs; you do it way more efficiently in the kernel. I think that auto-instrumentation is going to be one of those super important things in the medium term, because it allows you to not spend time on instrumentation. You only spend the time where it really benefits you, and the rest you get more or less for free. It might not be perfect, but you get the baseline.

I actually wanted to, not challenge that point, but flip it around and say, just speaking from experience, vendor hat on: most of our very successful users of LightStep tend to be people that are in Go. And the reason they're successful is because the lack of auto-instrumentation in Go means that you actually have to sit down and think, okay, I'm going to model my system with these traces, and I have to actually go and write that instrumentation.
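On the earlier point about "a symbols file which tells you which is which": symbolization is essentially a sorted address-to-name lookup. A toy sketch in Python, with a made-up symbol table standing in for what you would actually read out of an ELF symtab; the addresses and function names are invented for illustration.

```python
import bisect

# Hypothetical symbol table: (start_address, function_name), sorted by
# address. An eBPF profiler records raw instruction pointers from the
# stack; symbolization maps each pointer back to the enclosing function
# by finding the greatest start address <= ip.
SYMBOLS = [
    (0x1000, "main"),
    (0x1200, "handle_request"),
    (0x1480, "parse_headers"),
    (0x1600, "write_response"),
]
STARTS = [addr for addr, _ in SYMBOLS]

def symbolize(ip):
    i = bisect.bisect_right(STARTS, ip) - 1
    return SYMBOLS[i][1] if i >= 0 else "<unknown>"

print(symbolize(0x1234))  # handle_request
print(symbolize(0x0500))  # <unknown>
```

This is also why stripped binaries are a problem for the eBPF approach, as mentioned earlier: without the symbol table, every sample resolves to the "unknown" bucket and you cannot walk back to what the code looks like.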
The current thinking about observability in the industry is that this is something that helps you create data to model your system, to let you say: here's code that describes what I think my system should be doing and how it should look and act and function. Auto-instrumentation doesn't necessarily go towards that goal, because all it's doing is showing you the shape of what actually is, and it's usually giving you a lot of stuff that you probably don't need at the end of the day, right? Is there, from an observability perspective, a secondary or tertiary way of thinking about this, where we use auto-instrumentation to very rapidly get us set up, to do all the initialization, to do the bootstrapping, to give us that basic level of context propagation at the application level? And then we have stuff like eBPF on demand for security, for profiling use cases, stuff like that, right? I think when we talk about this, especially outside of people that are nerding out about telemetry, it doesn't necessarily translate to the person at the end of the table, who's thinking, oh, you just mean I need to run Pixie, and I don't have to do any of this other investment in telemetry or creating spans or da, da, da, right? So some of this is that we need to be better and more specific about the exact role of continuous profiling in our systems. Have people had conversations about this with customers or coworkers? I was going to throw it at you all.

And anecdotally, I would say that my end users, all they do is complain that they don't have continuous profiling. And they like tracing, but don't love it. And they want me to do everything for them. And they don't want to write any code at all. And they're like, how come I don't see this stuff automatically up here downstream?
So I have the opposite experience, I think, where people want more things freely available to them, free as in they didn't have to write anything extra. In particular, Golang programmers are very finicky with me about telling them, yes, go ahead and use these key-value pairs in the attributes, because they really don't enjoy the ergonomics of the interface for setting attributes. So I'm constantly arm-wrestling people, I guess, me personally.

I think both of you are right, but you're talking about different things. So yes, I absolutely agree, and that matches my experience: our most effective and efficient users and customers are the ones who actually think about what they're doing. But this is not the majority. And also, here in this room, we have a highly self-selected group of people who actually care about this stuff. Whereas the normal person is more on the other side, or their customers or users; they don't care. They just want to have the benefits. And yes, the results won't be as good, most certainly not, but they have something. So I think you're just talking about different user groups. But the thing is, his intended audience is 10x or 100x larger than the people who actually care about this stuff. So on the OpenTelemetry level, it makes sense to care about this larger group of people.

Just wanted to add, with everything that's been discussed about the difficulty for new users of OpenTelemetry: this would add another barrier if it wasn't implemented in a way where they didn't need to know all of these things off the bat.

Continuous profiling. I don't know how relevant that is with eBPF. Because with continuous profiling, you want to see even the private functions that you're calling within your application. You want to see how long a particular piece of code is taking, any exceptions, whether something is using more memory, and stuff like that. I don't know. Maybe I'm just not clear on that.
Because with eBPF, you are basically only capturing the things you're intercepting with a few lines of code when the kernel invokes your code. That is, say, when you're making an HTTP call or some gRPC calls, you want to log that. But with continuous profiling, every single line of code that your application is executing is being monitored. That's what your developers are interested in, because they want to know: tell me which part of my application is causing problems so I can go fix it. Which they don't get otherwise.

So Netflix did that many years ago with flame graphs. With their systems, using eBPF, they're able to get the whole stack going through the kernel, and they build those flame graphs showing your user functions at the top of the flames, going down to the kernel at the bottom, and they can track everything. And then they store a profile for every single timestamp in a huge database, and you can click on one single pixel, on a timeline of your production, and see the flame graph for that actual pixel, which is impressive. And they did it eight or ten years ago.

So for continuous profiling, with my TAG Observability head on, I strongly believe that this is the fourth signal which is emerging. You see this again and again and again: after metrics, logs, traces, and the specific order depends on who the user is, the next logical step after traces is usually continuous profiling. As to eBPF specifically, I think for OpenTelemetry there's a difference in how pressing the need is. If you have a Go application, you don't really need to have this in eBPF, because you have something nice in Go already. If you have a Python application, it's a lot more pressing to have something outside of the process, because you don't have any other place to do this, nothing built in.
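To ground what continuous profiling actually collects, per the flame-graph description above: a sampler periodically snapshots call stacks and aggregates them into folded stacks, the text format flame-graph tools consume. A minimal in-process sketch in Python; real continuous profilers, eBPF-based ones included, sample from outside the process without its cooperation, so this is only the shape of the idea, and the workload functions are invented.

```python
import sys
import threading
import time
from collections import Counter

def profile_call(fn, interval=0.01):
    """Run fn() while a background thread samples this thread's stack.

    Returns a Counter of folded stacks ("outer;inner" -> sample count).
    Toy in-process version of what a sampling profiler does.
    """
    counts = Counter()
    stop = threading.Event()
    target = threading.get_ident()

    def sampler():
        while not stop.is_set():
            frame = sys._current_frames().get(target)
            stack = []
            while frame is not None:
                stack.append(frame.f_code.co_name)
                frame = frame.f_back
            if stack:
                # Root-first, semicolon-joined: flame-graph input format.
                counts[";".join(reversed(stack))] += 1
            time.sleep(interval)

    t = threading.Thread(target=sampler)
    t.start()
    try:
        fn()
    finally:
        stop.set()
        t.join()
    return counts

# Hypothetical workload: outer() spends all its time inside inner().
def inner():
    time.sleep(0.3)

def outer():
    inner()

for stack, n in profile_call(outer).items():
    print(stack, n)
```

Note how this catches the private `inner` function without any instrumentation in the workload itself, which is exactly the complaint-resolver described above, and also where the sampling overhead discussion comes from: the interval is the knob trading fidelity against cost.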
Java sits in the middle; there you have a different place, you can use the JVM for all of those things. But in the generic case, I think yes, absolutely, continuous profiling is one of the next things for OpenTelemetry.

Is that synonymous? There were two mentions of profiling, one in, I think, Alolita's talk towards the end. Is profiling synonymous with the continuous profiling topic that we're talking about? Okay, so these are the same things. Good, words matter, okay? Yes.

And also, with both Alolita's and my TAG chair hats on, sorry, I'm still half asleep, jet-lagged: again, this is something which we see as one of the next things which we as the observability community need to engage with and actually solve in a more generic way, establish standards, open formats, blah, blah, blah.

Yeah, I imagine we're gonna get into a lot of language-specific, runtime-specific translation. Every language has its own different symbols, and it's gonna essentially be the same kind of situation as OpenTelemetry with different language bindings and such, but for profiling. That's part of the premise of eBPF: you don't have this. You get a reduced feature set, you reduce your visibility at least, depending on how it's specifically implemented, but you get this for free across everything you're running. Doesn't matter what language. Run it in Fortran, doesn't matter, eBPF will still be able to extract meaningful data from it. See, now I really have to go home and play.

I also wanna point out, and this hasn't been realized yet, but there's a huge opportunity for brownfield monitoring scenarios and eBPF where, like you said, Fortran, right? There are immense observability programs at very, very large companies that have significant legacy technology stacks that they want to integrate.
If you're a bank, a big bank, then yeah, you've got 20 billion mainframes still running every single request that goes through your system at some level, so you need a way to connect and view both the new stuff and the old stuff. Or there's a desire for it, at least, if maybe not a need. For mainframes you don't have eBPF, but... You've got it at the part that's actually connecting. Yes.

So the thing about brownfield is correct, and it's also about closed source. That's one of the things, with my Grafana head on, which I see a lot. People who want to use eBPF have massive installations of things they do not have the source code for. And even if they have it, they have a binary which has been certified by a state agency, so they can't just recompile it or anything. They have to work with what they have. They cannot re-instrument it.

So to that point, and this isn't really an auto-instrumentation question, but maybe it is: as I mentioned in my presentation earlier, the most popular repos in terms of activity are Java contrib and collector contrib. So there's obviously quite a bit of energy going towards zero-code-change instrumentation. Do we think that there's a real opportunity? I mean, I'm gonna answer my own question: it's never going to be anything other than that, because like you said, there are a lot of people that have binaries, that have programs they can't make source changes to, or they're on old versions of things they're never gonna update. Even if you got to the point where every single framework had OpenTelemetry built in, that doesn't matter, because people are gonna be running five or ten year old versions of Spring Framework and they're gonna need that auto-instrumentation plugin to work. So what can we do to make it easier for those auto-instrumentation libraries to get created and maintained?
I think there's a problem right now with the breadth of it, right? Is there something that we can be doing as maintainers, or as people that are involved, to ease the maintenance burden of auto-instrumentation? Discuss amongst ourselves.

Well, I have a mic, so, just before we move on to the maintenance problem: Henrik and I probably know from the performance and reliability space, and then moving on to the SLO space, that there are always gonna be people who say, just do it for me, right? Just make it easy, magic-box this for me, I don't wanna know. The tension is what you need to know in order to make that happen; that's what we need to solve for. How little you need to know about the thing, but also, we have to require people to provide the context from their human brain into it, right? And so the next layer on that, what I've found works really well with people who say, I just want you to tell me what my performance should be, I just want you to tell me what my SLO value should be, is: okay, give them a little bit. Like you're saying, fine, auto-instrumentation, just to get them to the point where they see a little bit of value out of it, and then they're like, I wish it could do this. And then you can say, what did I freaking tell you? You have to provide some context to get that information out, right? You have to wet their whistle. You almost have to stage the question in their mind: why is this not good enough? And then they start to own that problem a little bit more. That's the idealistic way of saying it; it's hard to get that going, but I think not trying to solve the 100% for them is actually the way forward, right? Solve it just enough that they start going, I need more. Good, you need more; you need to do this thing, you know? So anyway, going back to what Austin was saying.
Just one thing related to the example of the Fortran code, legacy code: eBPF requires a certain kernel version. So even if you have auto-instrumentation via eBPF, maybe that company running legacy code won't be able to use it, because it doesn't even run on the right kernel version.

Yeah, I mean, there's usually some sort of connector though, right? Something a little more modern that's actually managing the RPCs, that's proxying your RPCs from whatever your fancy new stuff is into your less fancy old stuff. So it's not a flawless thing. And taking off my whatever hat and putting on my Austin's-personal-opinion hat: I think a lot of the driver of this is really almost sales driven, in a way, from the vendor side, where we're really just trying to find something that reduces time to value, right? And eBPF seems like the ultimate "oh, you don't have to do anything to your code, just run this and boom, you get all this data." Maybe that's cynical of me to suggest; I do think there are actually really good applications of it, but it's hard not to view it somewhat cynically. But that's just an aside.

I absolutely agree that there's a huge hype around eBPF, and not all the promises will be realized, as is usual with any hype. I mean, we are sitting in a room talking about how to use eBPF with OpenTelemetry, and the majority didn't have a firm understanding of what eBPF is. Which is not meant negatively in any way, but as a statement of fact: the hype is what has driven people into this room, largely, not that they already have it in use and already derive value. And not all of this hype will be realized. But I do think we should have at least this auto-instrumentation thing for all your old stuff. And yes, giving incentives: it's not about telling people, it's about showing actual value, and then they want it. Just telling them is usually not a good way.
I think that's the main premise of all of this. I'm just saying the point there is, there's only so much that you can build for people, but your real goal is to compel them to change, to act differently, right? Yes. And so it's like Eric Prigler says, right? If you ship a report to somebody and they don't do anything different, you have failed. So even if we provide the very best, it's got to be just enough that they're now able, and in a mental place, to ask the right question, you know?

Not all of them. I fully agree that the majority doesn't care about all of this. From their angle, this is just a service, or infrastructure. Water comes out of the wall, you don't care how; electricity comes out of the wall, you don't care how. And I think there's a good argument to be made for observability to also come out of the wall, at least to some extent, as a baseline. Part of the problem is, with all the cloud native and horizontal scalability, blah, blah, blah, the fundamental workloads are basically the same. They've been the same for the last 50 or 100 years. The service delineations, which part of containment we put what type and scope of service into, how to do the orchestration, who does the actual orchestration, what the scheduler is: all of those things have been broken up and rearranged and rejiggled. So a lot of the things we used to have with classic boxes and classic servers have gone away. And a lot of this "it's just there and it works and it just does the right thing for you as a decent baseline" needs to be rebuilt. And that's where I see much of the value.

So one thing I wanna clarify: the experience I was describing is more about the work required to collect the information.
And I think analyzing that information, analyzing the data, the traces, the metrics, that's partially on the vendors, who implement solutions that help you generate insights out of it. And part of that is a change in attitude, right? Building an SRE mindset, because I'm not gonna be the one that can describe what your burn rate should be and what your SLO targets are; that's up to you and your product manager.

But I wanna go back to the big burning question for me: how can we leverage eBPF in auto-instrumentation? What are the constraints that prevent us from doing that? And what problems will we run into? For me, in the Ruby case, we are rewriting code and constantly dealing with changes in libraries that are getting pulled in, and we don't have that many maintainers to deal with that. And from my perspective, the virtual machine isn't very amenable to you going in there and making changes so that you could add instrumentation at the virtual-machine level. And then we have different targets, right? We have JRuby and CRuby, and who knows what other Rubies are gonna come out in the future. Those are all maintenance problems for our team, which is very small. And I'm trying to figure out a way to provide value, to make instrumentation data collection easier, and minimize the code that we have to maintain. And so part of this is my ignorance again about the promises of eBPF, what we can do and what we can't. And the other part of it is, again, just me trying to take workload off myself personally and my teammates. So I don't know, I ask again: has anybody experimented with this, trying to use eBPF for auto-instrumentation of a language? Does this make no sense at all to do? Should we not even try? I don't know.
One thing that we are currently exploring on the community demo is popping Pixie in there, since they started OTLP export, and trying to figure out: given what we have now, is there a way we can actually make correlatable data between OTel at the app layer and eBPF at the networking layer, right? And when we figure out that experimentation, you can go find Carter and ask him what he thinks about it. But I think I'm the only one that's actually opened an issue about this so far. But yeah, honestly, I think the only way people are gonna understand it is if they have something that looks and feels like a cloud native app that they can put their hands on and say, oh, this is what eBPF looks like in production, right? So here's a sample app where you can compare and contrast: here's what I get out of OTel with traces and metrics and logs, and here's what I'm getting out of eBPF, and it's coming out as a trace. Yes, but these are tracing fundamentally different things at different layers of the stack. Is that data useful from an analysis perspective? Ah, that's for our friends at New Relic and LightStep and Grafana and everyone else to figure out.

Anyway, that said, it is half past. So if people wanna switch sessions, or if you wanna stay here and keep talking about this, cool. But I wanna give people the opportunity to rotate between the breakouts if you like. Great discussion, by the way. So if you wanna stay and continue adding to the board. It probably makes sense to reiterate what the other sessions are. Yeah, the other session is intermediate tracing and context propagation. Is that it? Signal correlation, yes. I believe you were the one who suggested it, so. Okay, good. [crosstalk]