Thank you very much. Thanks, everybody, for being here; I appreciate it, this late in the afternoon, just before the reception and the drinks and all that good stuff. After a long day of talking about eBPF, it's time to talk about actually putting it into production. I love end-user panels, because I think everybody can always learn something from them. eBPF is a little bit different, I think, because most of you probably don't consume eBPF directly. Google does, maybe. Most of you probably don't, right? So it's kind of an interesting view of the ecosystem that I think is different from others.

But before we get started: I'm Frederic Clarke, the enterprise editor at TechCrunch. Maybe the rest of you can briefly introduce yourselves, because I don't think we have a slide up with names.

Yeah, hi, I'm James McShane. I'm the engineering director at SuperOrbital, a small consulting firm.

Hi, I'm Purvi. I'm a director of engineering at Google Cloud, focused on networking for Kubernetes.

Hello, I'm Daniel Bernier, a technical director at Bell, focusing on our telco transformation towards cloud native principles.

Hi, I'm Andrew Sauber. I'm a staff software engineer at The New York Times, working on our cloud platform team.

Awesome. Maybe to get started, a softball for all of you: how are you using eBPF at this point? We'll just go down the line for this one.

Yeah, thank you. We have a number of clients that work directly with Cilium specifically, but we've also started to extend that with some of the observability capabilities as well. So we've been interfacing very closely with it from a network security standpoint, and then extending that into the observability realm over the last year.

I have to answer this question from two perspectives. One is wearing the Google hat: from a Google perspective, we have various groups using eBPF in a wide range of applications.
It starts from networking, compute, security, and telemetry, of course, which is the key. And we have been at it for years now. I would say we have installed eBPF programs on millions of machines, and almost all Google traffic touches, or is touched by, an eBPF program in one form or another. So that is the one part of Google. And actually, if you look at the core eBPF contributors, a lot of them are ours, around 25% of the folks, and similarly we participate in the steering committee too.

Now the second angle, which is more the Kubernetes engine, GKE, and Anthos and GDC. There, we started the journey of using an eBPF-based data plane around early 2019. We looked at the whole ecosystem and we picked Cilium, because Cilium was a vibrant ecosystem with a lot of the feature sets that we needed, and also, on top of it, something that we could contribute to. So we have taken that into our main products, GKE, Anthos, GDC, GDC Edge, and all, and it is in production for all of them. So in today's state, the GKE fleet is rapidly moving to this V2 data plane, and all the others are already on it.

Daniel, how many million cores?

Divided by X. So, I know we don't have the same scale, so you have to translate it into eBPF terms. We are consumers of Cilium and the full suite of observability on top of it; we've started now with Tetragon, everything, as a consumer. We also consume, as a consumer, Dataplane V2 in the Google environments, when we use Google's version of Kubernetes. That's more the production-oriented side.
We also have our development side, because we do some P4 coding, so we're also looking at programmable data planes using the L3AF effort that came out of Walmart, the SDK for eBPF. So we have that, which is more exploratory, to get our foot in the door.

Yeah, so I joined the Times just over a year ago, and I was asked to help lead the team to build kind of the next-generation Kubernetes platform. When I joined, and I can't give exact numbers, there were many, many, many Kubernetes clusters in use at the Times, and they all kind of had their network policies delegated to the cloud provider on which they were running, because there wasn't much strong isolation needed within those clusters. Now we want to do this huge multi-tenancy thing, which we are doing, and a lot of production traffic is already flowing through it, and we needed fast network policies, and Cilium is just the obvious choice if you want fast network policies.

Nothing like re-architecting your whole architecture so people can play Wordle.

I mean, yes, people can play Wordle.

Now, a lot of you are using Cilium, obviously, right? Can you talk a little bit about why you're using it, why you made that choice? Because there are other options, obviously.

Yeah, definitely. We went through an evaluation process for the main product we were working on through our firm. It was a product that allowed for a really dynamic set of executions, within a financial services firm. And so when you're allowing data from a number of different sources to flow through your production system, you really need to be able to understand kind of everything about the network path and network execution, right? Where are you reaching out to? How large is the data coming into the system?
And how large is the data going out of the system? And then you get that extra context from the Kubernetes API. So we needed the L7 filtering that Cilium provided, and then they made it really easy to pull in the context around that, and to enable the appropriate level of experimentation that the product required, while still locking it down from a security perspective. So we started from the network security field in terms of evaluating the product, and then what eBPF allowed us to do was extend that into the execution-level environment. We've heard all about the capabilities of eBPF filtering and things like that today. So we started on the network security path and then went into kind of host security, gathering the information we needed to make decisions about whether users should be allowed to do things.

How much of that, and maybe some of you can answer the same question, but how much of that felt like an early bet on a technology? Because it's still early days for eBPF, right?

We had some hiccups along the way, but those were more about some of the complexities around the layer 7 filtering implementation within Cilium, which ends up delegating to Envoy on the host. The eBPF portion of it has been rock solid for our use case, and in terms of the amount of data that we're piping through the platform, the performance has been spectacular.

Maybe I can come in. In the world of Kubernetes, almost everything is new technology. I've lived through OPNFV with OVS and all those things, so I have some scars from OpenStack. Yeah, I know, yeah, but the networking parts were ugly. When we looked at this from a technology perspective, it's, I would say, the backing. Early on, when I started to work on the networking side of clusters and Kubernetes, I was a really big fan.
I'm still a fan of, for example, Vector Packet Processing, VPP. But if we think eBPF is kind of arcane, the VPP coding is even worse. You realize that there's only a handful of people who actually know how to code that, while with eBPF everybody is actually involved. It's in the kernel; it's already part of the mainstream. So there might be some paradigm shifts, and there are paradigm shifts, to this, but you feel a bit more secure, and that's one of the drivers that pushed us towards the technology.

Yeah, I think your question was: why Cilium? And let me answer it, because it's interesting; Daniel covered this very important topic about VPP and other data planes. We had to really look into newer data planes, and we needed a new data plane because we need flexibility, we need feature velocity, we need composability, and we need a lot of strength from the data plane, where we can program it with ease. So we evaluated multiple data planes, and imagine, we are at Google, so we have a lot of technologies of our own within Google, and we had actually looked into VPP as well. Overall, when we evaluated all of this, one thing was very clear to us: we needed to have eBPF. The reason why is also very interesting: we wanted to be close to applications. It is important to have the application context available. That was one. Second, the ease of programming that Daniel just mentioned.
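On that "ease of programming" point, a minimal sketch may help. The following is not Cilium's code, and it is plain userspace C rather than actual kernel eBPF; it only mimics the bounds-checked parsing pattern that every XDP program has to follow before the kernel's BPF verifier will load it, with verdict constants mirroring the real XDP_PASS / XDP_DROP values.

```c
/* Illustrative sketch only, NOT Cilium code: userspace C that mimics
 * the bounds-checked parsing an XDP program performs. */
#include <arpa/inet.h> /* ntohs */
#include <assert.h>
#include <stdint.h>

enum { XDP_DROP = 1, XDP_PASS = 2 }; /* same values as <linux/bpf.h> */

#define ETH_P_IP 0x0800 /* IPv4 ethertype */

struct eth_header {
    uint8_t  dst[6];
    uint8_t  src[6];
    uint16_t proto; /* network byte order */
};

/* The shape of an XDP filter: bounds-check, inspect a header, return
 * a verdict. A real program would derive data/data_end from the
 * struct xdp_md context the kernel passes in. */
int verdict(const uint8_t *data, const uint8_t *data_end)
{
    const struct eth_header *eth = (const struct eth_header *)data;

    /* The verifier rejects any program that could read past data_end,
     * so every access is preceded by a check like this one. */
    if (data + sizeof(*eth) > data_end)
        return XDP_DROP;

    /* Toy policy: pass IPv4 frames, drop everything else. */
    return ntohs(eth->proto) == ETH_P_IP ? XDP_PASS : XDP_DROP;
}
```

A real version of this logic would be compiled with clang for the BPF target and attached by a loader such as libbpf; projects like Cilium generate and manage such programs for you, which is exactly the accessibility the panelists are describing.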
It really is super easy at this point to build on top of it. Now the question was: why Cilium? And for us, again, the answer was rather straightforward when it started off. There was an option: either we build something internally, or we adopt an open source project that was active and doing very well. From that perspective, when we looked at the landscape, it was Cilium. We had been interacting with the Cilium community from the Kubernetes angle throughout, so it made more sense for us to adopt a project that we could contribute to and that we could use. And that is how we ended up with Cilium.

Let me ask, since you brought it up: if Google had built its own version, how would it have been different? What would you have done?

We have a name, too, by the way. Not to talk about it right now, but we have a code name. We're still working on it.

Is that what you're saying?

So the way it is, is that right now we have decided that Cilium works for us. So we are working with the Cilium community, Thomas and others, on modularization of Cilium. The main thing is, our customers, our users, want the power of "and", as we call it: they want to use open source functions, or they want to use the enterprise functions of Isovalent, and they also want to use all the goodness that Google is bringing. When we have to do this, there is a need for that modular architecture. So we're working with Cilium on this very actively, to make Cilium act more like a platform on which we can bring multiple network functions. At this point, I think that is what our north star is: a dynamic marketplace, an open ecosystem, and we believe we can achieve that with this.

Fair enough. I had to ask that. Andrew, you've been looking skeptical at times over there.
Oh, no, I'm just thinking about my answer to that question. So, earlier I said Cilium is the obvious choice if you want fast network policies, but obviously there are other ways to do network policies in Kubernetes, even with BPF. I think the reason we went with Cilium was more so around the community, and kind of the pedigree of the team working on Cilium. In the past we've gotten on Slack and asked for features that appeared a few months later with a full test suite, and yeah, it's just been great so far.

Yeah, I guess I wanted to second that point: the community around the Cilium project is really strong, and we've enjoyed it. We've gone into the Slack and contributed pull requests and issues. We had an issue that we ran into with some large-scale processing that was going through the L7 proxy and out to the public internet from one of the clusters we were working on, and we diagnosed it together with the team over Slack, as well as putting that information out into GitHub. It turned out there were a couple of other folks experiencing the same problem, and we were able to narrow it down. That kind of collaborative approach matters, especially in an enterprise environment where you have to be careful about what data you're able to share and things like that. We were really able to work well with that team, and we felt confident in the community as we built our product on top of it.

So what's your sense of the community right now? Because the community is so important to a project like this, right? What's your sense, since you've been here all day?

Well, I do like the community a lot.
I think there's leadership in the vision of what eBPF does. It's actually rare that you're able to see the group that maintains the kernel pieces of it be everywhere, reachable every day, any time of day: you Slack them, they're going to answer if you have a question. And the breadth of the skill set we've seen from the group: John Fastabend, who was speaking this morning, was the guy doing LLVM and P4. I didn't know that before, so you think, okay, I want to do eBPF, and these brains are already hooked on it. So it's really strong, and the user community is also quite impressive. It's rare that you will see all the hyperscalers involved in the same technology, adopting the same technology. Normally you end up with a fight that you have to deal with, and now, no, it's all agreed upon, so that's actually quite interesting for us. Some of our issues are around finding the right platform that can run everywhere, and this is kind of getting there, to that level, so that's interesting to us.

One point I want to build on: as part of the community, it does seem like there's a great understanding from the maintainers, who are working at the kernel level, in the real nitty-gritty of the eBPF code that end users don't experience. They have a good understanding of where it ends up when it comes into users' hands, right?
I think John, when he was talking this morning about all the context that end users want to see when stuff comes out of the kernel: you don't want to see inode numbers, and you don't want to see network namespace IDs. You want to see the Kubernetes data, right? And bubble that up, and potentially even get data about the cluster, and the even higher-level data that can be gathered. Because when you're trying to operationalize products in this space, like Andrew said, you're not working with a single Kubernetes cluster; that's not how anyone deploys these days, right? There are hundreds of clusters, potentially thousands of clusters, each with their own set of nodes. So you need that context to be able to make an actionable decision from, say, an event that happens on a node, which has to bubble up across pod, namespace, node, cluster, region, all the way up.

We've been talking about having all those clusters to deal with. Is that the problem you're seeing as well?

So, we've kind of sidestepped that problem for the moment. We don't have things like Cilium mesh set up between some of what we might call the legacy clusters.

You sidestepped it by not dealing with it.

Yeah. The mesh integration, routing to those systems, is currently in development. What we do have is a multi-VPC setup in one of our clouds right now, which is where a lot of the Cilium network policy has come into play. And there's something we said earlier that sparked something, and now it's living in my mind; I'm sure we'll get back to it.

Sounds good. Now, as you're putting Cilium and everything else into production, how much are you relying on your own in-house experience, and how much are you
outsourcing to vendors at this point? Because you just mentioned a couple of issues where I thought, that's kind of a problem for the vendor, maybe, more than the community. What's the state of play there for you right now?

It depends, I would say. It depends on how ugly your applications are. If you end up with in-house-grown applications, like I have legacy systems that came off a VAX/VMS system, still, that's not something you can easily port into a Kubernetes cluster when you start to migrate. So there are some hooks to do, and this is where you'd like to have more hand-holding from, actually, the Cilium folks: is there a hook you can do in eBPF to try and make it work? And that's something I could not do if I were mainstreaming any other kind of technology, or even with the major vendors we normally rely on; there I end up with, it's on the roadmap, we'll have it in three, four years, maybe two years, because you're the first customer that asks for it and the volume is not that big. That's the end of the story. With eBPF, you don't have that issue anymore.
It can get fixed. For those kinds of things we're more involved, because we know our applications, which are sometimes really custom. For the more mainstream stuff, we try to leverage the ecosystem as much as possible. If it's been solved somewhere else, why do we need to reinvent the wheel? Although I just did a talk about reinventing the wheel of networking, but that's pretty much it.

Do what I say, not what I do. James?

Well, yeah, I was going to say, a good consultant answer is "it depends", and I second Daniel there. For the large percentage of enterprise customers that we see, their application integrations are not that deep. But when you get to people working in more unique environments, we have a customer that has devices deployed onto drones and things like that; those are the people that have a deeper need. When you have unique hardware, unique network requirements, that's where the hooks and capabilities that eBPF provides, to modify your execution environment from the outside, really let you reason about things nicely. And I guess I've drifted away from the question a bit, but I think the community is strong, the vendor ecosystem is strong, and there's a fairly obvious line, once you go down into the implementation, for when something is a vendor problem. There's strength in the community; you start there, and then you build what you need.

Well, let's start with the community: the annual airing of grievances. What do you need from the community right now? What are you looking for from the community?

All right. So I think, for us, first of all, the community is all of us.
So that is the key thing. Now, even there, there is a nuanced answer. When we talk about eBPF generally, what we need, basically, is the ability to have better controls on who can install eBPF programs, and what kinds of programs they can install. In a way, we're almost looking at an eBPF program registry: who owns it, can they really install those hooks, and what have they installed, all of it. So that is one part, which is largely for the eBPF community, and as I said, that is a need throughout Google as well as for my own products.

Then comes the second piece: as a provider of the Kubernetes engine, and the data plane for it, the biggest thing we need is the ability to compose, to put together multiple technologies for our users. And at the same time we need to be sure it's well tested, it's hardened, because when we are supporting, let's say, a 15,000-node cluster and all of that, it really needs to scale. So we can't just say, bring whatever you want on top of this, and then, you know, tough luck.
So we basically walk a line where we bring an opinionated solution when it comes to, let's say, the Kubernetes data plane that GKE, Anthos, and GDC use, and at the same time we also want to enable customers to have flexibility. So from the community, we need the ability to really bring all of this together, making it modular, making it a plugin model, making it composable, and it should all work together. One of the things that I see when users really look into this problem is that they all ultimately need to solve the business problem. The data plane is a means to an end, and all that business logic of the control plane is also a means to an end, but they need to express that control plane. The idea is, if you're able to express it in a manner which is consistently executed no matter what data plane is running, that is what we need, and in a way, that is what we are marching towards.

It sounds like one of those classic enterprise moments in a startup's lifetime: "oh, now we need enterprise controls." It always is, right? There's always this.

It is a real demand from our customers, right? "I need to solve this problem with this." And we need to make sure there is a safe way of introducing it in a product which is going to be used by everybody. So that is the way.

One challenge I see at the moment with the community, not really the community, more the overall ecosystem of eBPF: it's kind of a trendy term right now. It's like, whatever you're thinking about, it needs to be AI/ML or it needs to be eBPF. The thing is that eBPF is not locked to Cilium.
Anybody can build platforms based on eBPF technologies, so you end up having, well, in my utopian world, Cilium could be kind of the SONiC of this space, a common data plane that most people can use. Because otherwise I end up doing an RFP and getting 25 versions of an eBPF data plane that almost, sometimes, do the same thing, but are just different enough to matter. There's commonality that could be gained out of this; maybe the 80/20 rule fits: one data plane model existing for everybody, pick this one, and if you want to customize it, that's where what Purvi said about modularity can come in. But not reinventing it every time. That's my fear right now around the ecosystem: it's a trendy thing, everybody will want to build their own version, and in the end I'm going to end up with 25 versions of a data plane using eBPF.

Yeah, one of the things that I took away from the day today was that there are a lot of really powerful tools operating at the language and protocol level, right? We heard about gRPC, HTTP/1, TLS interception, and each one of those had its matrix of things it could do. But that's not the way we think about it from an application perspective, right? Your application has to talk to your database and your other services, and the protocols for that communication aren't generally dictated by you saying, "I want to instrument my application with this gRPC support because I've got this great library that's going to allow me to observe it," right?
And so what I'd really love to see, and from that last talk we just saw, there are a ton of really excellent tools for data collection when you're on the node, within the context where you're having an issue. But I need to think more from an application perspective: I've got all of this data that I need to gather about how this thing is communicating, making sure it's successful and ensuring appropriate operations. How can I take all these powerful individual tools and wrap them together into something I can act on from my application development perspective?

Yeah, so there are two things that I think we need from the community at the moment, or that would help to contribute to the community, and I have an anecdote attached to one of them. I really enjoyed Tomas's talk earlier, from Tigera, about debugging eBPF. In the past I've worked on teams where we got used to staring at the massive iptables dump from the CNI and doing the jumps in our head: okay, this service should be routing to here. It was a little more understandable in that regard, if you had experience doing that type of thing. And if you look at the Cilium repo, especially at the XDP program itself, you see they're adding verdicts, adding reasons that packets were dropped, and associating them with the network filters; it's a very active area of development in Cilium.

The story I have to go along with this: we were experiencing kind of the worst kind of Heisenbug, where pods would just randomly stop being able to deliver traffic, and we were on EKS. What we were able to do at the time was actually dump the BPF bytecode and do what you described, which you wouldn't necessarily expect a user of Cilium to do. At that point we were staring at this bytecode, and we were just like, well, we know it's the same bytecode for
this pod and that pod, and this pod's working and that pod's not working, so it must not be the BPF. It turned out there was a race condition in the ENI allocator in Cilium that had already been fixed by Datadog. Shout-out to the engineer at Datadog who fixed that problem. But it was kind of the Wild West in terms of debugging, and it relates to this observability question: how can we tie what we used to think of as a socket with an IP address, or nftables-type debugging, to this new world where you have these TC hooks that can just redirect flows to other interfaces, that type of thing? And I think we're seeing the tooling being developed, so it's really cool.

And that bug is fixed?

Yes. Oh, we deployed the new version of Cilium, and we haven't seen the issue since.

I want to add something to this on observability. I feel that whichever ecosystem you're running your Kubernetes cluster in, whether it's bare metal or a cloud provider, and cloud providers bring their own SDN layer under the hood, there is a need to connect all the context. Right now, the contexts of the different observation points, first of all, are not consistent, and even if the context is consistent, it is unclear at this point whether our users are able to connect it together in a proper, logical manner, to say, okay, maybe it's the pod, maybe it's not the pod; maybe it's the node, maybe it's not the node.
Maybe it's not this, maybe it's something else. I think the whole observability space really needs to come together in a way where there is a context understood by the application developers, or platform owners, or network admins, or security admins; a way where they can all make sense of the same data, each in their own manner, with the same context. That is actually a big challenge, and I don't necessarily consider it an eBPF problem. I think eBPF helps, in a way, to observe all of this. It's great that you can get the application context along with the network, which was really not possible if you weren't running close to the application. But beyond that there are other layers, and somehow you have to be able to put together a context. That is where, I think, all of us are trying to solve this problem for our own end users.

Yeah, I think you bring up a great point: this kind of tooling is awesome at the Kubernetes node level, when you're that close. But then there are these other contexts that get layered on top, especially with the cloud providers, right? They've got their network layer on top, including their own network security, and then you lose the portability you're experiencing with Cilium at the Kubernetes layer. We were working on a product that was getting replicated into each of the major cloud providers, and it's like, yeah, Cilium was great when we were talking about isolation on the pod executions, but when we got out of that and we got to the cloud providers, there's just not a common language, right?
And you don't have, you know, I'd love to see something like Cilium identities, which are so powerful inside a cluster, out there, and this is potentially where something like a multi-cluster mesh could be helpful as well.

The approach we have taken there is making it Kubernetes-centric. It is Kubernetes first, developer first; we like the developer-first model of Kubernetes. So within that model, if you're able to annotate, for example, the telemetry and other things being generated from the other instances in the cloud, and we have the same Kubernetes context on them, at least it solves the Kubernetes problems. That is what we are aligning towards.

Andrew, you look like you had a eureka moment.

Not so much a eureka moment. Well, actually, I did have a eureka moment. In terms of looking at things from the pod level, then the node level and the interconnect level: we were looking at Hubble, and we could tell whether a pod was on the same node as what appeared to be the broken pod. The issue was that the IP associated with the pod would be dropped at layer 2 on the Amazon side, but if the pods are on the same node, then Cilium just routes the traffic and the flow works. We actually had that level of insight, where we could look at the node, and luckily we were all within the same cluster, so these things had
So these things had Like consistent identities within the cluster and So What was my point it my point is like we saw it like earlier today We we saw that the next version of Hubble is going to have Grafana style metrics integrated into it, which is awesome Sometimes you look at Hubble You show someone Hubble for the first time and they don't realize how Powerful it actually is that you're seeing the flows at a pod level annotated with these Identities and I think that's gonna help take things a little bit to the next level where you're gonna be able to correlate Metcha Yeah metric type events between your systems at least perhaps visually You're gonna see on how much trouble I am from observability Me I have to think about communities of service ability with a BPF And I have to think about a 5g car with nw da from 3g pp. That doesn't care about any of this They have their own model I end up with our residential gateway that does TV streaming with TR 369 The USP model which has yet another not observability model that fits So anything that actually gets the application level. I'm happy because it's the one thing I don't have to care about anymore. It's fixed. I've got a bull. I've got The dragon I've got open telemetry. I can jagger trace the thing I can see everything my trouble is really when these things need to get integrated with the rest Now my I can my stress factor goes quite up because I have way more work there than in the cloud piece so For me that that's not that's not a real issue as of yet when I my issue is when you look at all that ecosystem of And out and I I need to have a correlation around why didn't my TV streaming app on her home that failed because a cluster pod Inside some kind of centralized cloud broke. This is where my end-to-end observability needs to be more standardized That's my child some of my challenges All right, I think our time is pretty much up here. 
Thank you very much for being here, and we'll do it again next year, and we'll see what we talk about then.

Definitely. Thank you. Thank you. Thank you very much.