Hey everybody, my name is Manu Bretelle. I work on the BPF CI at Meta. These are the folks I've worked with on it: Daniel, who's in this room somewhere, I guess, over there, and Mykola in the back as well. So, one year later, because Mykola presented it last year at LSF/MM/BPF 2022, I'm not going to go over how the BPF CI works; I'm just going to give a quick primer here. There is a link to Mykola's actual presentation there. Essentially, the BPF CI has three components: Patchwork on one side, GitHub, and KPD, the Kernel Patches Daemon. KPD polls Patchwork for active series coming from the kernel mailing list. When we see a new series or an update, we take the patches, apply them on top of the bpf or bpf-next branch, apply some changes to be able to run tests, and create a PR on GitHub. GitHub, seeing this change, kicks off the GitHub Actions workflow that was mentioned before. That dispatches work onto different GitHub runners for different architectures, gets the results back, KPD picks those up and sends the results back to Patchwork, so maintainers can get an overview of the changes. As mentioned yesterday, there is a way to run tests in the CI without going through the mailing list, which is pretty useful if you want to play around and you don't have something that is ready for review yet. Here again, there is a link at the bottom that goes a bit more into the details, but the TL;DR is: you clone the kernel-patches/bpf repo into your own personal GitHub repo, create your own branch, create a PR from your branch to kernel-patches/bpf, and then wait for the CI to run. So what has changed since 2022? Last year we were running the CI for x86_64 and s390x. We were only building the kernel with GCC, building the selftests and running the selftests in a VM. This year we've added arm64. We also build the kernel for x86_64 and arm64 with both GCC and LLVM. We still build the kernel and the selftests and run the selftests, but we also recently added support for veristat, which was added by Eduard Zingerman, and this allows us to catch regressions: essentially, taking veristat from a build of the bpf/bpf-next HEAD, running it against the local PR, and seeing if there is any regression. And at the moment about 75% of the kernel/bpf directory is covered by unit tests, or by tests in general. What else has changed? We moved from running x86_64 in a VM to bare metal. One of the main reasons was that it is faster, and it's much, much faster for arm64. As an example, one of the issues is that AWS instances don't expose KVM; they don't do nested virtualization. Running the arm64 tests in a VM without KVM was taking about two hours. With KVM, it takes about four minutes. I probably didn't use all the right flags and options, but it was a significant change. Manu, quick question: didn't you say a couple of days ago that we were running, sorry, ARM in QEMU on x86? So when I was playing around, I did try to run the arm64 VM on an x86_64 host. That took about 20 or 30 minutes. Okay, so that was just an experiment; in production we're actually running it on bare metal. The four minutes was a bit more appealing. So that was a change needed, essentially, to be able to add the architecture. Overall the build and test time went from 14 minutes to 8 minutes, give or take. Those were some of the changes.
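To make that KPD flow a bit more concrete, here is a minimal sketch of what polling Patchwork for new series could look like. This is not the actual KPD code: the project name, query parameters, and field names are assumptions based on the public Patchwork REST API, so treat the whole thing as illustrative.

```python
import time
import requests

PATCHWORK_API = "https://patchwork.kernel.org/api/1.2"
PROJECT = "netdevbpf"  # assumed project link-name; the real KPD config may differ

seen = set()

def poll_series():
    """Fetch recent series for the project and yield the ones we haven't seen."""
    resp = requests.get(
        f"{PATCHWORK_API}/series/",
        params={"project": PROJECT, "order": "-id"},  # assumed filter/ordering params
        timeout=30,
    )
    resp.raise_for_status()
    for series in resp.json():
        if series["id"] not in seen:
            seen.add(series["id"])
            yield series

while True:
    for series in poll_series():
        # A real daemon would download the mbox, apply it on top of
        # bpf/bpf-next, and open a pull request against kernel-patches/bpf.
        print(f"new series {series['id']}: {series['name']}")
    time.sleep(60)
```

The real daemon also handles updated series, rebases, and closing PRs, which is all left out here.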
Historically we used to build and then run all the different tests from the selftests, which is essentially test_maps, test_verifier, test_progs, test_progs-no_alu32. Now we build and then we run these different tests separately, so we get a bit more parallelism. It costs more overhead to build the VM images, but in the end it runs faster and it also makes it more obvious what is failing. Daniel also added incremental kernel builds. Last I checked, this doesn't make much difference on the beefy bare-metal machines that we get from AWS, but it makes a difference for the s390x machines, which are not as beefy, and it also makes a difference if you run on GitHub-hosted runners or less powerful machines. It's a small improvement, but given how happy people were with it, I think it was a big one. Historically it used to be hard: we knew which tests were failing, but when a test failed you had to go and scroll through the GitHub UI, which was quite difficult. You could, through a few clicks, go through the raw logs and try to find your error. Since then we've added a way to present the actual error directly at the end of the test run, and you can expand the error and get the actual error logs. So here you see that the test is failing because we get 10,001 packets instead of 10,000. It's not a super complicated change, but it makes a whole lot of difference for the maintainers. Another thing is observability. I mentioned that the runtime went from 14 minutes to 8 minutes, give or take; that's roughly me squinting at how long it was taking before and after and averaging it. Essentially we didn't have any metrics before. GitHub does provide a UI that gives some information, but it's pretty limited. They do also provide a REST API, so we built a tool to get that information and throw it into our backends. At the moment we use internal Meta backends because they're just already available: we don't have to deal with infrastructure, we just push the data there and we've got tooling to present it. Essentially this helps answer questions like: when did this test regress? How long does the build take over time for a specific compiler, a specific architecture, a specific test run and so on? And this is how it looks. For instance, for the failing tests, this was taken last week: it's obvious that this btf_dump test was failing on all architectures with all the different compilers. I don't remember the cause; I think Martin fixed that. There's also something wrong with the sockopt test on s390x with GCC, and something is wrong on arm64 for that specific test. So this is pretty valuable for understanding where things are failing, and whether they fail widely across all architectures or not. Keep in mind that this is taking pull requests from people, so it's kind of expected that things may break on their first attempt. But if something shows up pretty regularly at the top, it may actually be broken more systematically. Then we've got stuff like netcnt, which is more of a flaky test that comes and goes, and I will talk about that later. Build time. I was wondering, is this a public dashboard? No, I will get to that later.
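Since the metrics pipeline itself is internal to Meta, here is a rough sketch of the kind of collection you can do against the public GitHub REST API for the kernel-patches/bpf repository. Token handling and whatever backend the data ends up in are left out, and which fields you keep depends on what you want to chart.

```python
import os
import requests

REPO = "kernel-patches/bpf"
API = f"https://api.github.com/repos/{REPO}/actions/runs"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

def collect_runs(pages=3):
    """Yield (workflow name, conclusion, start, end) for recent workflow runs."""
    for page in range(1, pages + 1):
        resp = requests.get(API, headers=HEADERS,
                            params={"per_page": 100, "page": page}, timeout=30)
        resp.raise_for_status()
        for run in resp.json()["workflow_runs"]:
            yield (run["name"], run["conclusion"],
                   run["run_started_at"], run["updated_at"])

for name, conclusion, started, ended in collect_runs():
    # A real collector would parse the timestamps, compute durations per
    # architecture/compiler, and push them to a time-series backend.
    print(name, conclusion, started, ended)
```

The per-job timings come from the companion jobs endpoint of the same API; the idea is the same.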
That's why I mentioned that this is something we have available within Meta infrastructure; I didn't have to solve that problem at the time, so it was one less problem to deal with. But yeah, essentially, having this data, you can get meaningful information that you can go back to and make better decisions about what has to be fixed or when things started to break. This is the build time; essentially the P94 of the build time, in seconds. Here we see the different steps for the tests: test_verifier, which is most of it here, test_maps, test_progs, test_progs-no_alu32, and then the rest. Flaky tests. The way GitHub Actions works is essentially that you run a workflow, that workflow spins up different jobs, and if all jobs succeed, the workflow succeeds. The graph is pretty spotty; that's because by nature our CI tends to be quiet, let's say on weekends and at certain times of the day, and then bursty when people work a bit more. But essentially we can say that we are roughly at 90% job success. However, that 90% job success translates into more like a 20% workflow success rate, which is pretty low. So essentially the CI is rarely fully green, and it would be much better if it was green: when something is red, it would be obvious that something has to be fixed, while at the moment you still need to look into the details to know whether it is red because something is flaky or not. So yeah, flaky tests are the bane, essentially. These tests fail because there are 10,001 packets instead of 10,000, breaking every so often and sometimes not. I just took these two examples, but I don't mean to point at those specific tests; there are others. I've got a feeling that running the tests in a VM tends to make the problem worse, and also probably harder to debug. The fact that they're flaky makes them actually pretty hard to reproduce, so it's hard to just reproduce and fix them: you try two, three, five, ten times and it doesn't reproduce. How do you go about it? But I think as a community we should really go and fix them, because until the CI is fully green it's going to be very hard to keep things clean. So Manu, while we're on that slide, we have a bunch of networking-related folks here, right? And I think in some previous cases, similar flaky tests were fixed by using netns and stuff like that. Is that a kind of general solution we can use? Can someone advise? If the test is counting the number of packets and you're running it in a VM where potentially something else is happening, that will be flaky; but if you do it in a network namespace, it should be more reliable, I think. So netns solves part of the problem. But for example, with this packet counting, it could be some noise, or it could just be that sometimes we send one more packet and count one more. So in general, from now on, we tend to be super careful when we're reviewing test code for this kind of situation. Yeah, maybe we need ranges or whatever to account for that. But in some cases we miss some events, like 22 instead of 25. But yeah, that's the thing, right? If you have flaky tests, they're going to fail at some point, and in the short term the best thing you can do is to try to set up an environment where whatever delays happen at runtime minimize the chance of failure.
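As a back-of-the-envelope illustration of why roughly 90% job success collapses to roughly 20% workflow success: a workflow is only green if every one of its jobs is, so with independent jobs the success rates multiply. The job counts below are assumptions for the example, not the real matrix size.

```python
# If each job passes with probability ~0.9 and the workflow needs all of
# them to pass, the workflow success rate drops fast with the job count.
job_success = 0.9
for jobs in (5, 10, 15):
    print(jobs, round(job_success ** jobs, 2))
# 5  -> 0.59
# 10 -> 0.35
# 15 -> 0.21   # roughly the ~20% workflow success rate observed
```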
But I think the only way to stop a flaky test from failing is to make it deterministic. So if we have tests that are super flaky, it might be worth looking at them and seeing if we can either change them or rewrite them or something like that. Yeah, so here, we've got a deny list, we've got an allow list; should we have a flaky list? And if a test is flaky, we mark it as a warning. The thing is, we have a deny list and we almost never look back at it to fix those tests. So maybe we should be a bit more strict: when we put something there, we should capture why, and we should try to go back to those tests and fix them eventually. There may be a technical reason, there may be a lack of time or whatever, but I think we need to do something about those tests. Maybe we retry multiple times and if they succeed once, that's fine; but this needs to be built, and I think it would help a lot. So about fixing those tests, for the community in general: some of them are really hard to reproduce. Is it okay to add some debug code, print some more, to try to narrow down the reason? Yeah, in the test, that's fine. Well, I'm asking and suggesting: maybe all the networking-related tests should just by default be running with netns as a baseline, and then if you still have flaky tests, we add more. I don't see any problem in adding a little bit of debug output, because by default it will be filtered out if the test succeeds, so it doesn't pollute the results. And for sure, whatever you need to contain the environment: if you're going to use packet counts, you need to contain the environment the test runs in. The host may have whatever probe or whatever running that is going to generate traffic. I'm not counting the parallel mode here. Yeah, we don't run in parallel mode right now, right? But I do use parallel mode locally all the time, so hopefully tests already take into account that there might be other tests running in parallel. I didn't mention that here; I actually filtered those out in that case just to make sure we were looking at the right thing. The parallel mode actually works surprisingly well locally. Yeah, but given that we already have difficulties with the non-parallel one... Cool, so yeah, the flaky tests have to be addressed. I think the denied ones have to be addressed too, and we should try to limit the amount of stuff we put there. As maintainers of the CI, which is something I will get back to, sometimes you need to put something there because you want the CI to be green again, so it doesn't get blocked or noisy because of something else. Talking about maintenance and ownership: at the moment a lot of the maintenance is within Meta. This data is currently in Meta, KPD runs in Meta, but it doesn't have to be; a bit of work could be done. Actually, work has been done recently to get the KPD code open-sourceable again. And we should strive to move this more outside, more of a community effort rather than just within Meta. For these stats, technically the code could run internally and still export to different databases or different data sinks. Now, one thing I don't want is one extra service to maintain.
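On the retry idea above (retry a known-flaky test multiple times and accept it if it passes once), here is a minimal sketch of what such a wrapper around test_progs could look like. The flaky list and retry count are hypothetical, and as noted, this would need to be built properly in the CI, ideally reporting every retry so flaky tests don't just become invisible.

```python
import subprocess

MAX_ATTEMPTS = 3
FLAKY_TESTS = ["netcnt"]  # hypothetical flaky list, each entry captured with a reason

def run_flaky(test, attempts=MAX_ATTEMPTS):
    """Run a single selftest up to `attempts` times; pass if any attempt passes."""
    for i in range(1, attempts + 1):
        result = subprocess.run(["./test_progs", "-t", test])
        if result.returncode == 0:
            # Surface retries somewhere visible so the flake still gets fixed.
            if i > 1:
                print(f"WARNING: {test} passed only on attempt {i}")
            return True
    return False

ok = all(run_flaky(t) for t in FLAKY_TESTS)
raise SystemExit(0 if ok else 1)
```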
As much as... I mean, I'm an infra person, I like it when the infra is maintained by somebody else. So it would be nice to find some existing external service, which is already maintained, that we could use for that. The code that does this data collection could also easily be brought out with a bit of change; I didn't do that right now because there was no need, it was just faster to avoid it. But yeah, how do we make the CI more of a common good? It benefits everybody: you get quicker signal on whether your stuff works, and if we catch bad code before it gets into the BPF branches or even before the mainline kernel, we all benefit from it. So it would be nice to make this more of a common effort. And then once we have a green CI, we can also enforce having tests for new code; it's easier to enforce a CI once it's known to be very stable. I think we're not far from it, but there's still some work to be done here, and participation is welcome. Another thing that is difficult is: when an error happens, how do you reproduce it? As was mentioned before, that's not an easy setup: you need to build the kernel, you need to build the image, and you need to run all of that together. I think there may be a way with vmtest to make it easier for people to re-run the environment. I mentioned yesterday that the artifacts are already available from GitHub, so we may be able to plug that in: fetch the artifacts, run them in a VM, and give people an environment to reproduce the issue. That is not always easy. And yeah, I guess that's about it. So, these are essentially the questions. I guess if you also want to, longer term, move it outside of Meta as you mentioned, the eBPF Foundation would be the natural fit, taking the money from there to pay for the infrastructure. Yeah, I think it is more about reusing existing infrastructure, which may be better than building another one. But yeah. Can you talk some more? In the beginning you mentioned that 75% of kernel/bpf is covered when you looked at the test coverage, right? Yeah. What are the bigger gaps that we have right now? I'll defer that answer to you, Mykola. So, there are a few gaps that I noticed. One: we are not enforcing selftests for every change that goes into BPF, and when I was reviewing code coverage there were a few recent patch sets that didn't add selftests. So we just need to be more mindful; maybe we can decide as a community that we don't allow merging changes without selftests, right? Another big issue is that we don't test failure branches right now. For example, if a memory allocation fails, we don't have good test coverage. We had a discussion internally; some fuzzing would definitely help. I looked into KUnit, but KUnit will require more experimentation to understand whether it's even feasible to use in the BPF scenario. Can we use the BPF error injection framework to test error paths? Have we thought about that? I started working on a prototype, but it's not there yet. The big problem, right, is that for example in memory allocation we use lots of inlined functions that all end up calling kmalloc in the end, so you need to make sure you isolate the failure to the very particular area.
In order to do this, you need to filter on the stack trace, right? And then we need to integrate blazesym, which Daniel presented the other day, to convert the addresses into symbols. But yes, I think there is definitely an opportunity to build such a framework. You mentioned fuzzing; I thought that syzkaller does fuzz the BPF syscall. Is that something we could extend, or do you think some other kind of fuzzing would be necessary? So syzkaller already has error injection for memory allocation, right? But then it's not deterministic. I mean, I guess you want to have this more deterministic, right, to see when... I was thinking whether we can have separate fuzzers that would be more directed at the BPF subsystem, and not at everything else, because we can test the broader kernel, but when validating BPF do we really care about the broader kernel? We already have fuzzers that test the broader kernel. Did you see the BPF-specific fuzzer that the Google folks recently open-sourced? Like, where we had this precision bug, they basically found it with that one, through fuzzing. So that might be something. So for the bpf_override_return stuff, the failure simulation, you need to actually have a list of functions that are allowed to do this; you cannot just attach to any kernel function and trigger the failure. But I actually had a question about how we make this more discoverable to people that don't know yet about the BPF CI. You know, they send an email, they see email; some people probably don't even know about Patchwork. So actually, at the bottom here, we talked about sending emails multiple times. I think it's easier to send emails once you know that you have something which is high signal versus noise. But that's something we could possibly do, right: start sending results back. I mean, obviously it's a leading question, but we have an audience here, and I'm actually wondering how people would feel about a little bit more spam. Let's say, what if we send an email every time someone sends a patch set: we automatically reply from some bot with links to Patchwork, saying the BPF CI run started and all this stuff. And then if something fails, obviously, maybe another email. Personally, I would be bothered by getting one every time, but getting results, like "your patch has been run through CI and you can get the results here", I would personally find valuable. But this is very subjective to every single individual. I think it's also tough: if most runs fail, then I think that's just going to become noise very quickly. I would find it useful myself, but I feel like the shorter-term goal should be to get a consistently green CI, and then we can think about it. Yeah, that's why I put it there: get a reliable base first, and then we can enforce more. I have a question for the Cilium folks. We recently added veristat runs to the BPF CI, and I know that you guys have pretty complex BPF programs open-sourced. Can we somehow integrate that, basically build the Cilium BPF object files and run veristat on them for every patch set, so we can identify verifier regressions, or whether we start failing verification on any of the object files? I mean, I would love that. The one question is how we could do this logistically.
I mean, the Cilium repository is changing around the BPF code all the time, and then you have different configurations: quite a complex test matrix and different Cilium configurations where we build in different code and then test this against a variety of kernels to make sure that it passes the verifier and so on. So I wonder, maybe you have some thoughts. One thing we do right now, as I said: we have a bunch of kernels, and we don't run all configurations, we run some configurations. So I think from our side, one thing that's easy is to add another kernel, which is, I don't know, bpf-next or whatever, and run it frequently. And that's easy; to some extent we already do it. The difficult part is what happens if it fails. We can, I don't know, automate sending an email; I guess if all the previous kernels have passed but bpf-next did not, then we can raise an alarm so that, I don't know, me or John have a look and contact the list or something. I don't know. Yeah, I mean, I personally wouldn't want to integrate that with our core CI runs. I feel like the right architecture for that would be to have a pub/sub model with Patchwork or something like that. Because, yeah, obviously Cilium is a very disproportionate presence in the BPF community, but if we wanted to start to scale out the CI so that other companies and other people could start to test either their own programs or their open-source ones, it feels like them being able to subscribe to something is probably more scalable. There is probably value in just getting a manual dump first to get a baseline from, and rerunning that all the time. It's better than nothing, and then we at least get some benefit from it. Just to clarify, I didn't mean that we want to run Cilium tests. I meant that we want to get the BPF object files built from the Cilium source code and run veristat on them, right? Only to check that the verifier is not regressing; we don't care about the business logic of Cilium per se, we just know that Cilium BPF programs are complex enough. Yeah, whether they load, and the verification speed. So maybe we could have a repository with object files and veristat would pull them from there. And I don't know whether from the Cilium side we could have a mechanism where we regularly push updates to it; maybe that could be an option. So, it's actually not that hard to build the Cilium objects. I did it manually very easily without knowing anything about Cilium. So I think we can just have a GitHub action that pulls Cilium, goes into one of the subdirectories, and builds the object files. The problem is that some of the code is not yet libbpf-compatible, so it cannot be loaded by veristat. And this is literally everything that I had to do to make it compatible, which is pretty minimal changes, right? And does it work with Clang? That was the other thing I was going to say. So, I guess on the previous slide: we do run bpf-next nightly, and all your programs successfully validate, I think. Oh really? Maybe I'm forgetting something, but a bunch of them definitely validate. A bunch of them. It probably depends on the features too, right? So in some of my patch sets I actually publish Cilium stats and various stuff. So definitely a bunch of them.
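A sketch of the repository-of-object-files idea discussed here: pull prebuilt Cilium BPF objects from some agreed location, run veristat from the selftests over them on the PR kernel, and compare the result against a baseline. The object directory and file names are assumptions; the CSV output and compare modes are veristat features as I understand them, so double-check the exact flags against the veristat in your tree.

```python
import glob
import subprocess

# Hypothetical location of prebuilt Cilium/Tetragon BPF object files.
OBJ_DIR = "/tmp/cilium-bpf-objects"
BASELINE = "baseline.csv"
CURRENT = "current.csv"

objs = sorted(glob.glob(f"{OBJ_DIR}/*.o"))

# Run veristat on the PR kernel and emit machine-readable stats.
with open(CURRENT, "w") as out:
    subprocess.run(["./veristat", "-o", "csv", *objs], stdout=out, check=True)

# Compare against stats collected on the baseline kernel; a program that
# stops verifying, or whose instruction counts blow up, shows up in the diff.
subprocess.run(["./veristat", "-C", BASELINE, CURRENT], check=True)
```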
Another interesting one would just be the feature permutations, right? Because there are a lot of ifdefs in there. We probably shouldn't block the CI on Cilium failing, but it would be good to have a signal, right? So yeah, I just wanted to show how minimal the changes are; this is all that's necessary right now. Maybe one more small suggestion, just in terms of change management: ideally we're not picking the latest Cilium, in case we randomly break something there. In Cilium we periodically take, I think, a bpf-next snapshot and then integrate that into our CI, so that's sort of a stable-ish latest kernel plus a development version of Cilium. I guess if we do this on the kernel side, then we probably want the latest kernel-patches or whatever, plus maybe the most recent Cilium release, something that's a known, not-so-frequently-changing target. And then maybe once a release cycle, the Cilium folks would come in and just say, all right, here's the new version, or something like that. I mean, the Cilium stable versions would definitely be interesting: all the object files with the most important configurations, and then master. Yeah, that's a good first step, I agree. I guess we would need a stable-ish baseline anyway. Yeah, agreed. And again, we will probably get a lot more benefit in the first place from getting whatever we can and testing it; even if it's not perfect, it's going to give much more signal than striving for something perfect. What would be a good cadence for building bpf-next, nightly or weekly, for that matter? So I think it would be easy to build, because right now we don't really have a good cadence for building bpf-next for our tests, so that would be one good first step. And then we can also test the releases against whatever new bpf-next we build. I mean, ideally I would love for this to run on every patch that comes in, right? Because if there's a verifier change that blows up the complexity or whatever, then the user who submitted the patch carries the burden of making sure we're not regressing anywhere, right? The other way around is, okay, we are building this on a daily basis; I mean, it would be ideal, but then it's on us again to bisect and find it. So from our end on the BPF CI, besides the patches we apply from Patchwork, we also monitor bpf and bpf-next, and any time they change we will rebase, potentially cancelling whatever was running, on top of the latest bpf-next or bpf. And we also run some kind of cron build of... I think a good first step would be: if the BPF CI builds these images for every patch and we can use them, it's much easier to run our tests. So if we can get, I don't know, a URL with a kernel image for a given patch, then it's easier to run our verifier tests. Yeah, the artifacts built by the bot should be accessible. And the artifact contains the kernel image? Yes. Well, that's what I'm talking about, right? Yeah, it's a bit of code, but it's doable; it's a matter of having... I mean, we can get into the details outside of this. Yeah, I think there are two different issues, right? We want this in the BPF CI for everyone besides Cilium, to see whether we regress Cilium, just as an additional stress test, basically.
If you want to integrate it into your CI, you just use veristat and run it internally and do whatever you want; why the coordination? The problem is that your code right now is not loadable with veristat, and we need your help to fix that, basically. I guess I was thinking of an intermediate step, something that we can do today to get this, but I don't know if it's possible to actually fix the code. I think we should just build some infrastructure for Cilium and Tetragon and other OSS projects, extract the object files, and put them somewhere where veristat can run on them for all the patches, so that it's properly integrated. I mean, that would be ideal, right? Yeah, so one of the problems, for example, is that you use arbitrary section names, right? So you would need to fix that. The other thing is you're using tail calls and you have a special syntax encoded in ELF sections, so you would have to change that as well. And as we talked about in the hallway, you are switching to a Go library which actually supports the same syntax as libbpf for tail calls, so it should be easier going forward; but those are sort of the prerequisite changes, basically, for anything. Yeah, I totally agree with that; that's something we need to look into. Can we do the same for Tetragon? Yeah, that's also a separate topic. One other thing that is great about the CI, which we recently discussed with Lorenz, is that it's really useful for reviews as well, right? Because if you're used to the GitHub workflow, you can see the pull request and do the reviews that way instead of, for example, on Patchwork or otherwise. So one thing is, when we just started the whole CI, it was for libbpf, right? The hope was that we would have kind of a diff between two revisions of a patch set, because the way it's done, we actually tried to reuse the same PR for multiple versions of the patch set. And the hope was that we could detect what actually changed between two subsequent versions. Unfortunately, GitHub doesn't really support this: you can compare two versions, but you will also see all the baseline changes, and between two versions there are usually tons of them. The GitHub UI on the PR is very cryptic any time there is a push or force-push and stuff, right? A separate topic that I wanted to touch on is increasing community participation. We have an on-call rotation right now for code reviews on the BPF mailing list, where not only maintainers are doing code reviews but other folks are pitching in. Can we come up with a similar process for fixing CI issues? Because from the Meta side we have a few people working almost full-time on keeping the CI lights on. Can we come up with some process where other community participants take part in this as well? Maybe on a rotation basis, I don't know, but at this point anything helps. Yeah, and it would be good also to spread that knowledge beyond just one company anyway. Well, the nodding of heads was recorded. All right, I guess we are one minute away from lunch. Thanks.