All right. Hi everyone. My name is Itay Shakury. I lead the open-source team at Aqua Security. I was supposed to give this talk together with my colleague José, but he got stuck in the storm. José was an integral part of this presentation, but unfortunately he couldn't make it.

So, we're going to talk today about something you probably haven't heard about yet at this conference, which is supply chain security. But I hope it will be a refreshing take, because we want to see if we can take concepts or principles from the runtime security practice, which we have some experience in, and apply those at build time. Specifically, when I say runtime security experience, I'm referring to our open-source project called Tracee. I don't know if you've heard about it, so I need to just level-set. Tracee uses eBPF to tap into your system and give you access to hundreds of events that unveil how your system actually behaves. Moreover, it can help you detect suspicious behavioral patterns in those events, either using the library of built-in signatures that we ship with Tracee, or you can write your own. We call them behavioral signatures, based on the event stream that we generate. That's Tracee. It's open-source; go check it out on GitHub, just to make sure that when I say Tracee, you know what I mean.

All right, so I said we want to see if we can take runtime security concepts, like Tracee, and apply them at build time. To explain the motivation for that, I need to take you back a little bit, just a couple of years ago. Different times, in cloud-native terms; a lot has happened in two years. Those were very early days for the supply chain security category: before all the SBOM craze, before the executive order, before many of the tools and standards that exist today. Early days, but some meaningful attacks happened and opened our eyes, and everyone else's, to look into this stuff. Specifically for us, it was the Codecov breach. I don't know if anyone here remembers that, but basically it targeted GitHub Actions users. The attackers were able to compromise a very popular GitHub Action: Codecov, a code coverage scanning tool. Through that upstream of your upstream, they were able to actually compromise you and your build. A classic supply chain attack, and it made us and others think about what we could do to protect against that.

At that time, Tracee had already been open-sourced for a while; Tracee was open-sourced in 2019. So we thought, why not just try to use Tracee in the pipeline? I was actually a little bit skeptical: can we run eBPF in a managed service like GitHub? Apparently, yes. They don't care at all. Tracee just ran. We wrapped it in a GitHub Action. That was the first incarnation of the Tracee action. We just wrapped it in a GitHub Action; you add it to your pipeline, it starts running in the background, and it looks for any of the suspicious behavioral patterns that I mentioned. It worked. It was a very nice first step.

But the first lesson that we learned is that build time is not the same as production. Back then, Tracee had a limited set of signatures that we shipped with it, and they were very tailored to production. Let's take an example: enforcing immutable infrastructure. This is something that we want to do in production. With immutable infrastructure, you pre-bake your containers and ship them to production; you don't introduce any new software into production directly.
So that signature looked for any new executables being introduced in production. Very good, probably a best practice in production, but a very bad thing to do at build time, because the build server has pretty much only one job, which is to produce new executables. So it didn't work out, and we had to fine-tune the list.

But I think the more important lesson here is that we can make assumptions at build time that we could not have made in production. Tracee was built with production in mind. All of the signatures had to be very generic and abstract. We didn't want to assume anything about what you are running in production, what tools you are using, what your tech stack is. We don't care. We just want to look from below, from underneath, and see if something suspicious is happening. But in the build, we actually can assume some things. Say you are a Go shop: you know which tools you are using, you know your toolchain, and you are not going to switch to being a Python shop every other day, right? So we know, more or less, how the pipeline is going to look; it's going to be pretty consistent. We can leverage that to write more specific signatures, things that we could not have afforded to do in production. For example, let's do something specific to a Go build pipeline, like: the go.mod file should never change during the build. If you want to change it, do it before, open a pull request, have someone review it. But it should not change during the build. So we can do specific things like that, and we started to write signatures that look for very specific things. But as you can imagine, it's a very long list of bad things that can happen, and it's going to be very hard to maintain.
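To make that concrete, here is a minimal sketch of what such a build-time-specific signature could look like. To be clear, this is not Tracee's real signature API: the Event and Finding types are simplified, hypothetical stand-ins, written in Go just to show the shape of the check.

```go
// A sketch of a build-time-specific behavioral signature: flag the build if
// go.mod is modified while the pipeline runs. Event and Finding are
// hypothetical, simplified stand-ins, not Tracee's actual types.
package main

import (
	"fmt"
	"path/filepath"
)

// Event is a trimmed-down view of one runtime event: some process (Comm)
// opened the file at Path for writing.
type Event struct {
	Kind string // e.g. "file_open_write"
	Path string
	Comm string
}

// Finding is what a signature emits when it matches.
type Finding struct {
	Rule   string
	Detail string
}

// goModImmutable encodes an assumption we could never make in production but
// can make in a Go build pipeline: go.mod changes belong in a reviewed pull
// request, never in the build itself.
func goModImmutable(e Event) *Finding {
	if e.Kind != "file_open_write" || filepath.Base(e.Path) != "go.mod" {
		return nil
	}
	return &Finding{
		Rule:   "go.mod modified during build",
		Detail: fmt.Sprintf("process %q opened %s for writing", e.Comm, e.Path),
	}
}

func main() {
	events := []Event{
		{Kind: "file_open_write", Path: "/home/runner/work/app/app/out.bin", Comm: "go"},
		{Kind: "file_open_write", Path: "/home/runner/work/app/app/go.mod", Comm: "sed"},
	}
	for _, e := range events {
		if f := goModImmutable(e); f != nil {
			fmt.Printf("FINDING: %s (%s)\n", f.Rule, f.Detail)
		}
	}
}
```

The point is not the code itself: the rule is only safe to enforce because we know the pipeline's toolchain ahead of time, which is exactly the assumption we could not make in production.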
So then we thought, why don't we do the inverse? Instead of looking for the bad stuff, let's just define what the good, normal behavior of your pipeline is, and enforce it. In other words, we started to do profiling with Tracee. This was the second incarnation of this project, and we introduced a profiling feature to Tracee. You still introduce it the same way to your pipeline, and it builds the profile automatically for you. So we ditched the signatures and built a profile of every executable that we encounter during the run. It's supposed to represent, not quite the composition, but what your pipeline is made of, how it normally behaves. I sometimes think about it as something like an SBOM for the runtime; maybe that's another talk. But we generate a profile of everything that was executed during the build. You review it, and you accept it as the baseline. The next time it runs, if we see something else, something we didn't expect to see, we let you know.

Another lesson learned here is that building a profile is not that easy, especially when we consider executions and things like that, because there's a lot of very volatile information involved that you know is going to change. For example, we're dealing with executions, so there's probably a process ID recorded here and there in the profile, and you know it's going to change the next time you run. So it's a very difficult thing to balance between collecting enough information to make the profile meaningful, but not so much information that it becomes annoying. We had to go through some iterations to fine-tune what we do and don't include in the profile, and we removed a lot of things from the profile to make it stable. We figured we could compensate for the information we removed from the profile with signatures that would balance that out.

And that brings me to the third incarnation of this project, the current one, where we basically take the best of both worlds. You still introduce Tracee to the pipeline the same way: there's a GitHub Action, you add it to the pipeline, and it runs in the background. It looks for any of the suspicious behavioral patterns using the built-in signatures, which have been greatly expanded since then, but it's still the same idea. You can also write your own signatures on top of that, and with those you can find specific bad behaviors that you want to look for. That's a very good tool. But at the same time, we also build a profile, and we expanded the profile to include more than executables: we also look at file modifications and network activity that happen during the pipeline. The profile lets you represent the normal behavior. So it's kind of an allow-list and deny-list approach: the signatures let you declare the bad stuff that you're looking for, and the profile lets you declare the good stuff that you want to enforce. When you want to introduce a new security control, you can pick whichever fits best. So that's the current version.

Before we dive into it, let's see what it looks like. All right. So we have a Go project here. You see a main.go and a go.mod, and we have a pipeline in GitHub that builds it. You see the normal actions: go mod verify, go test, go build. And you can see the Tracee start and stop actions; that's Tracee. You can also see a suspicious action here, fake upload. Right now it's good; it will turn rogue later on. So far so good. Let's push this project to GitHub. This is the first commit, so it's the first time we are running this workflow in GitHub. The workflow runs, and we soon see that it fails. The reason it failed is that Tracee failed it: it's the first time you've run it, so there is no known profile yet, and you need to acknowledge one. So Tracee created a new pull request. This is the pull request that asks you to review and commit the profile. You can see here all of the things that we observed happening during the pipeline: the DNS profile, the execution profile, and files being modified, three files introduced. We review it, everything looks good, and we merge this pull request. Again, this is not the pull request that we made; this is the pull request that Tracee made. Now we need to go back to our own pull request and update it, because now there's a profile. So we go back to the code, update main, check out our branch, and merge it with main. Basically, we update our pull request with the newly built profile. Back in GitHub, this is the same pull request that we had in the beginning. Now the pipeline is rerun, and it passes this time, because nothing changed between the profile it knows and the current execution. That means we can merge it safely; nothing looks wrong here.

All right. Some time passes, and someone working on this project wants to introduce a new feature. They update main and create a branch; let's call it new-feature. So we have a new branch. It doesn't really matter.
It's going to be an empty commit; we just want to re-trigger the workflow, so it doesn't matter what change we make. What we want to demonstrate here is that one of the GitHub Actions that we used might also have been updated during this time, like what happened in the Codecov breach that we discussed earlier. So let's say now we are the maintainer of that action, and we push a new version of it using the same tag. That means our pipeline will now use this new code automatically. We go back to our feature and create a new pull request to introduce our cool new feature. This new pull request triggers the workflow, and the workflow uses the fake uploader action, which has now been updated. And you can see that this time, unlike the previous time, this pull request failed. The pipeline failed. We also see a comment here from Tracee saying that it saw something suspicious happening during the pipeline: specifically, a miner domain, something that looks like a cryptominer in the pipeline. We have all the raw data here. And a new pull request was created to show us what changed. What exactly changed? We see that a new process was executed; we see the SHA for it, and we see the arguments. We also see that a file was modified: main.go was modified during the pipeline. It didn't happen before, and it shouldn't happen. So we use the pull request here as a kind of user interface to show you what is different between the last run and this one. That's what it looks like.

Now let's talk a little more in depth about what we saw and what we had to do to make it work. So there's the workflow, and you already saw that the way we introduce Tracee is with a GitHub Action, actually a pair of actions: one to start the trace and one to stop it. Everything in between is captured and monitored.

The first challenge that we faced, and here I'm trying to recap the past couple of years of this project, was that running eBPF, at least a couple of years ago, was not that trivial, especially in a remote or managed service. In the beginning, Tracee compiled the eBPF code on the node. This was the common practice, and it still is in some cases: compiling the eBPF code on the node when it runs. That means Tracee carried a very heavyweight toolchain in its own container just to perform the compilation, but also that it needed some dependencies, like kernel headers, depending on the machine you're running on. It's a complicated process, and error prone as well, because it depends on an external environment that is not up to us. Does the machine have kernel headers available? Are they in the path that we expect them to be in? We're not always sure. So one of the very significant changes that we made was to make the eBPF portion of Tracee compile once, run everywhere (CO-RE), as it's called in the eBPF world. That basically means making it portable, in simple words. The words are simple, but it's very hard to achieve in the code. To make it happen, there are again some dependencies on the machine; not every Linux kernel supports this in the same way, and so on. We had to create a separate project, called btfhub, to solve the problem of eBPF portability. But anyway, long story short, it works now, and Tracee is one of the first projects to do something like that. So it's now very easy to run Tracee anywhere, including in your pipeline.
Another thing that we had to implement here, and that we learned the hard way, was that when we are tracing things, especially for security purposes, you need to be very certain that you capture every single event, every single bit. You can't afford to lose any information. The way this orchestration of Tracee in the pipeline worked in the beginning, we were losing some events: Tracee was starting, the pipeline was starting, and there was a window of opportunity there to lose events. What we had to do was make Tracee block the pipeline until it finished initializing. We had to introduce this, again not so trivial, feature to Tracee, in order to block the pipeline until Tracee says: I'm finished, I'm ready for tracing. Only then does the pipeline start running. So there's a minimal delay here, but because we switched from compiling the eBPF code on the node to compile once, run everywhere, Tracee boots much, much faster than it used to. So these two go hand in hand.
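As a rough illustration, here is a minimal Go sketch of that kind of readiness gate. The tracer binary name, its flags, and the "ready" line on stdout are all hypothetical; the real Tracee action signals readiness differently in its details, but the blocking pattern is the same idea.

```go
// A minimal sketch of a readiness gate: start the tracer in the background,
// but block the pipeline step until the tracer reports that it is fully
// initialized. The "tracer" binary and its "ready" line are hypothetical
// stand-ins for however Tracee signals readiness.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os/exec"
	"strings"
	"time"
)

func main() {
	// Hypothetical tracer CLI; stands in for how the start action launches
	// the tracer in the background.
	cmd := exec.Command("tracer", "--output", "/tmp/trace.json")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	// Watch the tracer's output for its (assumed) readiness line.
	ready := make(chan struct{})
	go func() {
		sc := bufio.NewScanner(stdout)
		for sc.Scan() {
			if strings.Contains(sc.Text(), "ready") {
				close(ready)
				return
			}
		}
	}()

	// Block the pipeline step until the tracer is fully initialized, so no
	// early events are lost; fail closed if it never becomes ready.
	select {
	case <-ready:
		fmt.Println("tracer initialized; build steps may start")
	case <-time.After(30 * time.Second):
		log.Fatal("tracer did not become ready in time")
	}
	// ...build steps run here; a matching stop step would terminate the
	// tracer later and collect the results.
}
```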
What else? Another thing we needed to consider is that GitHub basically gives you a VM to run your code, and inside this VM the workflow itself is running. To avoid noise from the machine itself (this is Linux, a lot of things are happening at any given time; a system update might start at some point and we would see it, because Tracee is running in the kernel), we want to focus Tracee on only what's interesting for us. For that, we have a filtering mechanism in Tracee. It's an old feature; you can tell it exactly what to filter on. In this case, we identified the way GitHub sets up the runner, and we saw that there's one process such that if we trace its process tree, that process and everything underneath it, we are basically tracing only the pipeline. This was not too hard to do, but considering that there are other areas of the machine that we needed to look into as well, we started to face some problems. For example, some things run as Docker containers. These containers will not be under the process tree that is the workflow itself: the workflow starts the container, but the container process itself appears under the Docker daemon or something like that. So we need visibility into that as well, right? But if we filtered to see only the workflow's process tree, that would conflict. Another thing is that sometimes we want visibility into the host itself in general, because remember, we have the generic signatures that we brought from Tracee's production use case, so why not use them to look at the host as well?

So we now have a challenge: we want to trace different scopes for different purposes. For the host, I want to see these kinds of events. For the workflow, I want to see another set of events, only the things I need to build a profile, for example. For Docker, I want to see other things that help me detect Docker-specific suspicious behavior. This was something that we dragged along for a while, but recently we finally launched a very significant feature in Tracee that we call multi-scopes, which basically allows us to create scopes: a scope is a set of filters that is independent of the others. So we have, for example, scope number one; this is the syntax as it is in Tracee. You tell Tracee to trace, and we create scope number one, which is just the list of all the signatures that we want to observe on the host. Fileless execution is one signature; hidden file created is another. These are the events you want Tracee to trace, and we don't apply any special filter here, except to say: this is one scope. Then there's another scope that says we want to look for file modification events. This is a kind of event that Tracee emits, but in scope number two, where the file modification event lives, we also want to limit file modifications to only the GitHub workspace directory, basically where GitHub checks out your code. Otherwise, we would just see all the files changing on the host, which is not going to help anyone. The third scope basically says: let's look at the GitHub runner tree and the Docker tree, and in those, we want to see executions and network activity in order to build the profile. I will show you more about this in a second. But the point here is that scoping the trace into different use cases, and for every use case tracing only the relevant events for that scope, was a very, very critical thing to do.

All right. Let's look a little deeper into what Tracee can tell you about every category. Executions is one category. In the executions category, there is a bunch of signatures that you get just by the nature of Tracee being there, looking for suspicious things that might happen on the host, like suspicious execution patterns. Let's pick just one, for example, code injection: some process is trying to inject code into another running process, or LD_PRELOAD, someone is messing with the dynamic linker, or something like that. This you get "for free", in quotes, just from Tracee being there.

In addition, Tracee builds an execution profile of what happened during the build. What is in this profile? First of all, binary paths: which binary was run. The binary hash, which is very important for knowing whether this process called ls is the same as another process called ls. The user who ran it. Those were there for ages and kind of go without saying. But then we learned another lesson: we were missing things by not including the arguments of the process. Just to give an example, let's say you have in your pipeline a curl to codecov.com, and then someone manages to change it to a curl to mybadminer.com. It's the same curl, the same hash. If you only trace executables, nothing changed, right? All good. But no: just by changing an argument, I dramatically changed the behavior of the pipeline. So we needed to include that information as well. The problem was that process arguments include a lot of volatile information. For example, when you do a git clone or something like that, GitHub creates a temporary directory, that directory name changes every time you run the pipeline, and it is passed as an argument. A lot of volatile information that would just pollute the profile. To solve this, we had to introduce another feature, an ignore system, that basically lets you say: these kinds of things, I know they will happen, I want to ignore them.
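Here is a sketch of what one execution-profile entry could hold, and of how an ignore system might normalize volatile argument values before they enter the profile. The field names and the regular expressions are illustrative assumptions, not Tracee's actual profile format.

```go
// Sketch of one execution-profile entry, and of normalizing volatile
// argument values before they enter the profile. Field names and the ignore
// patterns are illustrative, not Tracee's actual profile format.
package main

import (
	"fmt"
	"regexp"
)

// ExecEntry is one record in a hypothetical execution profile.
type ExecEntry struct {
	Path   string   // resolved absolute path of the binary
	SHA256 string   // hash: proves this "git" is the same "git" as last run
	User   string   // who ran it
	Args   []string // arguments, normalized so they stay stable across runs
}

// ignorePatterns capture values we know differ on every run: per-run temp
// directories created for the checkout, embedded process IDs, and so on.
var ignorePatterns = []*regexp.Regexp{
	regexp.MustCompile(`/tmp/[A-Za-z0-9._-]+`), // per-run temp directories
	regexp.MustCompile(`\bpid=\d+\b`),          // embedded process IDs
}

// normalize replaces volatile substrings with a stable placeholder, so the
// profile stays meaningful without becoming noisy.
func normalize(arg string) string {
	for _, re := range ignorePatterns {
		arg = re.ReplaceAllString(arg, "<IGNORED>")
	}
	return arg
}

func main() {
	entry := ExecEntry{Path: "/usr/bin/git", SHA256: "3b5c19...", User: "runner"}
	for _, a := range []string{"clone", "--depth=1", "/tmp/tmp.X9fKq2/checkout"} {
		entry.Args = append(entry.Args, normalize(a))
	}
	fmt.Printf("%+v\n", entry)
	// {Path:/usr/bin/git SHA256:3b5c19... User:runner Args:[clone --depth=1 <IGNORED>/checkout]}
}
```

This is the balance from before in miniature: enough detail to catch a swapped argument, with the per-run noise stripped out so the profile stays stable.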
Another kind of blind spot was the environment variables that every process has access to. We wanted to include those in the profile as well, but that created another problem: environment variables usually contain secrets, and if we included them in the profile, we would basically be committing your secrets to the source code, not something that you want to do. Now, you have the ignore system, so you could ignore those kinds of environment variables and say: anything called GITHUB_TOKEN, just don't include it in the profile. That would solve the problem. But just as a precaution, we decided to make environment variables an opt-in feature. If you want, you can enable it; we will include environment variables in your profile, and then you're encouraged to filter the secrets out.

Another interesting question is: how do we even detect an execution? How do we know that something was executed? The obvious, naive thought would be to look at the system call that invokes new executables. It's called execve, and it's very commonly used in tracing tools. But we found that it was not good for our use case, for a few reasons; I would say that tracing system calls in general was a bit problematic, for a number of reasons. First of all, we need to understand that a system call is not really like invoking a function. It's more like the user requesting the system to do something; it's not necessarily what will end up being invoked. And once we know that, there are cases where the user might request to do something, and by the time the system actually gets to do it, the user has changed the request. So we traced X, but the system invoked Y. That's a kind of attack called time-of-check-to-time-of-use (TOCTOU), and as a security tool, we just didn't want to be in that position. Another problem with system calls and execve is that the user might pass arguments that are high level, because this is an interface from the user to the system. Say I'm telling the system to invoke this binary and I'm giving it a path. That path may be relative to some other directory. That path might be a file descriptor that I obtained earlier. That path might be a symlink that the system needs to resolve. So if we were just tracing execve calls, we might be seeing information that is meaningless to us. We would see, for example, that file descriptor five was executed. What does that mean? No one knows, unless they have access to the entire capture of that trace. Our solution was to switch to another event that Tracee produces, called sched_process_exec. It's an internal tracepoint in Linux, and it solves all of those problems. It's not vulnerable to TOCTOU, and it gives us the real path, as we call it, the resolved, absolute path to the file on disk. It also gives us the hash and more information that we include on that event. So that's another lesson that we took.
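To illustrate the difference, here is a small Go sketch contrasting the two views of the same execution. Both event shapes are hypothetical simplifications: ExecveArgs models the raw system-call arguments, and ResolvedExec models what a sched_process_exec-based event can give you after the kernel has resolved everything.

```go
// Sketch of why executions are recorded from a resolved exec event rather
// than from raw execve arguments. Both event shapes are hypothetical
// simplifications of what a tracer might emit.
package main

import "fmt"

// ExecveArgs is what the *user* asked for: possibly a relative path, a
// symlink, or a file-descriptor reference like "/dev/fd/5". It can also
// change between the check and the actual exec (TOCTOU).
type ExecveArgs struct {
	Pathname string
	Argv     []string
}

// ResolvedExec models an event taken at the sched_process_exec tracepoint,
// after the kernel has committed to what actually runs: an absolute on-disk
// path plus a hash of the binary that was executed.
type ResolvedExec struct {
	RealPath string
	SHA256   string
	Argv     []string
}

func main() {
	requested := ExecveArgs{Pathname: "/dev/fd/5", Argv: []string{"sh"}}
	executed := ResolvedExec{
		RealPath: "/home/runner/work/app/app/build.sh",
		SHA256:   "9a1f04...",
		Argv:     []string{"sh"},
	}

	// The syscall view is ambiguous without the whole capture...
	fmt.Printf("user requested: %q\n", requested.Pathname)
	// ...the resolved view stands on its own and is what goes in the profile.
	fmt.Printf("kernel executed: %q (sha256 %s)\n", executed.RealPath, executed.SHA256)
}
```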
All right, let's move on to another category of things that we include: files being modified. First of all, same as before, there's a bunch of signatures that look for suspicious file access patterns. For example, someone changed the sudoers file on the system. That shouldn't happen, definitely not during a build. This is the kind of thing you get, again "for free", just from Tracee being there. The profile also includes files being modified, and as I mentioned before, we want to limit this to only the GitHub workspace directory. We don't want every file that was touched; we just want the source code files that were touched. How do we know what is source code? We just say: everything in the GitHub workspace directory. Here too, the trigger was a little bit tricky. If you want to know when a file has been written to, the intuitive approach would be: okay, there's a system call for that, it's called write. But if you trace the write system call on your system, it's totally unmanageable, because there are so many writes happening at any given point, especially since on Linux everything is a file. Basically impossible to deal with. It created a bit of a challenge. Our solution was, instead of tracing the actual writes to files, to trace when a file has been opened for writing. This is an action the program has to take: if you want to write to a file, you technically need to open it first, and you need to pass the write flag. That's what we capture instead. It's a nice trick, and we use it again later on: trace the intent to do something, not necessarily the something itself, because it's a lot more manageable.

Another piece of this puzzle is network activity, which Tracee also covers, and again there are some signatures there. By the way, network is a relatively new thing in Tracee. We have a very robust network tracing capability that also includes protocol parsing. This is fairly unique in the tracing world: you don't just trace the accept system call, for example, and then get gibberish. You can trace HTTP calls, for example, and know that this process did an HTTP GET to this IP, et cetera. What we would like to do here is to know which web services my pipeline interacted with. The problem is that there is no concept of a "service" in the network world. It's a very high-level concept, but at the low network level there is no such thing; there is just TCP to an IP. And we can't really deal with IPs either, because IPs are also very dynamic; they change for perfectly good reasons. Touching back on the lesson from before: instead of tracing the actual thing, sometimes it's easier to trace the intent. In this case, if I want to communicate with another service, before I do that, I need to do a DNS resolution. This is like the file open from before. If my pipeline communicated with an external service using a domain name, we would catch that, because there would be a domain resolution, and we are catching domain resolutions; we give you a list of all the domains. If it didn't use a domain name, then that is something we consider suspicious behavior, like contacting a bare IP, and there's a signature for that. There are also other network-related signatures: someone opened a reverse shell, trying to get an outbound connection that tunnels your shell to an external endpoint, and things like that. There are a number of signatures; you can review them later in Tracee. But network activity is another section that was added to the GitHub action of Tracee.
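Here is one last Go sketch that applies the "trace the intent, not the act" lesson twice: keep only file opens that declare write intent inside the workspace, and record DNS resolutions as the list of services contacted. The event shapes, field names, and workspace path are illustrative assumptions.

```go
// Sketch of "trace the intent, not the act", applied twice: record files
// opened *for writing* under the workspace (instead of every write syscall),
// and record *DNS resolutions* (instead of raw TCP connections to IPs).
// Event shapes and the workspace path are illustrative, not Tracee's schema.
package main

import (
	"fmt"
	"os"
	"strings"
)

const workspace = "/home/runner/work/app/app" // where GitHub checks out the code

// FileOpen models a file-open event with the flags seen at the open call.
type FileOpen struct {
	Path  string
	Flags int
}

// interestingOpen keeps only opens that declare the intent to write, and only
// inside the checked-out source tree; everything else is host noise.
func interestingOpen(e FileOpen) bool {
	writable := e.Flags&(os.O_WRONLY|os.O_RDWR) != 0
	return writable && strings.HasPrefix(e.Path, workspace+"/")
}

func main() {
	opens := []FileOpen{
		{Path: "/var/log/syslog", Flags: os.O_WRONLY},    // host noise: dropped
		{Path: workspace + "/main.go", Flags: os.O_RDWR}, // source tree: kept
	}
	for _, o := range opens {
		if interestingOpen(o) {
			fmt.Println("file modified in workspace:", o.Path)
		}
	}

	// For the network, the DNS resolution *is* the intent to talk to a
	// service. A connection to a bare IP with no prior resolution is handled
	// separately, by a signature, because that by itself looks suspicious.
	resolved := []string{"proxy.golang.org", "codecov.com", "proxy.golang.org"}
	domains := map[string]bool{}
	for _, d := range resolved {
		domains[d] = true
	}
	fmt.Println("domains in profile:", len(domains)) // deduplicated set of services
}
```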
Now, after we understood all that, there was another kind of aha moment, one that I don't think we've fully understood yet, but I want to share it even at this raw point: we have a lot of good information at hand. We have an execution profile, we have network activity, we have files being modified during the pipeline. A lot of good information. We use it in Tracee to tell you if something doesn't look right in your pipeline, but maybe you can use it for other purposes as well. Specifically, there is the SLSA specification, which defines provenance as the verifiable information about software artifacts describing where, when, and how something was produced. We have a lot to add about how something was produced; we know exactly how it was produced. Can we use this information in that context? And then we saw, not too long ago actually, that there is a proposal to create an attestation format called a runtime trace, which should complement the SLSA provenance predicate. Quoting from that spec: the runtime trace can prove that the build was invoked via a script, that the build was executed in a hermetic environment with no access, and so on. This is exactly the kind of information that we already collect. So definitely, when this thing matures a little more, we will emit the information in this format as well, so that you can use it for other purposes, like complementing the SLSA attestation.

All right. So we're at the end. This is a recap of the lessons learned. Runtime is not build time; we learned something there. We were able to increase the coverage by using signatures as deny-type controls and the profile as an allow-type control. We looked at the blind spots of tracing tools and how we overcame them with specific features of Tracee. We talked about system call tracing as opposed to other approaches, what the triggers are, et cetera. And this is where you can look at all of this: Tracee on GitHub, or the Tracee action on GitHub. I'm here if you have any more questions. Thank you very much, and enjoy the conference.