And how to build a compromise-resilient CI/CD pipeline. Trishank Kuppusamy of Datadog will show us how they developed the industry's first end-to-end verified pipeline that automatically builds integrations for the Datadog Agent. Let's look. Hi, hi everyone. I'm Trishank Karthik Kuppusamy, I'm a Staff Security Engineer at Datadog, where I work on things like how to build a compromise-resilient CI/CD, which hopefully will make sense in a few minutes. So let me set some background, some context, some motivation for this. Why are we talking about this? So Datadog, as some of you might know, is an end-to-end observability platform for your organization. It spans all the way from your network and infrastructure, through your applications and services, ultimately to your end users. And so when something goes wrong, let's say there's a latency problem, for example, your DevOps, your security, and your business teams can all use the same tool to diagnose exactly what went wrong and fix it on the spot. The Agent is the piece of software that you typically install in your containers and hosts to gather metrics about your services. And you can think of integrations as plugins or add-ons that add superpowers to the Agent, as it were, letting it observe even more. So for example, one of our favorite integrations is GitLab, and we're hoping you're using both today. What happened was that a year or two ago, we had this problem where we wanted to decouple the integrations, to release the integrations separately from the Agent, because the way we typically released them was every six weeks, together with the Agent. Now, as you can imagine, this proved problematic. For example, what if we wanted to beta test new versions of integrations with our customers? We couldn't do that so easily. So we wanted to decouple them.
And of course, the state of the art for doing this is to use something called CI/CD, one of the most famous examples being GitLab, which has done a wonderful job with pipelines and so on. As I like to joke, you now have this robot sitting in a cloud in the sky. It's got access to your code-signing keys. Every time your developers check in code, this robot takes out a key, builds the package, signs it, puts back the key, and releases the software. Wonderful. So you have on-demand releases, you have DevOps now, basically, which is great. And there are good security reasons for doing this, too: you'd rather have a single point of auditability and logging, where you know who's using the key and when (the robot, for example), as opposed to a distributed team of developers. So great. That's the state of the art, CI/CD, which is very good. Now let's talk about what can go wrong, because this is a security talk after all. You have to be pessimistic here. In good times, 99.9999% (or whatever the percentage) of the time, everything's good. Nothing goes wrong. Life is good. The problem is when any of these pieces gets compromised. So for example, imagine someone steals a developer signing key, perhaps one used to sign Git commits. Or your GitLab repository gets compromised, so someone tampers with the source code. Or let's say your GitLab pipeline gets compromised, the runners, for example. Or the container registry used to pull the images that run the GitLab jobs. Or your key or file servers get compromised. I think you get the idea; I'm sort of belaboring the point here. The point is that there are many pieces that can go wrong. And the thing is that it's sort of a gray swan, not even a black swan problem. A black swan is an unknown unknown. This is a known unknown. It's not a question of if you will get compromised, it's a question of when, and you want to be prepared for it.
Because like I said, 99.99% of the time nothing goes wrong and life is good, but when that rare event happens, the outcome is negative infinity. So the state of the art is that we have DevOps, which is great, but we're missing DevSecOps here. Let's talk about how to fix that. Let me try to convince you this is not merely some sort of Hollywood bad-hacker sci-fi movie scenario. These are problems that have actually happened in the real world. Let's take a look at three horror stories. One, as you might recall, is the Flame malware from 2012, not too long ago. What happened was that someone, many suspect a nation-state attacker, pretended to be Microsoft. They wanted to get to Iran, you see, to dismantle their nuclear program. And what they did was break, unfortunately, a poor hashing algorithm called MD5 that was used back then. It had known theoretical weaknesses, but these attackers, whoever they were, found a novel cryptographic attack on MD5 that even academics had not seen before. And so they produced a fake certificate that looked like it came from Microsoft and propagated these updates all the way to Iran, where they caused the centrifuges to very subtly mess up every once in a while. So this was a very sophisticated attacker. And depending on who you are, this was a good or a bad thing, right? We're certainly not playing with amateurs here. Scary story number two, more recently: CCleaner, a popular Windows cleanup tool. I remember using it as a young man myself, many years ago. While they were busy being acquired by Avast, the security company famous for making antivirus software, while this acquisition was happening, unbeknownst to both of them, someone had compromised their build infrastructure and used it to sneak malicious code into CCleaner.
Millions of downloads, but the attackers were interested in 11 or so particular companies, one of whom seems to have been ASUS, which leads to scary story number three: ASUS later got its own software updates backdoored. And the suspicion is that the attackers got to ASUS by poisoning the well, the CCleaner well, and then made their way through ASUS. So we're not playing with amateurs. We're playing with very serious professionals, nation-state attackers even, who are interested in breaking into your CI/CD pipelines, something you should think about very seriously. And so, as a response, some might say, well, why not just use GPG or TLS? Wouldn't that solve the problem? No. Remember, we're talking about nation-state-level attackers. They're not going to be fazed by a single key, used to sign everything, kept on your infrastructure. That's the equivalent of keeping the keys to your house under the carpet. It's not going to buy you the security that we're looking for here, unfortunately. The property that we want is something that we call compromise resilience. What does that mean? Well, imagine you're a medieval king and someone told you: hey, we know that people breaking into fortresses is a problem. I hear it's a very common problem these days. But don't worry, we've got the solution: we'll build you an impenetrable fortress. As a king, you should be very suspicious. These have got to be snake-oil salesmen. There's no such thing; either it doesn't exist, or it'll be prohibitively expensive. The point is that you cannot prevent a compromise: if it's not outright impossible, it's prohibitively expensive. What you can do is mitigate the impact of a compromise. So you build defense in depth. You have multiple layers, so that attackers face, for example, moats with alligators and flamethrowers and whatnot. You get the idea. The point is, attackers look at it from far away and say: you know what, I've got better things to do with my time.
I'm going to move on to another target. So that's basically compromise resilience in a nutshell. And how do we do this? Well, we're going to propose using two pieces of technology, called in-toto and TUF, that I claim get you this property. There's this joke. I don't know how many of you have seen this vitamin water ad that said: it's 4 AM, do you know where your vitamins are? I've twisted this joke for my own nefarious purposes to say: it's 4 AM, do you know where your CI/CD is? All jokes aside, imagine someone pages you at 4 AM and says: you know what, it looks like the pipeline released something malicious. Do you know what happened? Let me tell you a story. I used to be a developer at a software company, one of the very few that had the privilege of being audited by the FDA. Do you know why? It's because we made software for big pharma to run clinical trials. As you can imagine, very serious business. So we needed to have the ability for auditors to walk in, say, randomly, ten years after we released a version of the software, and say: show me the output of your unit and integration tests that prove to me that your software did what you claimed it did back then. And we want the same auditability, the same compliance, here. For some of you, this may even be a compliance requirement. So you can think of a CI/CD like this: x is the source code that a developer produces, and your CI/CD is basically applying a function f to x, your source code, producing a package y. And what you want is the property that when you download a package, before you install it, you can ask: does y = f(x)? Is x the correct source code? Is y the correct packaging of that source code into the package I'm looking at right now? That's the basic idea. And how do you do this?
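As a rough illustration of that y = f(x) check, here is a minimal Python sketch. The `build` function and the digest comparison are hypothetical stand-ins for a real pipeline and its attested metadata, not anything Datadog-specific:

```python
import hashlib

def digest(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of some bytes."""
    return hashlib.sha256(data).hexdigest()

def build(source: bytes) -> bytes:
    """Hypothetical stand-in for the pipeline's build function f."""
    return b"package built from: " + source

source = b"print('hello')"   # x: the source code the developer wrote
package = build(source)      # y = f(x): the package the pipeline produced

# Before installing, the client checks that y really is f(x) by comparing
# the downloaded package against an attested digest of f(x).
attested_digest = digest(build(source))
assert digest(package) == attested_digest

# A tampered package fails the same check.
tampered = package + b" plus malware"
assert digest(tampered) != attested_digest
```

The real systems described below replace the bare digest with signed metadata, but the shape of the check is the same.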
Well, one piece of the puzzle is in-toto, which secures the distribution of your source code and its packaging in your pipeline, all the way between your developers and your CI/CD. And TUF is the tool that solves the other half of the problem, which is securing the distribution of your packages from your package repository to your end users. Put the two together, and I claim you get this property called end-to-end security, which lets you detect attacks anywhere between your developers and your end users. So let's take a look at how that works. Unfortunately, I don't have time to go into all of the gory technical details, but here's what you need to know. We know that the problem exists. That's the bad news. The good news is that there are tools that you can use today to fix the problem. They're called in-toto and TUF, and that's all you need to know: in-toto and TUF get you compromise resilience. Okay, so let's talk about the first piece of the puzzle, in-toto, which, remember, solves the problem of detecting attacks anywhere between developers and CI/CD. A nice metaphor for the problem that in-toto solves is the game of telephone that kids like to play. Imagine transmitting source code to your CI/CD and all the way to your end users, sort of like playing that game of telephone. It used to be fashionable; I don't know whether kids play it anymore. But imagine that attackers got in between these kids and tampered with the messages. That's sort of the problem in-toto addresses: it tries to guarantee the authenticity and integrity of these messages while they're being passed along. So, for example, the first kid here says "peace", and as the message is passed from kid to kid, it slowly picks up sources of noise, which you can think of as attacks. And unfortunately "fleas" gets delivered to your end users, which is a bad outcome when you wanted to deliver peace. So let's try to fix that problem. How do we do this? Well, the basic idea is to define the software supply chain.
You remember the y = f(x) problem: this is basically what you're doing. You're saying, this is my f(x). Here's what my f(x) might look like. So Alice, who's the administrator, may say: my supply chain looks like this. It consists of two steps. Diana is allowed to write source code, and she produces a signed attestation saying: I promise I produced foo.py with this hash. The second step is Bob, who's allowed to package the source code that Diana wrote. And Bob says: I swear I got foo.py with this hash, and I produced a package, a tar.gz; I simply compressed the file to serve it to users, and I produced that tar.gz with this hash. Great. Now, what happens is that end users, before they install any package, first inspect it. And all of this is done transparently, with another robot basically taking care of it behind the scenes for you, using software libraries. What happens behind the scenes is that Alice can say: look, before you install the package, make sure you have signed attestations from both Diana and Bob, and of course the supply chain definition itself from Alice, and check all the rules, that y = f(x) is indeed the case. So for example, here we see that the inspection is: you untar foo.tar.gz. Make sure, first of all, that the tar.gz is correct, the one Bob produced. Then you untar the file and check that foo.py is the one that Diana produced, and that there was no attack in between. See? No one other than Diana can tamper with the source code. And that's how you get supply chain integrity. Now, let's talk about the second piece of the puzzle, which is The Update Framework, or TUF for short. That solves the problem of so-called last-mile distribution, which is the distribution of these built packages to your end users. And the metaphor for the problem that TUF solves, you can roughly think of it like this.
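Before we get to TUF, the Diana-and-Bob chain above can be recapped in plain Python. This is only a sketch of the idea, not the real in-toto API, and it omits the signatures that real in-toto link metadata carries:

```python
import hashlib
import io
import tarfile

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Step 1: Diana writes foo.py and records an attestation of what she produced.
foo_py = b"print('peace')\n"
diana_link = {"step": "write-code", "products": {"foo.py": sha256(foo_py)}}

# Step 2: Bob packages foo.py into foo.tar.gz and records both what he
# consumed (materials) and what he produced (products).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo("foo.py")
    info.size = len(foo_py)
    tar.addfile(info, io.BytesIO(foo_py))
tarball = buf.getvalue()
bob_link = {
    "step": "package",
    "materials": {"foo.py": sha256(foo_py)},
    "products": {"foo.tar.gz": sha256(tarball)},
}

# Verification, per Alice's layout: Bob must have packaged exactly what
# Diana produced, and the tarball the user downloads must be exactly
# what Bob produced.
assert bob_link["materials"]["foo.py"] == diana_link["products"]["foo.py"]
assert sha256(tarball) == bob_link["products"]["foo.tar.gz"]
```

If an attacker swaps the source code between the two steps, the first assertion fails; if they swap the tarball after Bob, the second one does.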
So I don't know if you remember, but back in the 80s or 90s, there was a terrible, terrible attack on pharmacies, where someone malicious, obviously, went around and deliberately tampered with people's medical drugs. Unfortunately, a few people were poisoned and, I believe, some actually died. So you can think of it this way: in-toto is the thing that tells you, hey, my software has a list of ingredients. Diana produced this ingredient with this dose, and Bob produced that ingredient with that dose, and the whole drug is composed like this: this dose and that dose, mixed together, put in a nice little pill. TUF is the layer that says: why should you trust this drug in the first place? It's like the seal of freshness or authenticity, the hologram sticker that says: yep, FDA approved, good to go, nothing has been tampered with. That's the rough idea. I don't have time to go into the details, but TUF basically uses design principles that, believe me, a grandma would have told you as a kid. I don't know about your grandma, but my grandma told me: if you ever need to launch a nuclear weapon someday, son, make sure you use the two-man rule. Which is actually what the US military does: you have physical separation, where not only are two keys required, but the same person cannot turn both keys. You literally need two different people. The same idea applies here; we have threshold signatures. We also use other design principles grandma told me. Don't put all your eggs in one basket: that's why you see multiple eggs all over the place here. Grandma also said: make sure you use cryptographic agility. And that's the story of the Hydra. Remember the Flame malware attack? That happened because, unfortunately, a single weak hashing algorithm, MD5, was used. In TUF, we use multiple hashing algorithms with different designs.
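That cryptographic-agility principle can be sketched in a few lines of Python: record each file's digest under two independently designed hash families and require all of them to match. The choice of SHA-256 and SHA3-256 here is illustrative:

```python
import hashlib

# Two independently designed hash families; break one and the other
# still protects the file.
ALGORITHMS = ("sha256", "sha3_256")

def multi_digest(data: bytes) -> dict:
    """Record the file's digest under every algorithm."""
    return {alg: hashlib.new(alg, data).hexdigest() for alg in ALGORITHMS}

def verify(data: bytes, recorded: dict) -> bool:
    """All recorded digests must match; forging a file means breaking
    every recorded algorithm at once."""
    return all(
        hashlib.new(alg, data).hexdigest() == d
        for alg, d in recorded.items()
    )

recorded = multi_digest(b"wheel contents")
assert verify(b"wheel contents", recorded)
assert not verify(b"tampered contents", recorded)
```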
SHA-2 and SHA-3, for example. So unless you can break them both, you're not going to be able to break the security of the system. Anyway, you put together all these nice design principles that grandma told you about; and we actually designed this in collaboration with Tor, who obviously wanted to protect their software updates from people as powerful as nation-state attackers. Okay, enough theory. Let's talk about practice. How do we actually apply TUF and in-toto in practice for the Datadog Agent integrations? Remember, that was the original problem we were trying to solve. And I claimed that, putting the two together, you get end-to-end security. Okay, so here's what our software supply chain looks like. Earlier we looked at a demo; let's take a look at a real-life example. Our software supply chain has three steps. The first one is called tag, as in Git tag, where developers use GPG keys sitting on YubiKeys, on trusted hardware, so you can't even export the keys. They sign an attestation saying: yep, I produced this source code with this hash, and I'm checking it into our Git repository. Great. Now, the CI/CD is broken into two steps. The first says: I swear I got this Python source code from the Git repository, and I'm packaging it into a Python wheel, which is simply a zip file; you can think of it as a tar.gz, basically. It packages the Python source code into that zip file. And then we have a third step, called the wheel signer, which takes these packages and the in-toto metadata, puts them together under this TUF seal of freshness, authenticity, and integrity, and distributes it all in a nice coherent package to end users. And when our end users install this, they have no idea that all the TUF and in-toto verification actually happens behind the scenes. What happens behind the scenes is that we tell the Agent: hey, look, first make sure that the wheel was produced by the wheel signer. Great.
And then unzip it to make sure the source code was signed by our developers. So you can see that unless you have the developer keys, even if you break into our CI/CD, you won't be able to forge developer signatures, which is where we get the end-to-end security from. Okay, and then we add TUF. I know the picture looks complicated, but it's simpler than it looks. Basically, we're solving three problems here. What we use TUF for is to say, first of all: there are many keys in the entire system, including the developer keys. Think about it: how do you safely distribute the developer keys and the software supply chain definition itself? That's what we use TUF for. There's one root key that we distribute with the Agent, and then you can transparently rotate the keys in the rest of the system, and the end users wouldn't even know. We know, because we've actually done it several times now. So what we use TUF for is to distribute the software supply chain in a compromise-resilient way. Just because someone breaks into our pipeline, they won't be able to rewrite the supply chain. They won't be able to rewrite the public keys used to verify the supply chain. And even though there are machines (the things colored in red here) signing some things, they're not signing the crucial bits. They won't be able to mess with developer signatures without being caught. That's all you need to know; that's the level of detail you need to care about. If you're interested in more gory technical details, please visit this link here. Okay, so what does all that gobbledygook get us? Let's see. Well, when nothing has gone wrong, clearly nothing changes. But I think it's important to also point out that our end users actually don't see any difference with or without TUF and in-toto, which is to say that we add very little performance or network overhead to the system.
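The root-key and threshold ideas above can be sketched like this. HMAC stands in for the asymmetric signatures a real TUF deployment uses, and the key names and threshold are made up for illustration:

```python
import hashlib
import hmac

# Hypothetical root keys. Real TUF uses asymmetric keys; HMAC secrets
# stand in here so the sketch stays dependency-free.
ROOT_KEYS = {"root-1": b"secret-1", "root-2": b"secret-2", "root-3": b"secret-3"}
THRESHOLD = 2  # m-of-n: any 2 of the 3 root keys must sign

def sign(keyid: str, metadata: bytes) -> bytes:
    return hmac.new(ROOT_KEYS[keyid], metadata, hashlib.sha256).digest()

def verify_threshold(metadata: bytes, signatures: dict) -> bool:
    """Count valid signatures from distinct trusted keys."""
    valid = sum(
        1
        for keyid, sig in signatures.items()
        if keyid in ROOT_KEYS
        and hmac.compare_digest(sign(keyid, metadata), sig)
    )
    return valid >= THRESHOLD

metadata = b"new signing keys after a rotation"
sigs = {"root-1": sign("root-1", metadata), "root-3": sign("root-3", metadata)}
assert verify_threshold(metadata, sigs)

# A single compromised key cannot meet the threshold on its own.
assert not verify_threshold(metadata, {"root-2": sign("root-2", metadata)})
```

Because clients only hard-code the root of trust, everything below it can be rotated by shipping new metadata signed at this threshold, which is roughly how the key rotations mentioned above stay invisible to end users.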
Most of the complexity, the price, is in the one-time setup of the whole thing, and then the system pretty much maintains itself after that. Okay, let's see where the real beauty of the system lies. One: if our developer keys are compromised, yes, theoretically what can happen is that attackers can release malicious code. But remember the compromise-resilience property I was talking about earlier, like the medieval fortification. We believe we've set the bar so high that this is more theoretical than practical. Let me explain why. First of all, our developer signing keys are trapped on YubiKeys. We generate them on the hardware. We never export them. We are unable to export them. So unless you physically attack one of our developers (please don't do that), you won't be able to get the keys. You can't remotely run away with the keys. The second thing is that we require you to touch your YubiKey every time you do a signing operation. So even if malware is sitting there on one of our developers' laptops, saying, hey, sign this for me, our developers would say: wait a minute, what's going on? I'm not signing anything right now. This is phishing. And the third thing, which we could do but have not done for the sake of usability right now, is to require at least two different developers to sign off on the same source code. We've skipped that so developers can more easily release source code, but you can see how it could easily increase the security of the system without hurting usability too much. So in practice, our YubiKeys are very, very unlikely to get compromised. And the beauty of the system really comes through in the rest of the story. What happens when our GitHub repository (in this case, but it could be GitLab) gets compromised? Nothing. We don't lose sleep. In fact, we've seen accidental DoS attacks, where developers sign mismatching versions.
Basically, what GitLab pulls doesn't match what a developer signed, due to branches not being merged properly. So it looks like an attack, but it isn't, and it tends to be a very good test of this thing in practice. The point is that even if our Git repository gets compromised, we don't lose sleep, because the attacks won't be able to propagate: the downloader would block them, as you see here. What happens when GitLab gets compromised? Same thing. We don't lose sleep. What happens when the container image registry that GitLab uses gets compromised? Same thing. We don't lose sleep. What happens when our key servers or file servers get compromised? Same thing. You get the idea. The point is that our downloader, transparently, on behalf of our Agent, verifies the TUF and in-toto metadata, and the moment it smells the slightest suspicion of an attack, it denies installation of the package. And I should take pains to mention that, as far as we know, this is the first in the industry. We haven't seen any public discussion of any similar system. This is the first compromise-resilient, or at least the first publicly discussed compromise-resilient, CI/CD that we have seen. And note that there's no trusted hardware here except for YubiKeys. The cloud is perfectly untrusted. You don't need trusted hardware such as Intel SGX and similarly complicated trusted enclaves, which have their own security bugs these days. All you need is something like YubiKeys; in fact, you don't even strictly need them. And you get this very high level of security. So the point is, as I said earlier, the bottom line is: there is a problem, yes, which is the state of CI/CD practices. Typically, unfortunately, we don't build compromise resilience into CI/CD. That's the problem. But the good news is, there are two pieces of open-source technology, both of which are in the CNCF, which you can use today to secure your own GitLab pipelines.
And they're called in-toto and TUF. They give you end-to-end security anywhere between your developers and users. I should take some time to mention that all of this work wouldn't have been possible without some great people at Datadog, NYU, and VMware. I don't have the time to personally shout out each of you, but you know who you are. And thank you very much.