All right, so we're going to get going. Morning, everyone, and thanks for coming out here bright and early to this talk. Hopefully we'll have a little bit of fun today and try some interesting things. This talk is about getting initial access through leaked credentials, and it's done in the mindset of an attacker. We're going to go through the various ways an attacker finds and exploits credentials and how they use them, we'll look at some real-world examples across a number of different areas, and at the end we'll look at how we can counter that and defend against it. A little bit about me first. My name is Mackenzie, I'm from Aotearoa, New Zealand, but I now live in the Netherlands, and I work for a French company called GitGuardian. You can find me anywhere on my socials at @advocatemack. This presentation is going to focus a lot on what I'll keep calling secrets, so just to get everyone on the same page: what are secrets? Secrets are digital authentication credentials. Typically these are things like API keys, security certificates, or database and other credential pairs. A secret is something that gives you access to a third-party service and allows you to ingest data, encrypt data, decrypt data, all of that kind of stuff. What's important to keep in mind about secrets is that they're made to be used programmatically. They're made to be used by your applications, not by humans. But humans touch them, and that's where the problem lies. So that's what I'm talking about when I refer to secrets. How do we use secrets today? Take a look at a modern application doing all kinds of fun stuff. We may be focusing on one element, trying to do something unique with our application, but our application still needs to do lots of other things. So what do we do? We leverage different services, particularly third-party services, to help us do this quickly. The easiest one to explain is credit card processing: do you want to write your own credit card processing and deal with the financial complications and the law, or should you use Stripe? The same with authentication, like Okta, or services like Algolia. So very quickly our applications end up being a collection of these different third-party services, and all of these services need secrets to be able to communicate with each other. But then we need to host our application somewhere, host our code somewhere, test our code somewhere, and so our infrastructure becomes a collection of services as well, and these all leverage secrets too. Then once we've launched our application, we've got to monitor it, we've got sales integrations, and we're not even talking about all the microservices or independent services we've created ourselves and exposed through APIs. All of this leverages secrets. As an attacker, every single one of these logos is a potential entry point for me to gain access to something. And this is a simplified picture: your application can very quickly end up depending on thousands of these third-party services, and now you have to manage thousands of secrets. So I work for a company called GitGuardian, and each year we publish a report called the State of Secrets Sprawl.
And this is basically looking at different areas that we monitor to try and find leaked credentials. The number one place that we look at is GitHub, the largest host of source code. If you saw my other talk, the next two slides will be familiar, but then we'll get into new stuff. First I want to do a very quick demo using GitHub, and I'm going to come back to it in a minute. I have here some credentials. These are AWS access tokens, so they're really sensitive, but they're also what we call honeytokens: basically a trap for attackers. I have a public repository I've just created called DevCon (the formatting isn't quite right there), and I'm going to commit these secrets into this public repository. By the way, please never do this; it's a terrible idea. But what's going to happen? In about 20 minutes, near the end of my talk, I'm going to come back and show you how many times attackers have tried to exploit those keys just during this presentation. I've pushed them publicly to GitHub, and now a bunch of bots scanning the GitHub API (I'll show you how that's done) will try to find these credentials and abuse them. We'll talk about that more in a minute. So moving on: GitHub is a very popular place. Over a billion commits are made to GitHub every single year, and 85 million new repositories were created last year, and those are just the public ones. Lots of information, lots of source code. At GitGuardian we scanned every single one of those billion commits for secrets, and we published how many secrets we actually found. If you know the answer, don't put up your hand, and I'm seeing some familiar faces. Who here thinks we found less than a million credentials on GitHub? More than a million? Okay, more than a million but under five million? More than five million? More than 10 million? We found 10 million. 10 million secrets discovered in GitHub public repositories last year. Now you may say, okay, but how do we know these are actually real credentials? How do we know they aren't just test keys or high-entropy strings that look like secrets but aren't? We get around that by validating them. If we can, if we find an AWS credential like the one I just leaked, we check with AWS to see whether it's real, and if it's not real, we ignore it. (A rough sketch of what such a validity check can look like is shown below.) If we look at the progression, we're leaking a lot more secrets than we used to: in 2020 we found three million, and now we've found 10 million. Part of this is explained by growth in GitHub: more code, more secrets. But it's also because we're using secrets in different ways. Now we've got infrastructure as code, which changes how we use credentials programmatically, so we're using them in more places, which is why we're finding more secrets. We can also look at the types of files that leak these secrets. Python is number one, not because of anything to do with Python, just because it's the most popular language. But we find them in lots of places: JSON files and .env files are really big ones. I leaked mine in a .env file to make it a little easier for the attackers. So we really find them in all sorts of places, and we have a very long list of other file extensions as well.
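To make that validation point concrete, here is a rough sketch of how a validity check for an AWS key pair might look, using boto3's STS GetCallerIdentity call. This is an illustration only, not GitGuardian's actual pipeline; the function name and logic are placeholders.

```python
import boto3
from botocore.exceptions import ClientError

def aws_key_is_live(access_key_id: str, secret_access_key: str) -> bool:
    """Ask STS who these credentials belong to. If AWS answers, the key
    pair is real and active; if it rejects the call, it isn't."""
    sts = boto3.client(
        "sts",
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
        region_name="us-east-1",  # STS is global; any region works here
    )
    try:
        identity = sts.get_caller_identity()
    except ClientError:
        return False
    print("valid key for account", identity["Account"], "->", identity["Arn"])
    return True
```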
And then we can look at the types of secrets that we most commonly find. Data storage is number one, so that's databases, but cloud providers are 20%. That's 2 million cloud provider keys that we found last year in public repositories, and remember, these are only the valid cloud provider keys. You can do a lot with 2 million keys: if we never wanted to pay for cloud hosting again, we could easily do that, but we're not that malicious. There are lots of other interesting categories. Version control platform keys always amuse me, because these are your GitHub credentials to your private repositories that you've somehow put in a public repository. A bit weird, but it happens. Messaging systems are another big one. I love these as an attacker because they let me launch internal phishing campaigns using your own messaging systems: with a Slack webhook, for example, I can post directly into your channels. If we want to look at specific secrets, there are thousands of secret types that we look for, so it's a very long list, but Google API keys are really number one, with Google Cloud keys further down, and we find lots of other interesting things as well, like Google OAuth tokens, which are very sensitive and which we find in large numbers. So there are lots of different secrets out there on GitHub. All right, so how do attackers find these? There are a couple of ways. This is the first one, and it's the least interesting in my opinion, but I'll talk about it because it's the easiest: just using the GitHub search feature to try and find credentials. Here I'm searching for a file named credentials and looking for an AWS access key ID inside it. The syntax has changed a little since this slide, but it's the same idea. The reason this isn't that great is that most of the secrets on GitHub are buried in commit history. When you do something in version control with Git, a record of it is kept forever in your Git history, unless you rewrite your history, which is a whole nightmare. Search only looks at the current top-level files, so it misses most of the secrets you could find, and it also produces a lot of false positives. Still, there's a lot of what we call GitHub dorking, and if you have enough time, you will be able to find things this way. But there's a much easier way to do malicious things with GitHub, and that's abusing the GitHub API. GitHub has an API at api.github.com/events. You don't need authentication to look at it, anyone can, and there are a bunch of events on there. Two are interesting to us: the public event, when a private repository is turned public, and the push event, when code is pushed. I can show you what this looks like. This is it here, and you get information like the email addresses of users, all public, no authentication needed. So if I wanted to target a specific organization, say I only wanted to scan commits made from Twilio email domains because I'm targeting Twilio, I can filter for that. The credential I leaked is in this ledger; it's been published here, and this is how the attackers are finding it: they're monitoring this feed, they're scanning it. It's very easy to do, and that's how they're going to find the credentials. A rough sketch of that kind of monitoring is shown below.
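This is a minimal sketch of the kind of polling a scanning bot could do against the public events feed; it isn't any particular attacker's tooling, and it only lists commit URLs. A real scanner would then fetch each commit's diff and run secret detection over it.

```python
import requests

# Poll the public events feed once and pull out the commits from every
# push event. No authentication is required, but unauthenticated calls
# are heavily rate limited.
resp = requests.get("https://api.github.com/events", timeout=10)
resp.raise_for_status()

for event in resp.json():
    if event.get("type") != "PushEvent":
        continue
    repo = event["repo"]["name"]
    for commit in event["payload"].get("commits", []):
        # commit["url"] is the API view of the commit, which includes the
        # full diff -- exactly what you would scan for leaked keys.
        print(repo, commit["sha"], commit["url"])
```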
So when we say public repositories, it's easy to think: okay, it's public, therefore if someone knows it exists, they can view it. But we also have to understand that it's broadcast. It's not just public in the sense that someone needs to know you exist in order to find it. It's on a ledger, and they don't need any information about you. If you leak it, someone is going to find it. Here's a quick example of a real-life attack that happened to Toyota. A Toyota contractor, so not Toyota themselves, leaked database credentials belonging to a mobile application called T-Connect in a public repository, and adversaries were able to find them. What's interesting about this, and why I like this example, is that it wasn't even Toyota that did it; it was someone working with Toyota. And we have lots of examples of source code being leaked, or what I call being involuntarily open-sourced. There are lots of examples here of source code being publicly leaked. Take Twitch as one example; Samsung is another. In Twitch's case, 6,000 repositories were leaked due to a misconfiguration, and when we scanned them we found over 6,000 secrets, including 194 AWS credentials. This is pretty typical. It isn't because Twitch was terrible; it's just that there's a lot of data, and secrets live in source code. There's another way to find private source code, and that's doing wide-scale scanning for .git directories. When you run git init, it creates a folder called .git that holds all your metadata and the entire history of your project. It regularly happens that these .git folders end up on your web servers, and if your website is public, your .git directory is public too, which means I can find not only your source code but all of its history. CyberNews did some large-scale scanning and found 2 million accidentally exposed .git directories. That's a problem, because if you think your source code is private, it's much less private than you think: even without all of this, it's cloned onto your developers' machines and backed up into wikis. So as an attacker, I know I'm going to find secrets in your source code, and there's plenty of opportunity for me to get at it. Why do secrets end up inside source code? Why is this such a problem? I'm sure no one here would hard-code credentials into their source, I'm sure no one would do that, but why does it happen? I'll give you the most common example we see. Here we have a very simple Git branch diagram: you have your main branch, but you also have some development branches on there. Let's say I ask you to create an integration with Algolia: here's a key, please build this new feature. So you go off on a feature branch, and the first thing you do, just because you want to test it, is add the secret, that green dot there, straight into the code, just to quickly test it. You're on your own branch, no one's going to see it, it's fine, you're testing. Right, it works. Now you remove the secret and put it in an environment variable, or however you handle them, and later on it comes to code review. Your reviewer isn't going to look at all your history; at least, I haven't met a reviewer that does. Maybe you do, that's fine, but that's a lot of work. They're going to compare the latest version, which has no secrets in it, with what's in the main branch and make sure it's going to work well.
They're not going to go through the history, but that's exactly where the secret is. This is why we have so many secrets in our Git repositories and in our source code that we don't know exist, and this is why attackers are after them so much. We also find secrets in logs and auto-generated files: you're debugging a problem, so you dump your environment into a debug log, and your environment contains environment variables, which are secrets. We find them when there's no .gitignore; a .gitignore file is a very simple way of preventing certain files from entering your staging area and your repository, and if it isn't there, those files obviously end up committed. We find lots of weird things when you use wildcard commands like git add --all: if you've got a secrets.txt or some similar file sitting there and you run git add --all, it gets captured and committed. In templates too: I don't know if there are any Django developers here, but when you create a new Django project it automatically generates keys and puts them in there, and unless you know they're there, they can end up in your repository, and even if you remove them later, they're still in the history. And then the main one: people just find it convenient to share secrets through Git, so they put them in an .env file because they think they're protected by authentication. But hopefully, as I've just shown, source code is not as private or as secure as you expect. All right, so I want to move away from source code and start looking at some other technologies we can find secrets in. Hopefully everyone here is familiar with Docker; if not, think of it as a mini virtual machine that you package your application and its dependencies in. And there's a place called Docker Hub, which hosts most public Docker images, more than 10 million of them. We wanted to look at how many secrets were in there, and with Docker, as with some other areas, we find huge amounts: almost 5% of the images on Docker Hub contain at least one usable plain-text secret. It may be for a package manager, it may be for your application. These are typically different types of secrets than we see in source code, usually more related to the infrastructure than to your services, because hopefully you've removed your API keys by this stage. But it's still a huge number of Docker images. I don't have time today, but sometimes I like to demo actually breaking apart a Docker image and looking inside it, because a lot of people think that if something isn't human-readable, it's secure. That isn't the case: you can pull a Docker image apart, decompile it, and look at all the layers used to build it up. If you're interested in a cool tool for that, it's called Dive. So let's look at an attack that happened because of leaked credentials in Docker and that also involved code repositories. Codecov is a code coverage tool: it measures how much of your application is covered by tests. It sits in your CI/CD pipeline; it does a small job, not that critical, but important. So what happened? When you use Codecov, you run their application in your CI/CD pipeline using their Docker image.
Their official, publicly available Docker image, the one people were using, contained a hard-coded credential. That credential gave access, I believe, to a Google Cloud storage bucket that contained a bash uploader script. Attackers were then able to edit that bash uploader and turn Codecov malicious. They did something very clever: they added a single line of code that said, every time Codecov runs, dump all the environment variables and send them to me, the attacker. When we're testing our application we need to build it, and we need these secrets in our environment to connect to everything and make sure it's working, so all those secrets sit in the environment. Dump the environment and you get them all. Now, if you're smart, you're using different credentials for testing than for production, but there are some credentials you can't avoid using, and the ones the attackers were after were your GitHub, or other version control system, authentication credentials. This gave the attackers access to 20,000 of Codecov's customers' private code repositories. Twilio, Monday.com, Rapid7 and HashiCorp all had their private source code exposed because of this. So again: you think your source code is private, and here's a supply chain attack that gained access to it anyway. I don't pick on companies too much, but this is one that I will, because I think it illustrates a good point. HashiCorp makes a secrets manager, probably the best secrets manager available on the market, called Vault. HashiCorp is a great company with an amazing security posture, and the whole reason and whole pitch behind Vault is that it reduces the need to ever touch credentials, and therefore you won't have secrets inside your source code. Because of the Codecov incident, HashiCorp had their private source code accessed, and guess what they found? They had to report that they had secrets inside their source code because of it. So if HashiCorp had secrets in their source code, no one else has any chance of fully solving this problem. All right, so moving away from Docker images, I now want to talk about another thing, and that's mobile applications. What is a mobile application? You go onto the Play Store, you look at the apps, and what are they? Similar to a Docker image, you assume they're non-human-readable, packaged up in some black box, and therefore secure, right? Definitely not the case. Mobile applications are glorified zip folders; in Apple's case, it literally is just a zip folder. The extensions they're packaged into are .ipa for Apple and .apk for Android, and these are easily reversible. So how can we reverse an Android application? Very simple. We can download it onto our computer using a simple tool like GPlayDownloader, decompile it with a decompiler (RedX on this slide), and then scan it with a secrets scanner; in this case I'm using ggshield. That's the whole workflow for finding secrets inside an Android application. Literally anyone can do this; it's very, very simple, you don't need any special skills, and all the tools are available. Even a naive pattern scan over the decompiled tree gives you the idea, as the sketch below shows.
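This is a toy version of the scanning step only. It is not ggshield; it's a hypothetical walk over a decompiled app directory looking for a couple of well-known key formats, and the directory name is made up. A real scanner uses hundreds of specific detectors plus validity checks.

```python
import os
import re

# A few well-known key shapes; real scanners have far more detectors.
PATTERNS = {
    "AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "Google API key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "Slack webhook": re.compile(r"https://hooks\.slack\.com/services/[A-Za-z0-9/]+"),
}

def scan(root="decompiled-app"):
    """Walk a decompiled app tree and flag anything matching the patterns."""
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                text = open(path, errors="ignore").read()
            except OSError:
                continue
            for label, pattern in PATTERNS.items():
                for match in pattern.findall(text):
                    print(f"{path}: possible {label}: {match[:12]}...")

if __name__ == "__main__":
    scan()
```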
I have a quick demo, which I probably don't have time for, of just how simple this is. I took a random mobile application from the Play Store and broke it apart; we'll just skip forward. Once I had decompiled it, I scanned it for secrets, and if we skip forward again, you'll see that we find a lot of secrets in here, including Google API keys, and if we go to the top, there are plenty more. These are all secrets we found in this one mobile application. This was a real application, and it's not particularly bad; this is just how things are. You'll see that we have valid Slack webhooks, so potentially I could post internal messages and try to trick your users, and we have valid Google API keys in here as well. We don't need to go on too much, but that's just to illustrate how simple it is to decompile a mobile application and scan it for secrets. Apple is even easier. Again, there's a tool to download the app, and when I say these are glorified zip folders: the way you extract an .ipa is you just change the extension to .zip and extract it, and then you can scan that for secrets. So how many of these secrets do we typically find? Well, first let me talk about a real-life example. This is from my friend Jason Haddix, who's an ethical hacker, and it's an exploit he found for a bug bounty. There was a bank, one of the top five American banks, and we're not allowed to say which one, that had a mobile application. One of its features was that you could take a picture of a check with the app and then cash that check. By decompiling the app and looking at the code, he found that these images weren't being encrypted; they were stored on the phone, unencrypted. He then found that they were being sent to an Amazon S3 bucket, and he found the keys to that S3 bucket hard-coded in the mobile application. And then, boom: 10,000 images of checks, in plain form, in an Amazon S3 bucket he had access to. This is a bank, right? You wouldn't expect a bank to have hard-coded credentials for something as sensitive as this, but this is the state of the world we're in. And in fact, huge numbers of mobile applications have secrets. How many? My friends at CyberNews did a full study on this. Oh, did I remove that slide? OK, I did. About half: they found that roughly half of the mobile applications on the Play Store contain secrets. So it's a huge problem, a huge problem that everyone is facing, and as an attacker I have lots of opportunity to get at these credentials in different ways. I want to go quickly back to the demo I did earlier, and let's hope we have lots of activity. I have a Slack channel here, and every time someone has tried to abuse my credentials since I've been talking, it has posted an alert that someone tried to use those credentials, along with their IP address. If we look, this first one was me testing it, but after that, these are the IP addresses we have, and there are a few different ones in here. In this window, because it aggregates in five-minute periods, we've already had about seven attempts, and since then another two, and another two. So about 10 bots have tried to exploit the AWS credentials I leaked on public GitHub in the last 20 minutes. That's how big a problem this is: if credentials get leaked on GitHub, they're going to be found. So what's actually going to happen with my credentials? Throughout the rest of the day, I'm going to get a lot more activity on them.
You'll see here that every time someone tries it, it gives me their IP address, and it also lets me know what call they made. Here it's a GetCallerIdentity call, basically checking that these credentials are valid, and it's going to come back saying they are, because honeytokens are designed to report as valid. And then what happens? Usually I'll stop getting activity after a couple of days, and then two weeks, a month, two months later, I'll see another spike in activity. What's happened in between, because we can track it, is that these credentials get bundled up and sold on dark web forums. So the first group of attackers is really good at discovering credentials, but not very good at doing anything with them, so they sell them to a group of attackers that does know what to do. Perhaps they want to run crypto miners; for cloud keys like these, this is often what they'll use for DDoS attacks: they gather lots of valid credentials and then use them to do malicious things. Or they might be looking for specific companies. The email address on my GitHub account is @gitguardian, so if someone wanted to attack GitGuardian, they could bundle credentials by employer: here are the Slack credentials, the Twilio credentials, the keys from people who work at these companies. That's what an attacker is going to do. So how do we prevent this? First of all, we've got to stop hard-coding credentials. This is really the easiest thing we can do. Up here we have an example: your API key, sitting right there in the code. Even if this is just a test, even if we just want to see whether it works, we should never do this, because it's going to end up in our history. And if you've ever had the experience of trying to rewrite history in a group project, you'll know the unbelievable pain that comes with that. Once secrets are in there, they're pretty much in there for good. The fix is to keep the value out of the code entirely, along the lines of the little sketch below.
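Here is a minimal before-and-after sketch of that point. The variable name is made up; the idea is simply that the value lives in the environment (or a secrets manager) rather than in the code and therefore in the Git history.

```python
import os

# Hypothetical example. "SOME_SERVICE_API_KEY" is an invented variable name.

# Don't: the key is now in your Git history forever.
# api_key = "sk_live_abc123_do_not_do_this"

# Do: read it from the environment (or a secrets manager) at runtime.
api_key = os.environ["SOME_SERVICE_API_KEY"]
```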
Beyond that, we need to use proper secrets managers, and the best option isn't always obvious. The best secrets manager, as I've talked about, is probably HashiCorp Vault; maybe there are others you could argue are just as good or better, but it's really at the top. The problem is that it's very heavy. If you've got a team of five people working on a project and you want to use Vault, you basically need one of those five just to manage Vault and run a secrets server. It's very heavy, it's very complicated, and what tends to happen is that you get sick of using it, so you end up keeping a secrets.txt file around just so you don't have to deal with it. So maybe Vault isn't the best solution for you. Then you can go to the SaaS versions of Vault: Doppler, which isn't on this slide, Akeyless, and 1Password, which has a great secrets manager for developers with cool stuff like VS Code integrations. If that's still too heavy and you don't want a dedicated secrets manager, and you're hosting on the cloud, the cloud providers have their own secrets managers. These lack a lot of features and make it difficult to share secrets, but at least if you start using them, you'll build the habit. And then the last option, the one every security person will tell you is a bad idea, is terrible, and I'm one of the few who will say it's okay: encrypting your secrets and then storing them in an encrypted file in Git. This is a terrible idea for a few reasons. It gives you a single point of failure, and that encrypted file is going to sprawl along with your source code, so if it does get cracked, you have a problem. However, the reason I say it's okay is that if this is what it takes to get you to stop hard-coding your credentials, if this is the lowest bar I can get you to clear, then do it, just do that. I'll work on the rest later, but if you can just encrypt them to start with, that's really where we need to start. Next, use automated secrets detection. At this point there are a lot of secrets detection tools that are really good, and a lot of them are open source. I work for GitGuardian, so I'm totally biased in anything I say about them, but we have commercial tools available with dashboards and so on, and we also have some open source tools, and there are lots of other open source tools: TruffleHog and Gitleaks can all detect secrets, and it really depends on what you want to do. If nothing else, start with the tools: all of these can be used to create things like Git hooks to prevent you from committing secrets, and they can also be used to scan your directories; I used ggshield to scan the mobile applications and other things like that. And just some final thoughts. Rotate your secrets regularly and don't use long-lived secrets. The added benefit of rotating regularly is that you know how to do it: when a secret gets leaked you might be alerted, you might see traffic you're unfamiliar with, but often no one knows what that secret does, who's in control of it, or what happens if it gets rotated. If you have a rotation policy, you're going to be good at it, so if you do have a breach, it's going to go better. Limit your privileges and stop creating admin tokens: if all you need to do is read information, make sure that's all you can do with that key. And whitelist your services, so if you know this service is only meant to be talking to that service, make it so that only that can happen. And that's pretty much it. Here are some QR codes. The State of Secrets Sprawl is the report all this information comes from, so you can download that if you want, and here we have a white paper on how to manage your secrets and benchmark yourself against other people. But thank you all for coming out early and watching me, and if you have any questions, I'll gladly take them now. So thanks, guys. Any questions? Yes? [Audience member asks whether some of the hits on the honeytoken might be benign scanners rather than attackers.] Yeah, so one of them definitely is. Oh, sorry, thanks, yes: the question was whether, in the information we have on the Slack channel, some of the hits could be good, or whether all of them are malicious. I don't know about all of them. I know that some of them are malicious because we can monitor it, but some of them are good, too. One of these IP addresses is going to be from Amazon themselves. Amazon is actually one of the companies doing the most to prevent secret leaks, and they are scanning GitHub themselves. If they find a key, they will actually try and alert you to the fact that your key has leaked. So some of the hits will definitely be them, but there are thousands of credentials; this is one area where Amazon is doing particularly well.
And yeah, there used to be some other services, shhgit or something like that, that would do this too. That was kind of a gray-area service: it wouldn't alert you, but you could see all the secrets. So yes, some of them are benign, and you'll probably notice that some of the IP addresses are the same, like these ones. It's hard to know for sure, but there's definitely some malicious activity and definitely some good activity as well. Any other questions? No? No problem. If you want to learn how to make honeytokens like this, it's incredibly easy: in 10 minutes I'm running a workshop on how to do it, using only open source tooling, and it's a lot of fun. So if you want to know how to make these, I'm running that workshop in about 10 minutes, in A218 I think. But yeah, thanks everyone for paying attention, and I hope to see you again soon.

Okay, then, is it on? Can you hear me? Okay, great. Good morning, everyone. I'm Martin, this is Liz, and we're both from the Red Hat Cockpit team. I must say thank you so much for coming here. We're competing against the Sunday morning slot, the social event and everything else, so I actually feel kind of honored that we have such great attendance today. For this talk we assume some basic familiarity with Cockpit, but if you have really never seen it, here's the short, short version. Cockpit is conceptually a Linux session that runs in your web browser; we try to make it for a server what a desktop environment is for a desktop, so it's the UI for your server. It's a tool for experimentation, for learning how Linux works, for newcomers, also for troubleshooting, which we put a lot of effort into, and for doing the infrequent tasks you don't keep in your head, like how do I resize my LVM. To understand this talk, you need to know a little bit about how Cockpit works internally. Consider what happens with a normal interactive SSH session, at the bottom here. You want to do stuff on some remote server sitting out there, and that stuff usually entails running programs, doing something with files, perhaps talking to a TCP port, and so on. But everything SSH gives you is essentially a pair of textual pipes: standard in and standard out. So you need something in between that translates between these two text streams and all the operating system interfaces, and for a normal interactive SSH session that's usually a shell, like the bash most people use. Now, Cockpit is a web UI written in JavaScript, but it's actually in the same situation: it runs in a browser, possibly on the other side of the world, and the only thing it has is a protocol called WebSocket, which for the purposes of this talk is essentially the same as a two-way pipe, a text stream there and a text stream back. For Cockpit, the thing in the middle that translates between these two sides is called cockpit-bridge. That's essentially a multiplexed JSON stream that translates all these operating system interfaces into the WebSocket protocol that the user interface can understand. So what does that look like? Time for a demo. This is the Cockpit Flatpak; if you have never seen it, we believe it's the easiest way to consume Cockpit. It's essentially a very minimal WebKit-based web browser, the Cockpit web server, and an SSH client, wrapped into a Flatpak. So here you can connect to pretty much any SSH target.
You can give it an IP, a host name, an SSH alias, a username, and so on. So let's try what happens if I connect to my Fedora server. I don't have SSH keys set up, so I just enter my password here. Super secret: foobar. And this is dark mode; it might look better on the video. So here you have the familiar Cockpit user interface where you can do things, and I don't want to go into too much detail here because I assume you already know it. And if you look at what's running, you'll see there's the sshd process, and under it this cockpit-bridge thing I've just been telling you about. We opened the Terminal page here, so the terminal is this bash process; that's the one I'm currently in, and I was running ps fx. So far, so good. And this works, oops, this works because the Fedora server has a bunch of Cockpit packages pre-installed. Okay, so now let's put ourselves in the position of Cockpit and sshd and do this manually. What we can do is run the same cockpit-bridge in a mode that's slightly easier to use for a human. The cockpit-bridge works, as I said, as a multiplexed JSON stream, and it works in terms of channels. So what we can do is, for example, run a program. Ooh, how do I mark stuff here? Okay, so we can paste that in. Sorry, it's not at the top. Ah, yes. What we see here is that we open the channel and we get a bunch, oops, yes, we open the channel, it worked, and then we get a bunch of standard-out blocks as they come in, so it's coming in chunks, as we know from ping. Eventually the command finishes, gives us an exit code, and says that it's done. And the bridge has a lot of these channels: you can do file operations, you can do D-Bus calls, or, for a more specialized example, metrics. So this is the second one, oh yes, thank you. Here we open a metrics channel that measures the current CPU usage. Of course the numbers here are very small because this VM doesn't really do anything, but I hope you can see by now that this is the idea: the cockpit-bridge is the guts, and everything the UI does is implemented in terms of these channels. Okay. Can we manage other things that aren't Fedora, right? Absolutely, of course. I mean, we support Cockpit on a lot of operating systems: Debian, Arch, Ubuntu, openSUSE and so on. So of course we can also, where's the Flatpak? I stopped that. I think you did it in this one, but you have to do it. Ah, sorry, yes. So for the talk I brought up a CentOS 9 Stream cloud instance; let's connect to this. And of course, cloud instances usually have SSH keys set up, so we don't even need to type a password. And of course, as you see, it's the same Cockpit interface that you are used to, and it's the same ease of use, and... hold on, hold on, hold on. This is embarrassing. What, what, what? Come on. Oh, really? You forgot to install Cockpit on the server? This is totally wrecking our talk, man. Yeah, really, no, this is, yeah, this is a bit sad. I mean, this is one of our platform products, right? It has real users and customers, but I think this makes us look seriously uncool. Okay, we need a solution, we need a hack. Okay, what language did you say this Cockpit interface is written in? So, what can we do here? I mean, you need to have that bridge pre-installed somehow, right? I mean, it's a C program. So we can compile it.
Yeah, but I mean, it's a C program so that it's performant and we can talk to low-level system interfaces, and, like, but I mean, this doesn't work. How do we get the bridge there without having the bridge? It's taking too long, man. Like, our talk, they're going to start showing us the 10-minute sign and stuff. Okay, so what do we do? We could rewrite the bridge in Python. In Python? Yeah. No, no, no, no. Let's make it easier, I promise. No, no, that can never work. I mean, everybody knows Python is way too slow, and the C bridge is thousands of lines of code, it would take us years to re-implement, and anyway, even if you do rewrite it in Python, how do you get it to the other machine, right? I mean, seriously, people, I ask you: what has the Python empire ever done for us? Nothing, right? There was nothing. Nothing, really? Yeah, that's what I'm saying. Well, hey, it's portable. It's portable? Yeah, okay, it's portable, I'll give you that, but otherwise, really, nothing, right? It's also really easy to write async code. Like, there's a lot of that in the Cockpit bridge, all these callbacks, and it's annoying, and you don't have to deal with that in Python. Yeah, but I'm sure that's way too slow. No, it's actually performant. It's incredibly fast. Really? Yeah, okay, so it's fast enough, and it's portable? Yeah, yeah, yeah, okay, but still, I mean, this can't be good enough, right? Yeah, okay, it's efficient and fast to develop in, but, well, I'm really skeptical, right? Also, like, it's not the 80s anymore, man. Yeah, okay, but aside from being a modern language, and being available everywhere, and being portable, and being easy to develop in, and being async, and being really fast, I mean, what has the Python empire ever done for us? Okay, let me slow you down here. I wrote this program, actually, in Python. It's called Hello World, maybe you've heard of it. What? Yeah. You're kidding. Hold on, I'm going to have to show this to you. Yeah, and you can actually run this program in places where it's not installed. Hello World! Yeah, it's super complex, I know, but let me give it a shot. Yeah, yeah, yeah, do it there. So we have this idea of running programs in places where they aren't installed. Where have we seen this before? We actually have, quite normally, the situation where a program is not installed on your computer: you go to a website, and the website wants you to run this program. So it uses a mix of pervasive technology in the form of a protocol, which is HTTP; everybody has HTTP on their server or client. And then we have this absolutely ubiquitous execution environment, which is JavaScript, HTML, CSS. That way you can get applications from one place to a place where they're not installed, and people can run them. So we sort of thought, maybe we turn this idea on its head. We can use a ubiquitous protocol, SSH, and a ubiquitous execution environment, Python, and we can use this for programs that live on the client and get sent to the server. So the client tells the server what it wants to run; basically, it's the other way around from the web. And this particular stack of technology, SSH and Python, that's the Ansible world; it's pretty widely supported on almost any server, except extremely minimal ones. And we built some tools to help us do this. In its most naive form, the idea looks like the little sketch below.
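A minimal sketch of that naive idea (not beiboot itself): start a Python interpreter on a remote host over SSH and pipe a program into it on stdin. The host name is made up, and, as noted in the talk, this simple version gives up stdin for later interaction, which is exactly what beiboot's multi-stage bootloader works around.

```python
import subprocess
import textwrap

program = textwrap.dedent("""
    import platform, sys
    print("hello from", platform.node(), "running Python", sys.version.split()[0])
""")

# Naive version of "run a program where it isn't installed": start a Python
# interpreter on the remote side over SSH and feed the program text in on
# stdin. Once we send EOF we can no longer talk to the running program.
subprocess.run(["ssh", "example-host", "python3", "-"],
               input=program, text=True, check=True)
```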
So beiboot is the first of these tools, and it's basically a way of taking a Python interpreter, anywhere you can get one, and running an interactive Python program in it. What does it mean for an interpreter to be running in a different environment? Well, we have all these kinds of commands, and probably a lot of people are familiar with them, that basically mean: run some command somewhere else, in a different context, connect up its input and output, and let me see what's going on with it. The one we mostly care about, in the case we're demoing here, is ssh, but there's also sudo do-something, which runs the same command but as root; or you could run it inside a container; or, if you're familiar with this command, from inside a Flatpak it gets you access to the host system, if you have the right permissions. And the command in question that we might be interested in running: why not Python? So if you look at all of these commands, when you run them they all present you with the same interface, which is a Python interpreter: you can type stuff into it and you can see what comes out of it. And that's what beiboot needs to do what it does. The next technology we have that enables this is called beipack. The problem is that you have a Python program, and it's like a billion lines of code split across a bunch of different files, and maybe you use some modules. You can't just shove all of those files into the Python interpreter and expect it to work. So we basically have a way of taking a complex program, split over many files and modules, and turning it into a single Python script. It uses some loader magic in importlib; it's basically its own importer. And yeah, you end up with a single script which has many files inside of it: it basically takes all the files, puts them into a Python dictionary, hands that dictionary to the loader, and adds the loader to the import path. And we have, as I was saying, this hello world demo. Oh yes, and we know zipapp exists, but there are some problems with it: you have to write it to the disk before you can run it, it has to live on a physical path that's on the disk, and we wanted to just send it over SSH and go. So here is this demo that I put together. We have an app, and it's a standard hello-world kind of thing, but it's using a library, because getting the Python version is pretty complicated, and getting the name of the OS you're using is pretty complicated, so of course we need a library for that; we like modularity. And that library looks like this, and it has a bunch of functions in it. So if you were to try and run hello.py on another machine, you would need to make sure at least these two files get over there, and that's not the kind of thing you can do by just sending one script to the interpreter on standard in. What we can do instead is run this command, and it creates a beipack of these two files. And this can get quite sophisticated: beipack can bundle modules, and it can even do a PEP 517 build of a source tree, but for the sake of argument here we just use these files. And just like a zipapp, if you're familiar with that, you can say: here's a bunch of files, they form a Python library, but what are you going to do, import them all and then nothing happens? You need a place to start, so you can tell it where the main function lives. The toy sketch below shows the underlying trick of bundling module sources into a dictionary and importing them through an importlib loader.
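This is an illustration of the idea only, not the actual beipack code; the module names and sources are made up. A tiny meta-path finder serves module source out of a plain dict, so a normal import statement finds the bundled code even though nothing exists on disk.

```python
import importlib.abc
import importlib.util
import sys

# Bundled module sources, keyed by module name (beipack builds something
# similar out of your real files).
SOURCES = {
    "info": "import platform\n\ndef os_name():\n    return platform.system()\n",
    "hello": "import info\n\nprint('hello from', info.os_name())\n",
}

class DictLoader(importlib.abc.Loader):
    """Execute a module whose source lives in the SOURCES dict."""
    def create_module(self, spec):
        return None  # use the default module object
    def exec_module(self, module):
        code = compile(SOURCES[module.__name__],
                       f"<bundled {module.__name__}>", "exec")
        exec(code, module.__dict__)

class DictFinder(importlib.abc.MetaPathFinder):
    """Let the import machinery find the bundled modules."""
    def find_spec(self, fullname, path, target=None):
        if fullname in SOURCES:
            return importlib.util.spec_from_loader(fullname, DictLoader())
        return None

sys.meta_path.append(DictFinder())

import hello  # the designated "main" module runs and prints its greeting
```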
If you look at what's inside the generated file, you see that it's actually this code, stored in dictionary form: here's the hello.py part of the dictionary, and here's info.py, and it's all in one ginormous dictionary. That gets passed to the beipack loader, which implements some importlib magic that lets you import your program and run it just like a normal Python program. And this turns out to be pretty flexible: you can basically do everything you could do in normal Python. You can import other modules, and even if you have binary data files and you use the resource loader in importlib, that works with this too. We make use of all of these features, which is quite nice. Yeah, so I mentioned before that we have beiboot, and this is the thing that works with the beipack. The idea is that when you build your software, you make a beipack ahead of time, and that's sort of your main deliverable; beiboot can then consume this beipack, yep, and deliver it to different environments. If you just run beiboot on its own, it does the thing that I mentioned: it gets you a Python interpreter somewhere. But if you run it with this, then it will run that application somewhere, and by default that somewhere is here, so it just runs it in the local Python interpreter. And yeah, you'll notice it says toolbox there; that's just the current environment, inside my toolbox. And what's interesting to note is that we're blasting this program over standard in, but standard in is still available for interaction with the user. We use standard in to get the program up and running, but it's not like we just cat the Python script into standard in, because then you'd send EOF and it's game over. So we have this sort of multi-stage bootloader process, so we can continue communicating with the application after we boot it. And yeah, as promised, you can SSH to the Fedora server, and we see that we're logged in there, and I'm still Rupert, I guess. And I can do this with sudo, and then you can see now I'm root in my toolbox. And I mentioned before you can escape the toolbox: now I'm out on my laptop, and you can see that it's Silverblue on my laptop. And this is basically the core enabling technology that we wanted in order to make this possible. And one feature that we have in this, which is a little bit crazy to talk about: we gave the machine the ability to reproduce itself. I know, I've seen some sci-fi movies, that usually goes poorly, but this is important for our case, because when you SSH to a remote machine with Cockpit, you have Cockpit running as your admin user. That's nice, but for most of the stuff you need to do, you need to be root. And the way that works with Cockpit is that there's a concept of a peer bridge: Cockpit will start sudo and then run another bridge under it, basically the same program again. And if it's installed on the local machine at /usr/bin/cockpit-bridge, that's great, you can just say sudo /usr/bin/cockpit-bridge. But here we need to take this program that's running and has no files on disk whatsoever and run something over sudo. And the thing that we can run over sudo is, again, another Python interpreter, and then the first program can pass its own code to the second program.
And the way that works is that beiboot has a stage-one bootloader, which is the very first thing that gets sent to the Python interpreter. It takes the source code that it ends up downloading from the client, more or less, and it passes that in a special variable which is recognized by the beipack loader, if it's present. Then that dictionary, which I showed before, gets added back, with the same name it had on the host, to the loader. So you can have a program that says: okay, import this beipack, send it somewhere else; and then from the place where the copy of this program is running, it can further import that same beipack in exactly the same way and send it on further. So I think we're ready to demo how we use this in Cockpit now. So yeah, remember that CentOS 9 Stream machine that was so stubborn about letting us log in? I think, I know where I am now. So, what was it? It's on the next one, higher up. All right. So let's close this old, non-functioning one and use the super cool one. This one is currently available on the beta channel if you want to give it a try. So you log into CentOS 9 Stream, and magic happens: we get a Cockpit. It's apparently not super healthy, one service has failed, but you notice that it's got a lot more pages available here. And yeah, this is magic, right? Yeah, it's pretty cool. Let's see what's running in the terminal. Yeah, absolutely. You see, now there's no cockpit-bridge anymore; it's this, exactly the Python interpreter which got the whole bridge piped in. And that's it, right? I mean, nobody needs more features. We could become root. Root? Oh, that's even more magic, right? Well, let's see how that works. Yeah, the machine can replicate itself. And I am root. It says I have administrative powers. Let's check that: let's change the host name to, I don't know, something. And yes, it changed it, we are root. And yeah, this is basically calling sudo and then running the bridge again underneath, just as we described. Liz, I think that restored our cred on this, right? Okay, so, where and when can you get this, and what is our plan for it? We've been developing this Python bridge for the last couple of months, kind of on the side, on the main branch of Cockpit behind a configure option, so all the releases that you've been getting in between were still using the trusted old bridge. But just last week we fixed the last critical regression that we noticed, and we felt it was finally time to unleash it on the public to get more field testing. So just three days ago we released this Python bridge, provided to Fedora Rawhide, to Debian unstable, soon in testing hopefully, and, thanks to Jelle, also to Arch. We now want to let this settle down a little bit and collect regression reports, and as long as nothing dramatic happens, we also want to release the Python bridge soon to Fedora 38 and to RHEL and CentOS 9 Stream. But we want to be cautious, so we will not change the Debian stable backports, the Ubuntu LTS backports, or the RHEL 8 updates, because these are long-term support releases and, yeah, we are still not completely sure of ourselves. That beiboot magic functionality that you've seen in our demo is, for now, only available in the Flatpak. For now it's in the beta that's released, so you can try it out yourself from the Flatpak beta channel, and hopefully soon also in the regular Flatpak.
And we also want to deploy this to the cockpit/ws container. That's sort of the Kubernetes, cloud, server-side equivalent of the Flatpak, for when you cannot run the Flatpak, for example if you have a Windows or mobile client. But of course, all the operating systems that I mentioned will be supported as connection targets, so you will be able to connect to Debian stable or RHEL 8 with the Flatpak, to these operating systems which don't have any Cockpit packages installed. And yeah, for now the distro and container worlds are separate use cases, because we're still having some discussions and trying to figure out where we want to go with this; there are some, let's say, colliding architecture and design decisions that we need to make. But for now, we are mostly looking for feedback. So if you have any questions, or you want to try it out and run into trouble, please don't hesitate to contact us. We have a homepage, yeah, this one, and we have a release blog; the latest release post will show you how to install the Flatpak and of course gives all the other contact information, like our repository. You can find us on Matrix these days, and the mailing list, and so on. So, time for questions, I think. The question was: can you run ps fx again with the sudo bridge? Of course. If you just do ps fx, you won't see the sudo one, right? Yeah, but with ps fxa, you see, we've got this nice tree here. So this is the original, oops, so this is the original, let's try again. No, this one, yeah. This is the Python bridge, and now this is the magic where we run sudo and then have the Python interpreter again, and of course that's the thing that runs as root. I think it won't show the user here, but yeah, that's the tree that we expect from this kind of staged, self-replicated thing. Next question? The question is: if you have a Python library that has, for example, C code in it, does beiboot work with that? The answer is definitely no. We made this design decision specifically to avoid the case where we have to compile stuff, and I'm pretty sure that if you had an .so file or something, it would need to be on the file system for the dynamic linker to find it. And yeah, remember, the only assumptions that we want to make are sudo, SSH and Python. You could be logging into an ARM machine or an s390 server or who knows what. So maybe there are some cool tricks you could play there, but, anyway, no. Next question: how do we start services from Python? Yeah, so this is maybe a bit of an interesting question that gets into the architecture of what we did in the Python bridge. We do most stuff in cockpit-bridge over D-Bus, and this was one of the first components: okay, how are we going to do D-Bus from Python? Because this is not part of the standard Python library. And we decided on something that's pretty universal; I mean, Cockpit isn't going to do very much without it anyway, and that's systemd. And libsystemd has a D-Bus library inside of it, which does not come with Python bindings. But what we did, and maybe this is also interesting for you, is we have a fairly comprehensive binding of that D-Bus library using ctypes. And this runs everywhere, because it's pure ctypes: you just need to open the library file at runtime, and you don't need to compile anything. A tiny sketch of that ctypes idea is shown below.
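Here is a tiny illustration of the pure-ctypes approach, not the actual systemd_ctypes code: open libsystemd at runtime and call one of its plain C functions, with nothing to compile.

```python
import ctypes

# Open libsystemd at runtime; nothing needs to be compiled or installed
# beyond the library itself (this raises OSError if libsystemd is absent).
libsystemd = ctypes.CDLL("libsystemd.so.0")

# sd_booted() is a plain C function from the sd-daemon API: it returns a
# value > 0 if the system was booted with systemd.
libsystemd.sd_booted.restype = ctypes.c_int
print("booted with systemd:", libsystemd.sd_booted() > 0)
```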
The question was how this runs on embedded systems which don't have systemd; yeah, that's outside our use case. I must also say Cockpit in general doesn't run without systemd; everything here, like NetworkManager, is pretty high-level stuff, and half of the overview page is systemd. Would you recommend, or use, beipack as a packaging format just to distribute programs? You could: you can take a beipack and just cat it into the Python interpreter, and if you don't care about standard in working, that actually works. But I feel it's very specific as a format to be used with beiboot. Yeah, my gut feeling is to treat it more like a compiler than like a distribution model: you would run beipack during your build process, and that becomes part of the deliverable that you deliver through another mechanism. Thanks very much. So we've exhausted everyone; well, thanks for your attention.

Can everyone hear me? Okay, great. Okay, hi everyone, welcome to this talk, which will be an introduction to Sigstore and Python. My name is Maya, I'm a software engineer at Red Hat. I work on the Emerging Technologies Security team, and you can find me on social media, on Twitter, Mastodon and GitHub, under those handles. So this talk will be about supply chain security, and I would like to try to answer two questions: what are digital signatures, and why are they so important in this context? So maybe you already know this, but software supply chain attacks have increased by more than 700% over the past three years, which is a huge increase, of course. And a lot of these attacks have been targeting the Python ecosystem, and in particular the Python Package Index. Attackers use techniques like typosquatting, dependency confusion, or taking over a maintainer's account on PyPI to try to inject malware into important libraries. This increase in malicious package uploads was so significant that recently the PyPI maintainers decided to temporarily suspend new user registrations and package uploads: they got so overwhelmed by those attacks that they couldn't handle the volume of malware anymore and just had to temporarily deactivate this important function. So we're talking about the software supply chain, and I would like to try to define what exactly a supply chain is. For the sake of this talk, we can say it's the end-to-end journey that software takes from development to distribution, and it involves all the tools and the people that are responsible for delivering the software: developers, version control systems, build systems, registries, package indices, and also deployment platforms and production environments. When you have a supply chain, what attackers usually play on is the expectation of developers that every step in their supply chain is going to be systematically reproducible, which of course is not true, and that creates vulnerable links that attackers can exploit to push malicious software into your supply chain. So why are cryptographic signatures, or digital signatures, important, and a key component of every secure supply chain? Because they offer two guarantees: software integrity and software authenticity. The little sketch below shows what those two guarantees look like in code.
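A minimal sketch of those two guarantees using the Python cryptography package (not Sigstore itself; Sigstore adds the identity, certificate and transparency-log machinery on top): the private key signs, anyone with the public key verifies, and any change to the signed bytes makes verification fail.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()
artifact = b"my-package-1.0.tar.gz contents"  # stand-in for a real artifact

# Authenticity: only the holder of the private key can produce this signature.
signature = private_key.sign(artifact)

# Integrity: verification succeeds only for the exact bytes that were signed.
public_key = private_key.public_key()
public_key.verify(signature, artifact)              # passes: untampered
try:
    public_key.verify(signature, artifact + b"!")    # fails: contents changed
except InvalidSignature:
    print("tampering detected")
```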
So if I were to make an analogy here, I would say cryptographic signatures are like a wax seal on a letter. It ensures two things when you open such a letter: you can see if the contents of the letter were tampered with, if the seal was opened, and the pattern on the seal allows you to uniquely identify the sender of the letter. It's the same thing with cryptographic signatures. Of course, before Sigstore, other signing tools and standards existed. I think the most famous one was probably OpenPGP and its implementation GPG. But those standards, and in particular PGP, had a bunch of problems that prevented good developer adoption and didn't allow developers to use code signing as a day-to-day tool. The first one is public key distribution. Public key distribution is the act of ensuring that your end users are going to be able to correctly identify the public key that you generated. We're talking about asymmetric cryptography: you have a private key that you use to generate the signatures, which you have to keep secret, and a public key you need to distribute so that people are able to recognize your signatures. In the case of OpenPGP, you have different methods to do that; it's called public key infrastructure — the infrastructure you put into place so that users can identify your public key. But the problem is that it's not very standardized, so the conventions for discovering public keys can differ widely between PKIs. For example, you can trust a more centralized certificate authority to say which public keys to bind to a signer's identity, or, in other models like the web of trust, you trust users: it's more decentralized, so other users can vouch for other users' identities. I put this picture on the slide; this is called a key signing party, and it happened, I think, in front of FOSDEM in 2008. It's a specific way to verify other people's public keys in person, and maybe you will agree it's not very convenient. If you have to verify a public key by meeting your colleagues in person, it's maybe not the best signing and PKI scheme. It's quite a specific example, of course — not all standards are that inconvenient — but it's still a nice illustration, I would say. Another problem of OpenPGP is private key storage and rotation. The private key is a very important component of asymmetric cryptography because you have to safeguard it at all costs — and that can literally mean costing money. You don't want your private key to leak at all, so you need to invest in some kind of secure storage, for instance a hardware security module, to guard your key, which is very costly. So you need to invest in specific infrastructure, which also implies specific knowledge about it — maybe hiring people especially for this kind of thing. And you also need to regularly rotate your private key, because compromises are pretty frequent — more frequent than you might think — so you need to think about rotating your keys as a best practice, too. If you've used GPG before, you'll probably agree that the configuration can get quite complex. Sometimes it's difficult to really understand what you're using, and especially what it involves — understanding the underlying cryptographic protocols when trying to sign an artifact — because of course not everyone is a cryptography expert. Even as a developer you're not supposed to really care about those things, but sometimes you have to.
So that's not really ideal. I put a reference here to a really good article that was published recently, called "PGP signatures on PyPI: worse than useless". You can check the link once the slides are published. It explains why PGP signatures were removed from PyPI; the title is pretty explicit, but you can still check the audit made by the author to understand why it was not worth continuing to support GPG signatures. Okay, so now it's time to introduce Sigstore, which aims to make code signing easier and more accessible for everyone. The motto of Sigstore is to become to digital signatures what Let's Encrypt is to HTTPS, so I will explain what that means. Sigstore builds, in terms of philosophy, on the model of Let's Encrypt. If you make a quick comparison between the two services, you can see that Let's Encrypt is a free and automated certificate authority: you can use it at zero cost to obtain TLS certificates and adopt HTTPS for your website. In the same way, Sigstore is also a free service that has a public good instance, and you can use it as you'd like to log your signatures transparently and publicly. In terms of numbers, Let's Encrypt has issued over 200 million certificates since 2016, which is roughly three million certificates per day, and Sigstore has stored over 20 million entries since the public good instance went GA in 2021. So what exactly is Sigstore? Sigstore is a tool that solves some common issues with current signature schemes, like the ones I talked about before, that prevented developer adoption. With Sigstore, you don't need any specific cryptography knowledge or any knowledge of PKI protocols. It has a very simple interface that makes signing truly accessible to everyone, developer or not. And you don't need to maintain your own private keys anymore, which is a big advantage, because you don't have to invest in all that infrastructure and knowledge at all. It also allows easier auditing and revocation of signatures in case they get compromised or are fake, for instance, which is still pretty rare. And signatures are bound to a public identity, not to a public key anymore. That's a big change compared to other asymmetric cryptography schemes, because you can bind a signature to something more concrete, like an email address, which is easily identifiable by a human — unlike a public key, for instance. Sigstore is composed of different sub-projects; here I put the three main ones. The first one is Rekor. It's what we call a transparency log: an immutable, append-only data structure that stores signatures so that everyone is able to verify them. The second one is Fulcio. It's a free certificate authority, and it delivers ephemeral signing certificates you can use for one-time signing of artifacts; they then expire, so people will not be able to reuse them after you. In terms of security, that's a pretty big advantage. And the certificate is usually used to verify the signature afterwards, rather than to sign multiple times like usual signing certificates. The third sub-project is Cosign. Cosign is a command-line tool you can use to sign and verify artifacts in a very simple manner. All the cryptographic primitives are already picked for you, so you don't need to care about cryptographic protocols; the command line is pretty simple, and you can use it to sign containers or blobs, for instance.
In addition to those three projects, you also have a whole set of ecosystem-specific clients. There are implementations for Golang, JavaScript, Rust and of course for Python — I will talk about that one in a few minutes. Sigstore has seen pretty large open-source adoption since it was created, especially in the cloud-native community, since it was incubated in the CNCF. It's pretty well known for being used to sign Kubernetes releases, and it's also used by some other famous projects like Kyverno, Tekton and the Python library urllib3. And it has a lot of integrations with other supply chain projects, like Tekton Chains, The Update Framework, in-toto and Kyverno as well. Okay, so now let's talk about Sigstore in the Python ecosystem more specifically, and cover a few initiatives the Python community has taken to integrate Sigstore into the ecosystem. First of all, the sigstore-python client is available for Python users, and you can do different things with it. The first is to use it as a command-line tool. You can sign files and blobs — not containers this time — using what we call a keyless signing workflow, which I'll explain later; but basically, as you can guess, you don't have to manage private keys with this workflow. Instead, you use OpenID Connect — I will go into detail about this later. You can also use it in a GitHub Actions workflow. So for example in your CI, if you want to sign a package release or a build or anything like that, it's pretty easy to use, and you can have signatures as output of your workflow. And finally, it has had a stable API since version one, which you can use to integrate Sigstore natively into a Python project. If you want to try Sigstore, it's pretty simple: it's a Python package on PyPI, so you can just run pip install sigstore and start experimenting with it. The Python packaging community also adopted Sigstore. Sigstore is referenced in two PEPs in particular, PEP 480 and PEP 458, which concern secure downloads of PyPI packages and everything related to software signatures on PyPI. The PEPs are accepted right now, not implemented, but I guess you can already see how Sigstore fits into this picture. Basically, it will enable users to upload Sigstore signatures along with their packages, and modify the API so that clients can also retrieve them and verify the packages on download. So package managers like pip, for instance, could also support verifying signatures. And overall, the goal of those PEPs is still to keep the user experience similar to before, so it doesn't add any overhead for Python developers when downloading or uploading packages, but it guarantees much more security and integrity. Here I also put an example of how Sigstore is used in the Python ecosystem: it's used to sign releases of CPython. If you go to the python.org/download/sigstore page, you will see that Python releases are now signed using Sigstore. And here's the command, copied from that page, to verify the Sigstore signatures, so you can have an overview of what it takes — it's actually not that complex. It's just sigstore verify identity, that's the command to verify; then you pass the ephemeral signing certificate and the signature file, and then you pass information about the signer, who is the CPython release manager for that version — their email address and the URL of the identity provider that was used to issue this identity binding — and then you pass, of course, the python.org release file.
Okay, so now I'd like to do a quick demo of signing and verifying a Python file with sigstore-python. Okay, can everyone see? Well, okay, great. So here we have a file called hello.py, and it does nothing special except print "Hello DevConf" — it's a very simple example. Now I'm going to sign it using this keyless workflow that sigstore-python enables. I will just type sigstore sign and then hello.py. Oh, okay, I don't have an internet connection, so that's an issue — sorry about that, let me check the wifi. Back to the demo, let's try again. So now it opens a browser page; I was redirected to this page by the command line. Here you see a login page appears, with the different identity providers — GitHub, Google and Microsoft — that are officially supported by the Sigstore public instance. I'm logging in with GitHub, so I'll just click "login with GitHub". Normally I would have to enter my email address and my GitHub password, but I already have a session open, so the authentication was successful and I can close this page. And back to the demo. Okay, so let's have a look at what we got as output from this command. We finished the browser interaction; we got an ephemeral signing certificate here in PEM format, and then you can see this line here: it says that it created an entry at index 24-million-something in the transparency log — this is Rekor, which I talked about earlier. And we also wrote a bunch of files: a .sig file, which is the base64 signature for the artifact; the certificate, which is the one printed here; and also a file called a Sigstore bundle, which you can use to verify your signature directly without needing the separate signature and certificate. So let's take a look at what is in the ephemeral certificate. You can see that the issuer is Sigstore — more specifically, Fulcio, the certificate authority of the Sigstore public good instance, which issued the signing certificate. Other than that, you can see that the SAN is my email address, the one I used to authenticate to GitHub, and the identity provider is GitHub OAuth — this URL here. And if you look at the validity time: the "not after" is in 10 minutes. So the certificate is only valid for 10 minutes to sign any artifacts I want. Right now I could reuse it, but in 10 minutes it will be over, and it will just remain proof that the certificate was issued to me, with an ephemeral public key that was generated for signing and bound to the certificate by Fulcio. So an official authority confirmed that my GitHub identity was bound to my public key, which corresponds to the ephemeral private key I used to sign the artifact. So now let's try to verify the signature. Here we have all the files we got from the signing operation, and now let's try to verify. I will just type sigstore verify identity. Then I will pass --cert-identity — that's my email address, the SAN on the certificate, so I'll pick that. The next argument is --cert-oidc-issuer — this is the identity provider that was used, GitHub in this case, so I'll just copy and paste it. And finally I just pass the artifact, which is hello.py; the verification materials are found under the same path, so sigstore just detects them. Okay, so it says OK — it means the signature has been successfully verified. So as you can see, it was pretty simple.
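For reference, the two commands from the demo, roughly as typed — shown here wrapped in a tiny Python script around the sigstore CLI, since the client itself is a Python package (assuming sigstore is installed via pip); the email address is a placeholder, not the actual value from the demo:

```python
import subprocess

# Keyless signing: opens a browser for the OIDC login, then writes a .sig file,
# a certificate and a Sigstore bundle next to the artifact (in the version shown).
subprocess.run(["sigstore", "sign", "hello.py"], check=True)

# Verification: the identity (the SAN in the certificate) and the OIDC issuer
# must match what was used at signing time. Values below are illustrative.
subprocess.run(
    [
        "sigstore", "verify", "identity",
        "--cert-identity", "maya@example.com",            # signer's email (placeholder)
        "--cert-oidc-issuer", "https://github.com/login/oauth",
        "hello.py",
    ],
    check=True,
)
```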
You didn't have to configure anything specific — no cryptography, no complex things at all. So that was it for the demo. Now I think we have a bit of time to talk about how exactly that works, so I will try to go over the workflow. Feel free to go over the slides again if this is a bit fast; they will be published. So what exactly happened here? First of all — yes? Sorry, this is the Python-specific client, but you have other ones; for example, Cosign is implemented in Golang. Here it was sigstore-python, yes. Okay, so how does that work? First of all, the client generates an ephemeral key pair. Both the private and the public keys are ephemeral: they just stay in memory during the whole signing process and never hit the disk, so you never have to see them again; they get flushed at the end of the signing operation. Then the client makes an identity proof request to one of the identity providers we saw on the authentication login page. It will ask Google, Microsoft or GitHub — and others in some other configurations — for a proof of identity for the signer. This is where you log in and prove that you are the owner of your identity. The identity provider then sends back to the client a JSON Web Token, also called an ID token, which contains the signed proof from the provider that you own your email address. Then the signing client sends a signing certificate request to Fulcio, the certificate authority, which contains the ID token, a certificate that is ready to be signed, and of course the ephemeral public key to be included in the certificate. Fulcio also has a transparency log, called the certificate transparency log. When it issues a certificate, it systematically logs it into this CT log, so that you are able to audit every certificate that was ever issued by Fulcio. It's also append-only and immutable, so you can't modify any entries once they are there — you have a full audit trail of every signing certificate ever issued by this instance. Okay, so Fulcio has signed the certificate and sends it back to the client; the client then signs the artifact, of course, and then it uploads what we call a log entry into Rekor. Rekor is the transparency log I talked about — also append-only and immutable — and then you have a proof of the artifact signature during a given time span. Okay, so that was it for the signing part, and now on to verification. The verifier is the same client in this case. It uses the verification material I showed you earlier — either just the bundle, or the certificate and the signature files. And it also requests an inclusion proof from Rekor: it asks Rekor whether it has seen this entry that was supposed to be included in the log. If it hasn't, it won't be able to verify the signature — unless you specify that you want to verify offline, but that is more specific to air-gapped environments. Okay, so that was it for the workflow. Now, if you would like to join the Sigstore community, if you're interested in contributing, I would encourage you to join the Sigstore Slack, to subscribe to the YouTube channel and the blog, and to check out the sigstore.dev website with all the community updates. So thank you, and now we'll go to the questions. Yes? So actually, they're not really related — they're kind of different components.
So the CT log — the certificate transparency one — serves as a backend to store the certificates issued by Fulcio, and Rekor serves to store the signature entries. They are two different components, but they actually share the same backend: it's called Trillian, a Merkle tree data structure. So it's used for the same purpose, basically. Yes? Okay, sure, I will repeat the question. The question is whether it is the same certificate authority that signs the certificates for Fulcio — the same backend — or whether there is a federation of CAs. Is that correct? Yeah, between different instances of Sigstore and the service. Okay, so here I talked about the Sigstore public good instance. This is one instance of Sigstore that is maintained by the community; I think there is also a staging instance, but that's more for testing purposes. In general you can use this public instance, but you can also install Sigstore and bootstrap it on your own, and then you can have a different CA backend — you can choose other backends if you have your own infrastructure. Yes? So Cosign is used to sign containers — that's the first use case of Cosign — and sigstore-python doesn't support that. The goal of implementing different clients for different ecosystems is to support native integrations into projects, so you can expose an API in your library and integrate Sigstore directly into your projects. So the question was: isn't there a risk that different clients implement Sigstore — the Sigstore protocol — differently? I think the community has been pretty good at thinking about this kind of issue. There is a working group for clients, so people coordinate so that the protocol stays consistent across implementations; there is a kind of community agreement on what is supposed to happen. Yes? Sorry, could you repeat? So the public instance runs its own — it's maintained by the community, but you can also install it on your own if you want to run a private instance of Sigstore. Sorry, I didn't understand the last part. Okay, so: how am I supposed to know the correct identity for each package signer? That's a very good question. In fact, you need to be aware of it. I mean, it's in the certificate, but obviously you need to know what you're verifying — you can't just accept any email address. So you need to know in advance what email address or identity you're looking for. Yes? Which information? Okay, so the question was whether the information about signers is provided on PyPI. At the moment it's not implemented yet, so I don't know exactly how it will be in the future. We will see that when PEP 480 and PEP 458 are actually implemented; I don't know yet what the API is going to look like. Any other question? Yes? Okay, so the question was: does it support having a single artifact signed by multiple people? Yes, of course. You can generate as many signature or certificate files as you want and store them in the same place. Yes? Oh, that's a very good question. There's no line in this diagram between Fulcio and the identity provider, but in fact you're right: Fulcio recognizes only a set of identity providers, so if you run your own instance you can choose which identity providers to trust.
In the case of the public instance it's a community agreement as well, but you're right that, of course, Fulcio has a configuration, and it specifies which identity providers it trusts. So yes, that's true. Other questions? So the question was whether sigstore-python is going to support container signing. At the moment, I don't think so; I didn't see any community initiative in this direction, so I'm not sure at all, but I don't think that's the plan as of now. Yes? So the question was: shouldn't we be able to sign containers with sigstore-python, since a container is also a file? The thing is that you have the signature part, and then you have the storage part in an OCI registry, which is a bit more complex and requires some other kind of implementation — you need to know how to store the signature relative to where the image is stored in the OCI registry. So it's not something that sigstore-python supports right now. Sorry, what? Yes, exactly. So the question was whether the client uploads the signature to PyPI — not yet, because it's not supported yet, but they're planning on it. Yes? Okay. So I mentioned that I can revoke a signature — I don't think that's exactly the formulation I used, but Sigstore makes auditing easier in general, because thanks to the transparency log, if you know when an identity was compromised, for instance, you can know exactly which artifacts are affected. You can't really revoke artifacts, but you can know which artifacts not to trust and pull them down from your environment. Yes? The question is whether Rekor would be able to record the fact that artifacts were compromised. I think this is not the case, but I would need to check the structure of a Rekor entry again; I don't think Rekor has this capability as of now. I can get back to you on this; I have to verify. Okay, so we're out of time — thank you for attending.

Hey, everybody. So I'm Michael. I'm part of the CKI team, which runs kernel CI pipelines internally at Red Hat, and I'm going to talk a bit about incident management. As we run a service, this is very important for the user experience: our customers are kernel developers, and they tend to be grumpy, so we try to keep the quality of the service high enough. To give a bit more background: we are a service team which runs pipelines. This was not something we understood in the beginning — at first we thought we developed a CI system, and it took us quite a while to understand that we are mainly running one. The point of the system itself is that we want to prevent broken kernels from hitting the internal Red Hat composes, but we also try to shift testing as far left as possible, so integrating into the upstream kernel development workflow and providing feedback on patches on mailing lists — whatever that means exactly, integrating into the upstream kernel development workflow as far as we can. And because we are a service, we also run quite a bit of infrastructure to make this all happen. We run a main GitLab pipeline — nowadays kernel developers sit on GitLab.com, they have merge requests, and it's all kind of normal: they get pipelines and green checkmarks and all those kinds of things. We also hand testing off to Beaker. In addition to this, we run all kinds of microservices, we run stuff in EC2 and Lambda, we use OpenStack, and we host our own messaging cluster.
So that's quite a bit of infrastructure that runs and obviously is also able to fail. If you want to know more, I've linked the homepage; the code is on GitLab.com, so you can take a look at what this actually means. So normally, on a normal day, we run this pipeline, we run our microservices, and we deploy quite a few changes. Because we run a service, we change it: product owners come along with requirements that need to be done real fast, so there's some churn, and that churn is also the opportunity to actually break this thing. So what is actually meant by incident management? If you look for a definition, the internet helps us out: an incident is an event that disrupts the service — that's the easy case, it just doesn't work anymore — but also one that reduces the quality of the service. That means pipelines might take longer, sometimes things fail and sometimes they work, people have to click buttons instead of stuff happening automatically. And incident management means what we do in response to incidents: how we can mitigate them so that they are not as visible anymore, and how we can resolve them. In the best case, we might actually do something to prevent them from happening again in a similar way. Now, this talk is about small service teams, and the reason I put that in there is that in a huge organization you might have dedicated teams that do incident management for you — really dedicated roles, dedicated site reliability engineers that take care of running the service, handling incidents and recovering. But for small teams, it's mostly the same people who also develop the service. So I'll do the talk in two parts: one is how to detect incidents, and the second part will be about how you actually recover from them once you've detected them. So, why do we actually want to detect incidents? It's not strictly necessary — your users will normally tell you. That's a bit cynical, but it's okay-ish in some way: if that's good enough, if your users are maybe not as grumpy or don't rely on the service too much, they might actually tell you, and eventually you should notice the biggest problems your service has. But if you want to know before they do, or if you want to be able to detect the things that are not as obvious, you need to do some work. And that normally comes in the form of a monitoring and alerting setup. And this is the first part of the talk: trying to detect these things as early as possible, so that fixing them is a relaxed affair — because if your users haven't noticed yet, maybe they're still asleep. Like an international team: you're sitting in China or in Europe and your customers are US-based, so you have a couple of hours to fix things. If you notice before them, you might just fix it before they ever see that something was broken. Now, depending on how you built this thing, whatever you are running, a lot of different pieces might fail in a lot of different ways, and that makes the whole thing as complicated as it is. The normal components of such a setup are: you have logging, so you can check what went wrong last night; you have metrics, so you can really measure how long something takes and have actual numbers — you can create some pretty graphs, not only for management, but also for yourself, to debug stuff.
You can use something to collect exceptions, if your code fails somewhere deep in the stack — sometimes it's really hard to see this in any aggregate measures, so surfacing these exceptions is one part of it. And then alerting means getting a pager alert at night, getting emails, having a Slack channel, those kinds of things. To actually make this happen — because getting different teams to onboard onto some of these things, and not others, is work — it makes sense to set it up in a way that makes it really easy to onboard whatever you develop in the future, because normally stuff just accumulates: you get another microservice, another cron job somewhere. So finding ways to make this systematic, so that you don't have to do any extra work to add another of these pieces, will make sure that people actually onboard onto these things. So I'll go through the pieces a bit. I'm not sure how many people are familiar with this stack — who knows at least one of these? Two? Three? Okay, good. Four? Five? Okay, that's interesting. And for the ones that didn't raise their hand at five, which part is missing — which one do you not know? Tracing is missing from the list, yes, but from this list, is there anything you're not doing in your own service, or which one is the most unknown? Okay, they're all known. Now let's just go over them really quickly. Let's start with logging. There are lots of ways to aggregate logs; an easy one to set up is Loki. It's basically an HTTP endpoint you send logs to, and there's a client tool called Promtail which pushes them. What's interesting, and what makes this thing pretty light on resources, is that it's not going to index your logs — it just stores them in S3 buckets, depending on how you configure it. The point is that only the labels you put on them get indexed. So you might put the service name on them, but there will be no full-text indexing. If you actually go looking for something, you basically download all the logs for a certain interval from the S3 bucket and then go through them locally. And that makes this thing pretty simple in some ways, because there's no index database that needs to be kept in shape — it's really just an S3 bucket, and you just start downloading from it. The pieces you want to put in there: if you run Kubernetes, the log files from pods; if you have cron jobs, you need to figure out a way to tee their standard output into it; if you run nodes, the journal is kind of nice — this is, for example, one of the pieces we're missing, but if you want to know what happened on a node before it went down, it's kind of interesting. And if you run stuff on AWS or any of the hyperscalers, getting the logs from those systems in there as well is also kind of nice. And one other thing, next to just being able to debug these incidents, that logs give you — at least with Loki, but I think also with other systems — is that you can actually alert based on what is in them.
So some weird issues you might only be able to find by some message logged somewhere, and Loki allows you to say: if this message is in there, complain, or send an email, or something like that. And one way of rolling these things out, if you run Kubernetes, is to figure out how to put this into whatever you use for Kubernetes YAML templating — Kustomize, Helm, whatever — so that it becomes really easy to onboard the next microservice. Prometheus is for metrics. Why do you need metrics? They allow you to measure stuff that isn't obvious otherwise, like durations, the number of jobs, those kinds of things. It's basically a time series collection system that puts labels on stuff, and it just takes a text file from a metrics endpoint. So this thing is as simple as it comes — that's the reason it's so popular. There are very few data types. You have counters that can only go up; these cope with restarts, because the counter then starts again at zero and it's possible to unwrap the curves, so that works very nicely in a Kubernetes context. You have gauges — however you pronounce that — which can vary, so that could be something that you measure continuously. And there are histograms if you care about distributions. That's basically it. Onboarding this thing is pretty easy: in Kubernetes you can scrape all pods — it takes a bit of configuration to set that up, but once you have, you deploy a pod and it will be scraped. And exposing metrics in something like Python is all of four lines: importing the package, defining the metric, doing something to the metric, and starting the HTTP server. The thing will start up an HTTP server, you put it into your pod or deployment description and the service definition, and then you can curl it, and it gives you what's shown at the bottom — in this case, a counter. This is basically it, right? This number will get ingested: Prometheus will hit this endpoint again and again at a certain interval, and that builds your time series over time. And then you can make graphs — that's why you do it, right? But even if you don't graph it, you can create rules. The query language is weird in the beginning, and even after a while it's weird, but it gets slightly easier. There are ways of aggregating across different instances and linking stuff together, so it has a very powerful query language, and you can define alerts based on these rules as well. Now for the exception handling — that's an interesting one, because normally, if you have code deployed, the usual way to handle exceptions is: catch something, log something, go on. Which is the right thing to do, because you don't want your service to go down just because there was a small problem. But that way these weird things get lost. So Sentry is a system that basically hooks into your code — it can hook into web front ends, Python, Golang; I'm pretty sure there's an SDK for nearly every programming language available, I'm not sure about Bash — and it collects and aggregates these exceptions. In Python, setting it up and configuring it is one line of code, and it will collect all the information — a lot of information — which should make it easier to debug the problem you had.
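The "four lines" for Prometheus and the one-line Sentry setup mentioned above look roughly like this in Python (a sketch assuming the prometheus_client and sentry_sdk packages are installed; the metric name, port and DSN are made-up examples):

```python
from prometheus_client import Counter, start_http_server
import sentry_sdk

# Prometheus: define a counter, expose it over HTTP, and bump it somewhere in your code.
jobs_total = Counter("pipeline_jobs_total", "Number of pipeline jobs processed")
start_http_server(8000)          # Prometheus then scrapes http://<pod>:8000/metrics
jobs_total.inc()                 # call this wherever a job gets processed

# Sentry: one call at startup; unhandled exceptions get reported automatically.
sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")
```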
So in this case, for example, the slide shows one of our deployments. At the top you see a list of the exceptions that happened during the service's operation — it's not stable — and then you get these little graphs that show you how often each one happens. Here, in the last 24 hours, there's something consistently wrong with this thing. And it shows you the message of the exception. If you click through, you get the stack trace, you get the variables of the stack frames, you get the HTTP requests that happened before. Normally that's good enough. The most important feature is that you can assign them to somebody: you can pick somebody on your team who needs to care, assign the issue to them, and hopefully they will care. And the last piece is alerting. If you find any issues — you have some metrics, some SLAs, whatever your service level agreement is — you need, at the end, to alert, depending on the severity. One thing is that you can send emails, but you can also send stuff to a pager. And this is what it looks like: there's a web interface to it as well, so you can go there and silence these alerts — after you've gathered them and found somebody to handle them, you can silence them so they don't spam everyone. So this is a pretty normal alerting stack — monitoring stack — okay. Now you know, right? The next problem becomes: what do you actually do? You found issues in your code, and now what? If you tell this to an engineer, normally what you get is basically this: "just fix it". And that is eventually, for sure, what you should do. But the point is: what exactly do we want to fix? There are both technical and social components to handling incidents, otherwise I wouldn't be giving this talk. One thing is that you want to fix the immediate problem: if it breaks your customers, you want to unbreak them as fast as possible. You also, most likely, want to fix it properly — something that happens in the right place as a real solution, not just somebody logging into some machine and changing a config file, those kinds of things. And if you're really good, you will actually find the root cause, improve on the problem that caused the incident in the first place, and do something so that it will never recur. Now, the social problem is: who does these things, and do you actually do all of them? As I said, if you're a small service team, there are no people dedicated to this handling process. It's a team responsibility, which normally means nobody does it, unless somebody happens to be inclined to do these kinds of things. And normally — what I know from our own team and from other teams in our proximity — the person who fixes it is the senior engineer who knows how the system fits together. That's the one who knows exactly what to do, will do something really fast, and might also do a proper fix. But if this person is on PTO, you're kind of out of luck. It's also really hard to learn from this person, because they just do it — there's no visibility into it. And then, depending on how stable your service is, it might actually mean that this person burns out from having all of this on their shoulders, because it all lands on them.
Yeah, so the question is whether this process includes how the work is spread across the team — dispatch. Is it sent to somebody, is somebody just picking it up, or is it actually delegated to somebody? Normally, in the small teams that I've seen at Red Hat, for example, there are a couple of people hanging out in a chat channel, an alert comes in, and somebody who cares picks it up. That's the bad example, right? But this is something that I've seen happening, and which actually works, and it's very often the case if there's no formal process around it. So yeah, you shouldn't do that — maybe I should put that on the slide, so in the next version I will add something like "don't do this". It works, but it's not recommended. That's something we figured out, and so we thought about how to come to a better process, because having one in place will hopefully reduce all the disadvantages from the previous slide. Now, if I say something about process, people's eyes glaze over — this is not something that engineers enjoy too much, talking about process; it all feels like bureaucracy, complication and management, those kinds of things. So we tried to come up with something really simple, something that engineers would actually do instead of just something that sits on some website. The first thing you have to do is create a ticket. Yes, it is that bad. It's actually pretty easy to create tickets on most Git forges — you can press a shortcut — but this is the first thing you have to do, and it's also one of the main building blocks: if something fails, create a ticket. If you ignore all the rest of the slides, take this one thing, because otherwise you will not have a conversation about the incident, and there will be no place for any team member to learn from it. There's also no place to delegate — if you want to hand it off to somebody, you need a ticket. You can post screenshots in there, and most Git forges allow confidential comments on a public tracker, which is really highly recommended. And then do it in a structured way. This is the most advanced figure in the slides — these are the states. Normally you have an active incident: something is broken, something exploded. The first thing that people normally do, and should do, is try to reduce the impact, to give yourself some breathing space. You will figure something out — normally people have some idea of what's going on — and this can be the most senior person in the room doing something really quick; that is quite acceptable. And you will get to this state called "mitigated", which means that your customers mostly don't notice it anymore. And then you can work on solving it properly. That might mean you're using GitOps and you need to change some Python code, go through review cycles, all those kinds of things — do it properly. That would be the second step. But normally there is this first step where you get it fixed really quickly. Maybe it shouldn't be that way, but let's be realistic, that's mostly what happens. When you get to that point, the ticket can be resolved.
And now comes the interesting part, because just because you resolved your incident doesn't mean you're done. The last part, which mostly gets ignored and which is why we designed this process in the first place, is that there's more work to do after resolving the incident — and that's mostly improving on the root cause. There was something that caused the incident: not just the thing that actually exploded, but whatever made it possible for it to explode in this way in the first place. Sorry? Do you mean you keep the ticket open for a month or three? Yeah, that's the thing. There are teams that create a new ticket, it gets put in the backlog, and then it might disappear in the backlog. But the recommendation is not to do that, and instead to keep the incident open, in this resolved state — that's what we do — until you've actually prevented the recurrence of the incident. Because otherwise you accept the fact that it will happen again. That's basically what it is: if you're not improving on the root cause, you accept that it will happen again. Now I'll give an example, something that happened three weeks ago. Spot instances got more expensive on AWS — the price went up and hit the on-demand prices. The tooling we use is docker-machine; you need to give it a limit, a price limit for how much you want to pay for spot instances. You need to configure this because the tool requires it, and it has a default that's kind of weird, so you have to set a limit. Our limit was too low, so we didn't get any spot instances. That's always beautiful: GitLab jobs did not run anymore. Okay. So the first fix was to secure-shell into the GitLab runner and change this variable in the config file. That works: vim, regular-expression replace-all, boom, they spawn again. That is going from active to mitigated. The second stage was to configure it properly in whatever GitOps solution you use — in our case it's a deployment repository: pipelines, reviews, configuring it properly, it deploys. And the ticket is still open. If you click on the link — I created the ticket this morning, not three weeks ago, so that tells you something about how much we stick to the process. What is still open about it is that docker-machine should not require you to specify this limit in the first place, because that comes from a time when AWS spot instances were actually bid on. Nowadays spot instances are not done with bidding: they have a price, you take them or you leave them, and that price stays below on-demand anyway. So the root-cause fix, so that this will never happen again, is basically to remove all these limits that you have configured. And now, for the last five minutes or so, I will ask you something. I don't know how many of you are Red Hatters, and how many of you read email — we have enough mailing lists internally — but anyway. So there was an incident at some company where TLS certificates were renewed. Before the renewal, the certificate was issued by an external CA; the renewal was done, but the CA used was an internal one. So any customer now needs to have this internal CA configured on their system to connect to this site. This is basically what happened: it broke all the customers that needed to connect to this site, and it surfaced on a mailing list. This is just a hypothetical example.
And now the game is: what do you think the fix would be, or what actions would you take as a team, going from the active state — where basically somebody complains on the mailing list — to closing the incident ticket? That's the interactive part. Okay, so the first answer is: secure-shell into the machine and use certbot. I can tell you it doesn't work, because it was a service that isn't SSH-accessible. Next one. Yeah — so the answer is to roll back to the old certificates. And that is exactly what was done. They rolled back, because the certificates had been renewed but the old one was still valid — it was from the public CA, so it restored access. And doing this would mitigate the incident: customers are now able to connect to your service again. It's only mitigated, because if you don't do anything else, it will break again when the certificates expire — there was a reason for the renewal. So what would be the next step that people would need to do? The answer is: use the right CA and create new certificates. Did I understand that right? No — you would provide the internal CA to the customers? That would be one way, yeah: you could distribute the internal CA to all customers. So yes, that would be one way to resolve the incident: track down your customers and give them the internal CA. What happened in this case is that they rolled new certificates with the public CA — they could do it the first time around, so it was done correctly the second time. So basically you renew it correctly. That would move it to resolved. Now, what would you need to do to actually close the incident ticket, so that it never happens again? Policy? Right, a policy related to this — so one answer is to write a policy so that people do it correctly. Okay, what else could you do? Write automation — yes, write automation so that these certificates are renewed automatically and nobody needs to touch them. That would be the second part. What else would need to be done — what else went wrong? Monitoring. Yeah, the answer is add monitoring, because here they only found out after the users complained on the mailing list. So there are roughly three steps that you need to do after you resolve the incident to actually close it. And that's basically it — these are the answers, and this is all there is to the whole process. If you take anything away from this talk, it's that when you've resolved an incident, it's not yet closed: there's work to be done. And if you don't do this work, if you move it to another ticket and put it in a backlog, you accept the fact that it will happen again. And there is a social aspect to the whole thing, and you need to account for it. The phases are defined, they make sense, they are necessary, and you shouldn't skip them. If you have phases, you can delegate them to other people: the senior person can respond, and somebody else can handle the rest, so there's learning involved. You can track them — but make sure you don't drop these issues from your view, because otherwise, next year when these certificates come up again, you might make the same mistake. And yeah, this thing actually works.
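On the monitoring answer from the exercise, a small sketch of the kind of check that could have caught this earlier: measure how many days are left on a site's certificate and alert when it gets close (the host name below is a placeholder). Because this opens a verifying TLS connection, it would also fail outright if the certificate chain isn't publicly trusted, which is exactly the internal-CA mistake in the example.

```python
import socket
import ssl
import time

def days_until_cert_expiry(host: str, port: int = 443) -> int:
    """Connect to the host over TLS and return the number of days left on its certificate."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expires_at - time.time()) // 86400)

if days_until_cert_expiry("service.example.com") < 14:
    print("certificate expires in less than two weeks -- alert somebody")
```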
So the process — this is something where we have a Kanban board, and we move these issues along. And if you tell me that there are a lot of active incidents on it: yes. But it is totally worth thinking about these processes, defining something, and trying to surface and track them all the time. And this is one of those tickets we have — it's basically an example. You see there's still stuff outstanding: it's a resolved ticket, but it still needs work. And it's still very annoying to have this on the board — but that's kind of the point, it should annoy you. The process should make things visible. Okay, so that's it. Do you have any questions? — Probably on the social aspect: how did you convince everyone that this was important? So the question is how we convinced people that it's important. We have something called a request for comments, so there was a process that was agreed on. And I think in this case we didn't write it down, but it basically started from a problem statement: these incidents needed to be handled and we needed to involve the team. Because one of the things that comes up here is that if you as a senior engineer just fix this stuff yourself, you will not fix it as well as when you talk to the people on your team. So that was one of the reasons we tried to implement this: to make it visible. Did we convince all of them in the beginning? No. Are they convinced now? I'm not sure. But yeah. So the question is whether we have SLAs or SLIs — an indicator of how we're doing with these incidents. And no, though you could create one out of these tickets. For us, the main thing was that they would actually be visible and that there wouldn't be too many tickets in the resolved column. That has happened. But as you can see — we switched to this process a couple of months ago — there are a lot of tickets in the active column; these are weird issues that sometimes happen and are hard to track down. So no, we haven't done it; that would be the next thing to do. But just having an SLI never really gets the work done either — it's the agreement of a team to work on it that matters more than a number symbolizing it. Would an SLI be part of that eventually? Yeah — so the question is that using an SLI would allow you to see how you're doing over time. I totally agree, it would be nice to see that; I'm just not too sure what would come out of it. So the question is whether there is a priority for incidents. There's no priority between incidents — they are all labeled very similarly. The priority, in some way, comes from where they are on the board, and of course it's also related to whatever you agree on as a team. But normally, incidents are things that should not happen. I forgot the exact quote, but somebody said: if you don't handle incidents correctly, you are basically breaking the promise to your customers that you care about their experience — because they tell you, or something breaks, and you're not making sure it doesn't happen again, which means that whatever you do instead, you consider more important than whatever broke their workflow.
Now obviously, as you can see on our incident board, there's just stuff in there. So it's hard, but this is what it comes down to. Yeah, so what do we do — it might be a real question. So the question is whether we have any knowledge base to prevent recurrence. We have something — an operations manual where we put instructions on how to fix things — but the focus is really on preventing these things from happening again in a structural way. Most of the time there's something you can do to prevent it. Some architectural thing: I don't know, you need to change two pieces instead of one, and then sometimes you forget to change the other piece and it explodes — that's an architectural issue, and moving to one source for the configuration, so that it gets used in both places, would be the fix. Then there was a comment from the audience: there is a stack that helps with detecting and tracking incidents — Grafana OnCall and Grafana Incident — which attaches to these components and can create issues related to them, so when an incident comes in, you know it has happened before and can pull information from the previous occurrences. So, if I understood correctly, there's a Grafana stack that can be used to link incidents together. Okay, thank you very much for your attention. I hope you enjoyed it.

My name is Miroslav Suchý, I come from Red Hat. I was working on SPDX-related stuff for about a year as a side project, and from that comes my interest in the software bill of materials, because I hit it quite often and I wondered what the hell it is. So before I start: how many of you have heard about the software bill of materials before this presentation? Okay, almost everyone, I guess. And how many of you know what a software bill of materials is? Okay, we have two experts who know more than I do, so note who raised their hand and ask them questions after the presentation, not me. So I promise to deliver the explanation in very easy terms — and forgive me if I'm not precise — this is for the dummies, for the entry level, as I am myself. So the software bill of materials is very easy: the best analogy I've heard is that it's the list of ingredients the product is made of. It describes which ingredients the product is made from, and usually the order is from the most important material to the least important material in the product — like on a shampoo bottle, where the first ingredient is what the product mostly consists of. But it doesn't show you, and doesn't tell you, how the product is made. It's not the recipe: it doesn't tell you whether you put everything in one pot and cook it all together, or whether you follow some precise steps. The software bill of materials in the IT world gives you this picture, nothing else. It doesn't tell you how to handle this teeny part, this problematic part; it doesn't tell you what the big blocks are, or how to work with them.
It just gives you this picture, and then it's up to you. This is maybe actually the biggest problem of the software bill of materials: it's not a solution. It gives you just one thin layer which you can peel back, and underneath is a huge pile of other problems you may need to solve — what to do with the dependencies, what to do with the vulnerabilities, how to replace something. So it's a window into other stuff which you previously didn't know about and didn't care about, and now you probably should care about it, because of security and so on. So it's just a map of your product, not a solution for what to do with it. Back to the food analogy, to show you what can happen with your product and how the software bill of materials may help you. This is my favorite sauce from my local supermarket, from Lidl, and I've been hit by this several times. You buy it — it's a nice tomato sauce for pasta, nothing on the label that might warn you. These are the ingredients — it's in Czech — and it starts with the most important stuff: tomatoes, tomato sauce, then some onion, and then some silly stuff at the end. But the silly stuff at the very end is the chili, and it's actually so spicy that my daughter can't eat it, so I have the whole pot of pasta to myself. It's very, very spicy, and it doesn't say on the front "hot Italian pasta sauce". So you have to have the bill of materials — the ingredients — to actually find that out. And the very same thing goes for the software bill of materials: only when you see the big picture, that XKCD diagram, can you find out, okay, this part at the very bottom is tiny but supports all these big blocks, and maybe it's problematic, and maybe we should do something about it in the future. So what can a software bill of materials look like? It can look like this: this is a software bill of materials for some made-up company. It just lists what you are using in your project, in your company, but it's not interchangeable: if you send it to another company, they can't merge it, they can't really process it. So it doesn't work very well in our current distributed world, and we probably want something else. Around 2020, 2021, we had some incidents — the famous SolarWinds one, and Microsoft had some issues as well. So the US issued the Cybersecurity Executive Order, which said: you have to use a software bill of materials. It will help you with auditing, give you the map of what you have in your system, help you find vulnerabilities in those teeny boxes at the bottom, and it can help you find the licensing of the project — because, again, some teeny box at the bottom can have a strange license which prohibits everything on top of it. Since then it has become a thing, and people have started caring about it because they were forced to — and it's actually a good thing that they were forced. The Executive Order says that a software bill of materials should have at least these fields — there are other fields, but they can be optional. So you have to have the supplier name, the component name, the version, and so on. And from that moment, we have two standards that are not really competing — two coexisting standards. One of them is the SPDX standard. It started around 2011 as a license auditing tool — its origins are in licensing and license management — but it then grew into a full software bill of materials format.
Then we have SWID tags, which are not really a software bill of materials format; they just allow you to identify a component. And the last one is CycloneDX, which is more recent. It comes from the DevOps world, it's very lightweight, and it focuses on describing which component has which vulnerabilities and whether that is acceptable for you or whether you should upgrade some components. So SPDX and CycloneDX have different goals and different audiences, and both are fine — if you want one for your project, either of them will do.

How does a software bill of materials actually look? This is an example of an actual SPDX document. You can see there is a header preamble, and then something that describes the component — here a package named hello — with some information about it. If you come from the Red Hat or Fedora world, it may remind you of the RPM preamble section, because most of the fields are the same. Then we have the list of files, again something you can easily get with rpm -ql, and then some identification. So if you are using RPM or any other package manager, this is very easy to retrieve with a few rpm query commands, and within 15 minutes you can have a software bill of materials.

From my point of view, the most problematic part is this one: the license. In this case the declared license is MIT, which means the upstream author of the package declares it is under the MIT license, while the concluded license is NOASSERTION, which means I didn't verify anything — I just took the package, passed it on, and didn't try to audit or confirm the license myself. This is something we are now trying to change in Fedora, because right now we can't even do it properly: Fedora has been using an old system of identifiers, which we now call the Callaway system because it originated with Tom Callaway, who was the legal person in Fedora at the time and who invented the license identifiers. He said, okay, GPL version 2 should have the identifier GPLv2, and that was it. But it was not a standard; nobody except Fedora actually used it. So now we are moving to the SPDX license list, the standard one, and we are re-auditing what licenses the packages actually use. This is where you may have heard about SPDX in Fedora: I send out statistics every two weeks on how the conversion to SPDX identifiers is going. This is the current burn-up chart, and hopefully we will be finished next year after the summer — with some miracle, maybe even faster. Right now I and a few other people are focusing just on the licensing identifiers; there is a colleague from Product Security who works on the software bill of materials itself. For the first time in history we have the licensing data in a machine-readable format — previously, in the Callaway system, it was just on the wiki, described in an HTML page. Now we have the licenses in JSON and TOML files, with attributes saying whether they can be used for anything or only for content, fonts, et cetera. We even have a formal grammar, so you can build your own parser and decide what is good and bad.
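To make the "few rpm query commands" remark above a bit more concrete, here is a minimal sketch in Python that pulls the fields shown in the SPDX example (name, version, declared license, file list) out of the local RPM database. It produces a simplified, SPDX-flavoured dictionary, not a complete or validated SPDX document, and the package name is just an example:

import subprocess

def rpm_query(package, fmt):
    # Ask the local rpm database for a single field of an installed package.
    return subprocess.run(
        ["rpm", "-q", "--queryformat", fmt, package],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def minimal_sbom_entry(package):
    # Collect roughly the fields shown in the SPDX example above.
    files = subprocess.run(
        ["rpm", "-ql", package],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return {
        "PackageName": rpm_query(package, "%{NAME}"),
        "PackageVersion": rpm_query(package, "%{VERSION}-%{RELEASE}"),
        "PackageLicenseDeclared": rpm_query(package, "%{LICENSE}"),
        # We did not audit anything ourselves, so the concluded license
        # stays NOASSERTION, exactly as discussed above.
        "PackageLicenseConcluded": "NOASSERTION",
        "Files": files,
    }

if __name__ == "__main__":
    print(minimal_sbom_entry("hello"))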
One interesting complication is that we changed the "effective license" evaluation, which slowed this migration down considerably. Previously — and this is a real situation — there was a package, a Perl package, which said: this package can be licensed under any open source license. Under the old rules you would then say, okay, I'm choosing GPL version 2, and put that into the license field of the package. Now the guidelines say you should not do that evaluation yourself; you should actually list all of them. That produced the longest license string across all the RPM specs: GPL version 1 OR GPL version 2 OR GPL version 3 OR WTFPL OR MIT OR so on — about 800 characters long. There is also the fact that SPDX, while not exactly young, is definitely younger than Fedora, so Fedora knows about licenses that are not in the SPDX list, and we are pushing hard to get them added; a lot of licenses have recently been added to SPDX because of Fedora. If there is a license that is not in SPDX and that for some reason they don't want to add, you can use the standard LicenseRef-something mechanism: it marks the license as your own, it will never be in the SPDX list, and it's up to you to describe what it means.
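To give a feel for why those long OR-ed license strings become more manageable once everything uses SPDX identifiers, here is a deliberately tiny, hand-rolled check in Python. It only handles flat OR/AND expressions and uses a made-up allow-list, so it is an illustration of the idea, not a replacement for the real Fedora license data or a proper SPDX expression parser:

import re

# Hypothetical allow-list; the real answer lives in the Fedora license data.
ALLOWED_IDS = {
    "MIT", "GPL-2.0-only", "GPL-2.0-or-later", "Apache-2.0",
    "LicenseRef-Fedora-Public-Domain",
}

def split_flat_expression(expr):
    # Split a *flat* SPDX-style expression on OR / AND.
    # Real expressions can nest with parentheses; this toy ignores that.
    return [tok.strip() for tok in re.split(r"\s+(?:OR|AND)\s+", expr) if tok.strip()]

def unknown_identifiers(expr):
    # Return the identifiers in the expression that are not on the allow-list.
    return [tok for tok in split_flat_expression(expr) if tok not in ALLOWED_IDS]

if __name__ == "__main__":
    expr = "GPL-2.0-only OR MIT OR WTFPL"
    bad = unknown_identifiers(expr)
    print("OK" if not bad else f"unknown identifiers: {bad}")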
Now, you may be thinking: okay, thank you Mirek, you explained what a bill of materials is, I get it now and I don't need anything else. But this was just a crash course, and what I described — and probably what you had in mind — is essentially the bill of materials you create when you are building a package or a container. You can actually have various types of bill of materials. For example, a build SBOM may describe which GCC was used while building the package, which version it was, and whether there was a known vulnerability in it at the time you built the software. A source SBOM can tell you, for instance, whether the source was hosted on GitLab or GitHub at a time when there was a security incident there. A deployed SBOM describes whether your software is deployed together with other software that, in combination, may cause problems. And a runtime SBOM covers what is actually needed at runtime: your software may be using S3 buckets from Amazon, which is not described anywhere else and may be a problem, so the runtime bill of materials captures that relation and dependency on S3 — and whether it is really the Amazon service or some small-shop provider that may have its own vulnerabilities. It depends on what you are actually using, and there are other types which I won't cover here, because then it would no longer be easy and no longer "for dummies".

So, to conclude: what have we learned? That an SBOM is just a map, a list of the ingredients your software is made of. That we have two parallel standards, SPDX and CycloneDX, and both are acceptable. That a bill of materials is actually easy to generate if you are using RPM; if you are using containers built from scratch from GitHub and deployed directly, then you may have some problems. That licensing is probably the trickiest part of it. And that we have various SBOM types, and the rabbit holes go deep. That's it — any questions for me?

So the question is what happens if we use two pieces of software, one under BSD and one from the kernel, which are incompatible. I can answer it for Fedora, but I can't really answer it for the software bill of materials, because the software bill of materials doesn't care — it's just a map. You may say: I have this whole deployment, one part is BSD and one part is GPL version 2, and that is simply recorded; whether it's fine, or whether there is some proprietary software or some hidden secret nobody sees, the SBOM doesn't judge. The customer may be fine with it or not — it's just a map, it doesn't tell you what you should do; that's up to you. In Fedora: if they are different components, it's fine; if they are linked together, it's not fine, and it becomes a question for Legal, and I'm pretty sure that answer will not be straightforward. I don't know.

So, for the record, the question was what I consider most important for generating an SBOM, right? I'm not sure I can answer it, because I come from the licensing side; the SBOM for me is an additional thing I was simply curious about, and I don't work directly with the software bill of materials documents, so I don't even know whether SPDX or CycloneDX is better. But what I found interesting in one discussion is that you should have tools that generate it. I'm not even thinking that people should write it by hand — it should be fully automated, and there is one initiative trying to provide tooling so that an SBOM can be added to every software project. And here is the interesting part: when you retrieve a tarball from upstream together with its software bill of materials, you may or may not trust it. If you don't trust it and you want to validate it, you need a tool that generates your own bill of materials so you can compare the two and see whether the upstream one is valid. But then, if you have a tool that can generate it on your own machine, you probably don't need the bill of materials from the vendor or the upstream at all. It's an interesting situation.
But definitely, everyone — the author, the vendors, the customer — should have tools that can generate the bill of materials, and they should all produce the same output. That, for me, is the most important part of this.

Okay, so the question is what this graph actually is. It shows our migration of licenses from the old Callaway system to the new SPDX format. It started in December of last year — that's the zero point — and the blue line shows how many packages have already been converted, with an estimate of how long it will take at this pace to reach one hundred percent. The yellow part shows how many trivial conversions are available. I'm in a group with two lawyers, and they probably hate me for calling these "trivial", because the audit itself is not trivial — you still have to evaluate the licensing. They are trivial in the sense that the license in the old Callaway system is, for example, GPLv2, and there is exactly one identifier in SPDX format that corresponds to the old one. The red part, on the other hand, is terra incognita. It means that more options exist — for example, the old Callaway system had a single BSD identifier covering both BSD 2-clause and BSD 3-clause, and in SPDX we have to choose between them because they have different identifiers. With MIT the situation is even worse, because MIT in the Callaway system represented, I think, eight or ten SPDX licenses, and it may even hide licenses that don't have an SPDX identifier at all, which you then have to apply for — and it can take a week or two months to actually get it from SPDX and into the Fedora license data. So if you want to work on that, you should probably start quite soon.
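Coming back to the trivial versus non-trivial conversions on that chart, here is a small Python sketch of what the distinction means in practice. The mapping below is only a hypothetical fragment for illustration; the real conversions come from the Fedora license data and the tooling mentioned later (license-fedora2spdx and friends), and anything ambiguous still needs a human audit of the package:

# One old Callaway identifier maps to exactly one SPDX identifier: trivial.
TRIVIAL = {
    "GPLv2": "GPL-2.0-only",
    "GPLv2+": "GPL-2.0-or-later",
    "ASL 2.0": "Apache-2.0",
}

# One old identifier could mean several SPDX identifiers: someone has to look.
AMBIGUOUS = {
    "BSD": ["BSD-2-Clause", "BSD-3-Clause"],
    "MIT": ["MIT", "MIT-0", "X11"],  # illustrative subset only
}

def convert(callaway_id):
    if callaway_id in TRIVIAL:
        return TRIVIAL[callaway_id]
    if callaway_id in AMBIGUOUS:
        raise ValueError(
            f"{callaway_id!r} needs a manual audit; candidates: {AMBIGUOUS[callaway_id]}"
        )
    raise KeyError(f"no known SPDX mapping for {callaway_id!r}")

if __name__ == "__main__":
    print(convert("GPLv2"))     # -> GPL-2.0-only
    try:
        convert("BSD")
    except ValueError as err:
        print(err)              # -> needs a manual audit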
As for whether all packages in Fedora have an SBOM attached: as far as I know they don't. In Fedora we generate an SBOM for the containers, where we take the container as a single unit, one file; as far as I know we don't dive into the container itself, so it's a very simple bill of materials.

Yes, so the question is why the trivial conversions are not simply done by automation. Because I'm in a group of four — two lawyers and two engineers, me being one of the engineers — and everyone else there insists we should evaluate it manually, because at the same time we are changing the meaning of the license field. Previously we could evaluate an "effective" license; now we should not, so the license string in some cases actually changes even when it would otherwise be a one-to-one conversion, because you add more licenses with an operator, license one OR license two, for example. And sometimes the license evaluation was done a very long time ago and may no longer be true. That actually happens, and it's not rare: last week I was trying to convert RPM itself, the rpm package, and it was not straightforward. The RPM website says GPL version 2; the COPYING file says GPL version 2 with some exceptions for rpmio and rpmlib; and the license string in the RPM header says something different again. So we have three answers, the issue is still open on RPM's GitHub, and I'm discussing with Panu, Miro Hrončok and Neal what the final concluded license actually is. It's not straightforward even in the "trivial" cases — trivial is not so trivial.

The next question is whether it is a problem that we have two standards, and whether you can convert from one to the other. I don't know — I like the idea, but I have never worked with these formats directly. As I mentioned, I come from the licensing side, so I was just poking at what's around me in the licensing part.

And the question of whether you, as a Fedora package maintainer, can help: yes, yes, you can, and please do, because you know your package well. You know whether it's an old beast like RPM, which is twenty years old or more, or something new; whether what upstream says is true; which files are actually used; and whether it's really trivial, so you can convert it from GPLv2 to GPL-2.0-only and be done in five seconds, or whether you need to audit it and know which files might be the problematic part. You know it much better than we do. We have tools like license-fedora2spdx, which can help you convert the strings, and other tools like scancode, the askalono CLI, and licensecheck, which can help you audit the files. This is where you can help, and we are organizing workshops, so if you're hesitating over something we can help you — but you know more than we do, so please help us.

The ultimate goal of doing this in Fedora — converting the license strings to SPDX identifiers and the rest — is that any future software built on auditing or managing licensing will be built on an industry standard, and for licensing that standard seems to be SPDX. We don't want to build something short-lived that only understands the old Tom Callaway system; we want to use the new, industry-standard thing that other tools and other people will use, and have Fedora there first. Any other questions? Okay, no other questions. If you have any later, find me, and if you want help converting your package to SPDX identifiers, let me know and I will do my best. Thank you.

Are we live? Are we live? Yes, you're live. Perfect.
Welcome, everyone, and first of all thanks to the DevConf sponsors for getting us here today. I'm here with my colleague Alessandro; we are part of Red Hat's site reliability engineering team, working on managed services with a twist, which we'll come to later. Our talk is about how we manage to retain some of our sanity in what we do, and about what we've built over the past few months. So, Alessandro, please.

Okay, so my name is Alessandro, I'm a senior SRE at Red Hat, and let's start our talk with an example application that will carry us through the whole thing. We have the three standard layers — UI, API and DB — and here is an example of the resources we might deploy on a Kubernetes cluster for it; this is just an example, in reality there would be more. Let's focus on the three green ones: two Deployments for the UI and the API, and one StatefulSet for the DB; that will be helpful later. Let's assume you already have this beautiful application running on your Kubernetes cluster, everything works fine, life is good. But then at some point the dev team wants to deploy an update for your application. As an SRE you don't have big processes in place yet — you've just started — and you get from the dev team the good old folder with some YAML files containing the Deployments and the StatefulSet for the database. You look at it, it all seems fine, so you accept the challenge and you deploy it. What do you do? You use your beloved kubectl command, you apply the whole folder, and after a few seconds everything seems fine, so you're very happy about how cool Kubernetes is and what a beautiful tool you're using. But you're a smart person, so you go and check whether it actually worked: you look at the Deployments and the StatefulSet with kubectl, everything is running fine, so now you're super happy and you go grab your coffee.

A few moments later, when you get back to your laptop, you look at your Slack icon, and, well, it looks like this. You check one of the chats and there's your annoying dev telling you: hey man, the new UI is down, what have you done? Panic starts. You go and look, and what the hell — the pods for the UI are all crash-looping. What's going on? The panic continues, but then you have a moment of sanity and you remind yourself that you can undo what you just did, right? So you regain your calm and you simply roll back the Deployment for the UI, and you think everything is okay — but after a few seconds you're panicking again, because it's still crash-looping. What the hell is going on? You write to the same dev in the Slack chat, and the dev tells you: well, the UI doesn't work with the new API, because we made some breaking changes there. You're still panicking, but you think: I can do the same thing again. So you roll back the API Deployment as well, hoping the database had no changes. You try it with the same command as before, and it actually rolls back, and at this point you regain your calm because everything is running fine again. Luckily the database had no changes, so rolling everything back solved the problem, at least temporarily. This is a typical day in the life of an SRE: things keep breaking seemingly at random and you don't know why.

But let's look at what actually happened, at the chain of events that brought us here. We received some YAML files from our dev team, we trusted them, and then we used our kubectl tool.
In the end, we could have also used Kustomize or Helm to do the same thing, and those tools would have done exactly what kubectl did: they just push resources to Kubernetes and let Kubernetes do the reconciliation. I say "just" because that is what they are designed for — they worked as expected. And now I have a question for you: which step is to blame for the failed update? The developer? The YAML you trusted? Exactly — well, no, the YAML was fine, but things kept breaking for some reason. The content of the containers may have been broken, but the reality is that in this picture none of these steps is to blame for the failed update. And that is a real problem from an SRE point of view, because our tools can rightly say "we did our job, don't look at us", yet as an SRE the update still failed, and that failed update could have caused our SLOs to fail as well — and the SLOs are what actually matters in the end for an SRE. So we have a problem. But the SRE mindset is to analyze the problems we face and try to turn them into ideas to improve our workflow and our tooling, so let's analyze the problem we just had.

First: loosely coupled objects. Every object we push to the Kubernetes cluster is loosely coupled to the others. What if instead we had something that allowed us to bundle those resources together into one package? (Spoiler.) Second: the status of all these loosely coupled objects is distributed and heterogeneous, so it's hard, and it takes time, to see whether the Deployment and all the other resources actually succeeded. What if instead we had an aggregated status for this bundle of objects that tells us whether the update actually worked? Third: the all-at-once rollout. You push all the resources at the same time and let Kubernetes do the reconciliation, but sometimes that's not enough. What if instead we could have an incremental, controlled rollout of those resources — for example, first deploy the DB, and if it works deploy the API, and if that works deploy the UI? And the fourth and last point on this slide: complex rollbacks. We saw that Deployments have an easy rollback mechanism, but what if we could have the same concept for any type of resource we deploy to the Kubernetes cluster? That would be very nice. And now I'll hand over to Niko for the second part.
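Before moving on, here is a rough sketch of the "aggregated status for a bundle of objects" wish from the list above, using the Kubernetes Python client: read the UI and API Deployments and the DB StatefulSet, and roll them up into a single answer. The resource names and namespace are made up for this example; they are not part of the talk's actual setup:

from kubernetes import client, config

def deployment_ready(apps, name, ns):
    d = apps.read_namespaced_deployment(name, ns)
    want = d.spec.replicas or 0
    st = d.status
    return (st.available_replicas or 0) >= want and (st.updated_replicas or 0) >= want

def statefulset_ready(apps, name, ns):
    s = apps.read_namespaced_stateful_set(name, ns)
    want = s.spec.replicas or 0
    st = s.status
    return (st.ready_replicas or 0) >= want and (st.updated_replicas or 0) >= want

def bundle_status(ns="demo-app"):
    # Aggregate the three loosely coupled objects into one yes/no plus details.
    config.load_kube_config()  # or load_incluster_config() when running in a pod
    apps = client.AppsV1Api()
    checks = {
        "ui": deployment_ready(apps, "ui", ns),
        "api": deployment_ready(apps, "api", ns),
        "db": statefulset_ready(apps, "db", ns),
    }
    return all(checks.values()), checks

if __name__ == "__main__":
    ok, details = bundle_status()
    print("bundle ready" if ok else f"bundle degraded: {details}")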
Yeah, so I'm here to tell you why we can't have nice things in what we do, because of scale and compliance. Scale — yay, everyone loves scale — and compliance, boo, right? If we take the ideas Alessandro just introduced and try to turn them into solutions, there have been some excellent talks over the last few days about exactly this. We need to bundle our resources together? Then we want something like GitOps: you put all your stuff into a repository, and if you need to roll something back you change something in the repository, magic happens, and it appears in production. You have a problem with status, you don't know what's going on? You have monitoring — at least I hope you have monitoring — and there are solutions for that. Even for more complex processes like incremental rollouts you can rely on Argo CD or similar tools; there are plenty of CI/CD tools out there with plenty of instructions on how to set up sensible rollout strategies. And for rollbacks, you can make backups — Velero is a nice tool for Kubernetes backups — or there was a nice talk here a few days ago about Argo Rollouts, which also includes automatic rollbacks. Super nice.

So can we have nice things? Well, yes, but actually no. All of these projects are super nice, I highly recommend them, but I can't use them. First, because of the scale we're operating at: my team is not operating tens of clusters or a few hundred clusters, we are tasked with operating across thousands of clusters, closing in on ten thousand. Admittedly, not all of them have managed services deployed that we take care of, but all of these clusters still need the ability to install managed services at any given point in time, because in the end our customers don't care how we make the magic happen. They just want their push-button deployments: they swipe their credit card, get their quota, and start installing stuff, and they don't care that our backend infrastructure has a hard time coping with thousands of clusters. This is where things become tricky, because the commonly available tools just don't quite work at this scale.

And then there is an even bigger problem: those are not our clusters. They belong to our customers and run in their AWS accounts, their cloud accounts. The music stops here, because this is a big one: one does not simply send data out of the cluster. We can't just take whatever data we want and ship it outside, because even something as simple as a namespace name — basically just a folder name in Kubernetes — might contain sensitive information about a customer's next big project. You can't just take that and put it into our management systems. We also can't install arbitrary open source projects there — and Argo CD is not a random project, but we can't install it on the customer cluster because the customer might be using Argo already. We can't grant ourselves arbitrary permissions either, because some of our customers get really upset if you hand over the keys to the kingdom and give everyone permissions. And we can't go around proxies: we have a few defined ways to communicate with those clusters, and otherwise they can be really isolated. So, in the end, one does not simply walk into Mordor if you don't own the place. That's essentially why we couldn't use the available open source tooling and instead set off on the adventure of building our own thing. And Alessandro will now show how we broke the problem down into smaller chunks.
Thank you, Niko. So, if you read the title of this talk carefully, we spoiled the name of the thing we built: it's called Package Operator. Let's look at a very high-level overview of what is inside Package Operator and how it tries to help with the problems we highlighted before. As I said, Deployments are nice because they have characteristics that help us deal with pods — but what if we could have a "Deployment" for whatever we want to put on the cluster? (I think I've lost focus on the window... sorry, you stole my focus.) What we came up with in Package Operator is a concept similar to ReplicaSets. ReplicaSets handle pods; what we would like is a ReplicaSet for anything. So we came up with the ObjectSet resource, which is part of Package Operator and is able to reconcile a bunch of arbitrary objects and aggregate their status via probes. For every ObjectSet you can watch the status and see whether those objects are working as expected. It's important that it is immutable and can be scaled to zero to be archived for rollback. This already addresses, or at least tries to address, some of the problems we saw before.

One important thing I want to point out is phase reconciliation, so let's look at it in detail. In every ObjectSet we can define different phases and assign the resources belonging to the bundle to one of those phases, in such a way that each phase starts only if the previous ones have completed successfully. How do we know they completed successfully? We define probes for each phase that check whether the phase is actually done. In this example we have two CRDs that need to be deployed. If you are not familiar with Kubernetes CRDs, they are basically a way to extend the Kubernetes API by defining extra entities — a rough parallel is creating new tables in a database — so you can define your own objects to work with. Here we have two CRDs, and we have probes that check that they are Established. They get deployed, and if the probe succeeds for both of them, the phase is considered complete, and only at that point do we move to the next phase, which contains a Deployment that relies on those CRDs. Because if we pushed everything to the cluster at the same time, the Deployment might try to become available before the CRDs are established, and something could fail because the new entities don't exist in the Kubernetes API yet. So after the first phase is done, the second one runs: the Deployment gets deployed, the probes check whether it worked, and then the second phase is marked as successful.

ReplicaSets are cool, but Deployments are more helpful for us because they manage some of the ReplicaSet internals in an easy way, so we created the same concept on top of ObjectSets and called it ObjectDeployment. An ObjectDeployment coordinates the transition between ObjectSets and keeps a limited history, so you can do rollbacks if something doesn't work. When it is updated, it creates a new ObjectSet and keeps the old one alive until the new one has rolled out successfully, using the probes you define. This is what actually solves most of the problems we saw earlier. Then, finally, we introduce another concept that I spoiled earlier, called packages, which helps people create those ObjectDeployments. A package is a single artifact that contains all the manifests, configuration and metadata needed to run an application. Much like RPM or deb packages, there is a build phase: you take the YAML definitions of your resources, put them in a folder with whatever structure you want, and add a manifest containing some metadata about the package itself; you can also optionally add a readme file and an icon, as you can see at the top of the list. Most importantly, on top of plain YAML files you can also use Go templating to create resource templates. Then you build the package, and everything gets packed into a non-runnable container image that you can store in whatever registry you want.
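The phase mechanism described a moment ago can be pictured roughly like this. This is only a schematic sketch in Python, not the actual ObjectSet CRD schema or controller code; the idea is simply that each phase lists objects plus probes, and the next phase only starts once every probe of the previous one passes:

import time

def reconcile_phases(phases, poll_seconds=5, timeout_seconds=300):
    # Each phase is a dict with an "apply" callable (push this phase's objects)
    # and a list of "probes" (callables returning True once the phase is ready).
    for phase in phases:
        phase["apply"]()
        deadline = time.time() + timeout_seconds
        while not all(probe() for probe in phase["probes"]):
            if time.time() > deadline:
                raise TimeoutError(f"phase {phase['name']!r} never became ready")
            time.sleep(poll_seconds)  # wait, then re-check the probes
        print(f"phase {phase['name']} complete")

# Example shape, mirroring the CRDs-then-workload ordering described above
# (apply_crds, crds_established, etc. are stand-ins for real cluster calls):
# phases = [
#     {"name": "crds",     "apply": apply_crds,     "probes": [crds_established]},
#     {"name": "workload", "apply": apply_workload, "probes": [deployment_available]},
# ]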
When you want to deploy that package, you create a custom resource — either a Package or a ClusterPackage, depending on the scope you want to give those resources — you specify the image you built before in the spec.image field, and Package Operator picks it up and creates all the ObjectDeployments and everything else for you. So that's a brief description of the internals of Package Operator, and now back to Niko for demo time.

Thank you. All of this might be a little abstract, but essentially our goal was to take all the smarts you find in Argo and other open source projects and ship them in one single operator onto those customer Kubernetes clusters that we manage, so we can instruct the on-cluster component to do the smart things directly on the cluster, without requiring outside systems that exfiltrate data or have to work at that scale. At that point you can abstract a lot of stuff away and offload it to the cluster directly. So, I hope this works. What you're seeing here is a local kind cluster — is the font big enough? okay, the font works — kind is Kubernetes in Docker, super nice for getting a Kubernetes cluster up and running quickly. I have Package Operator running here, and now I'm deploying something. This is the ObjectDeployment API that Alessandro just talked about, and what you see here is the nginx example deployment. We have one phase, the deploy phase — keeping it super simple for the beginning — and in it we have two objects: a ConfigMap and a Deployment, a normal Kubernetes Deployment for the actual workload. We also define how Package Operator can make sense of those objects, with declarative probes: here we say, select everything that is a Deployment and check that it is Available and that updatedReplicas equals status.replicas, which tells us not only that the Deployment is available but also that it is fully updated. Spoiler: there are a few versions coming, I'm making it extra interesting. So check the plan, create it, it's doing its thing, and it's immediately available because we preloaded all the images: there is a Deployment, there is a ConfigMap, there are pods. Super nice. What sets this apart from a lot of other solutions is that, because of those probes and everything being connected, if something crazy happens on that cluster we see the status reflected in our generic application deployment as we watch it. For us as SREs that is already huge, because without looking at any monitoring system or application-specific telemetry we can already say that something specific is wrong just by looking at this one resource. Kubernetes will heal itself and get things back to work right away, but if you stumble over this cluster because of an actual incident, having a single resource that tells you where things are wrong is already super useful.

So let's now update stuff; that's where things get fun. What I just did is actually several advanced concepts at once, because this is the same ObjectDeployment — I only patched it with different data. So we have a new release, but we also added a new, strange readiness condition: a new probe was added which says that, for this to succeed, an annotation needs to be equal to a key in the ConfigMap. A very convenient way of essentially saying: wait for somebody to do something.
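The declarative probes in this demo boil down to very small checks. Here is a rough sketch of the two of them as plain Python functions over the objects' fields; the field names follow the standard Deployment status and ConfigMap layout, and the annotation key is just an illustrative name, not one used by the real project:

def deployment_probe(deployment):
    # "Available" condition is True AND updatedReplicas == status.replicas,
    # i.e. the workload is both serving traffic and fully rolled forward.
    status = deployment.get("status", {})
    conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}
    return (
        conditions.get("Available") == "True"
        and status.get("updatedReplicas") == status.get("replicas")
    )

def configmap_gate_probe(configmap, annotation_key="example.org/approved"):
    # The "wait for somebody to do something" gate: an annotation on the
    # ConfigMap must equal one of its data keys before the phase may pass.
    annotations = configmap.get("metadata", {}).get("annotations", {})
    data = configmap.get("data", {})
    value = annotations.get(annotation_key)
    return value is not None and value in data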
If we look at what the deployment is telling us now, we see it's still Available — that's nice, we'll get back to that in a second — and it's Progressing. What is stopping it from progressing to v2 at the moment? The deploy phase is failing, because the ConfigMap's probe is failing, because that annotation does not exist on the object yet — and this could be any missing dependency of the deployment. Something else that's fun here: I renamed the Deployment to v2 on purpose, so — making the jump to a previous talk at this conference about progressive rollouts — this is essentially a canary deployment strategy. Right now we have two versions of this application running side by side until the new version passes all its rollout gates. They run side by side, so no customer would notice right now that v2 is blocked by something. Now somebody satisfies that probe: the annotation is set equal to the data key, save — pods start terminating, the v2 Deployment remains, and only the v2 ConfigMap is still here, because the new revision has passed all its probes and the old one can go away; the new version rolled out successfully. You might say this is maybe a little over-complicated just for upgrading things, but where it becomes super handy is — spoiler — updating it again. Mistakes happen, mistakes happen all the time, and when mistakes happen across thousands of clusters, stuff gets expensive. What happened here is a typo somewhere in the deployment pipeline: an image was referenced that doesn't actually exist, so Kubernetes can't roll it out — that's why we see image pull back-offs here; this will never work. But the v2 deployment is still here and in operation, and if we check the status information we see: the latest revision is unavailable, because the deploy phase is failing, because the Deployment's status condition is not Available. This is super useful for us, because imagine again you're an SRE being woken up at 3 a.m. because some cluster is failing — you can check this one resource and you see it. And now — this is where the demo ends — I update it again, and notice that v4 is running, v2 is gone, and v3, because it never worked in the first place, we could simply drop; v2 kept working until v4 took over, and now everything is fine again. We also wrote a small helper that gives us a rollout history on that cluster, so we can see that the first thing we deployed a few minutes ago worked; the second version worked eventually, after we patched it; the third one never worked, so we marked it in the history as never successful; and the fourth one worked in the end. And all of this with a single tool, without needing to set up anything else — it's not connected to monitoring solutions, it doesn't need data to be shipped off the cluster — and this is what we are already using in production in some limited cases and want to build more tooling on. So that's what we did; time for Q&A.

Come again? So the question was whether the operator is certified by Red Hat. It's not in OperatorHub or officially supported by us, because it's something we're only using internally. The project is open source, though, and we're doing our best to keep the open source bits installable and nicely documented, so check it out, and if there is enough interest I'm sure we will offer it in some capacity. package-operator.run — it's also in the Sched talk description. Anything else?
All right, let's repeat it: package-operator.run. Thank you very much.

Okay, hello everyone, and thank you for being here — it's the last talk of the conference. Thanks a lot for all the hard work you have been doing, and thanks to DevConf, it has been really amazing. I arrived on Saturday, so those of you who have been here since Friday, you are heroes. I would like to present some things we have been doing in an open source European project. We are going to talk about the edge-cloud continuum, we are going to talk about serverless, and we are going to talk about how to design FaaS applications with visual programming. My name is Jan Giorgio, and this is a presentation prepared with my colleagues George and Luis. This is the agenda: we are going to start with a high-level view of the PHYSICS project, then go into some detail on the design environment and how it is built, then dig a little deeper into how the infrastructure services work, look at some performance aspects, and finally discuss some optimizations we have been doing in terms of placement and scheduling.

So, the PHYSICS project — PHYSICS is an acronym; the full name of the project is, roughly, a hybrid space-time service continuum in FaaS, FaaS as in function-as-a-service. These are some of the main challenges we have tried to target with this European project. We want to abstract the usage of service offerings and clusters across the continuum — and when we talk about the continuum we mean cloud and on-premise or edge clusters, and when we talk about service offerings we mean how applications are designed in this type of environment. The users are not always experts, which is why we talk about abstractions: there may be data scientists who have specific expertise in building artificial intelligence applications but who may not know how to actually deploy them, let alone deploy them in a complex scenario with edge and cloud clusters. Another challenge is the adaptation of code to serverless computing paradigms — I'm sure you have heard of the open source serverless platforms and the services offered in the cloud; here we have taken a particular open source approach. Another challenge is the investigation of space and time: space in terms of the location of execution, and time in terms of the duration of execution. These are things we try to tackle, and taking the whole continuum into account makes them much more complex. Then there is the optimization of resource selection and operation — at different levels: the global level, which considers the whole continuum and the different clusters in parallel, and the local level, when we execute directly on a specific cluster. And finally, how to reuse the artifacts and components we are going to create. The goals of the project: a visual programming environment to create workflows, with patterns that can be reused and semantics that can be enriched, to make developers' lives easier and enhance the development experience; platform-level functionalities that allow us to orchestrate across clouds and providers; and local resource management so we can apply specific optimizations while the services execute.
Okay, for those of you who follow how Europe provides funding for projects: this is a Horizon 2020 project, a research and innovation action, with the project budget fully funded for all the participating partners. We are 14 partners, with large companies such as GFT, Atos, HPE, Red Hat and Fujitsu, smaller startups like Ryax Technologies, and universities like Harokopio and UPM from Spain. GFT is the project coordinator, and the project ends at the end of this year, so we have passed all the earlier phases and are now in the final implementation phase of the second iteration.

This is a high-level technical architecture of the project. For those of you who don't know, these projects are separated into different work packages; what you see here are the technical ones — there are others that are more related to how the project is managed or how exploitation is done. On the technical side we have the user layer, the design environment and the function DevOps phase — that's the top of the diagram. Then we have the global continuum layer with all the platform services; the T labels represent the different tasks, so within the work packages there are specific tasks that tackle particular functionalities or services. Alongside the platform services we have the semantics, the way to optimize placement across the different clusters, the data services and the orchestration, and then we reach the lower level, the local level of the infrastructure, where we do research on resource management and placement optimization at the level of a particular cluster, either edge or cloud. We have different pilots — large applications with real-world conditions. The first is related to smart manufacturing for increased resilience and interplay: Industry 4.0 scenarios with edge and cloud, trying to be as resilient as possible. Then eHealth scenarios with artificial intelligence and prediction, where data coming from specific patients can produce results and predictions about how they can be treated. And finally smart precision agriculture, where we consider digital twin scenarios and how different characteristics of smart farms can be taken into account — for example how water and pesticides are used and how they can be deployed optimally within the farm.

Now, going a little deeper into the PHYSICS design environment. What we wanted to do with it is, first of all, to enhance, abstract and enrich the way someone creates a FaaS application, including how the function environment can be customized: since this is initially something installed on your PC, we give the user the ability to customize it as much as possible, through Dockerfiles and so on. We simplified the creation of complex function workflows — those of you who have already used serverless or FaaS may have noticed that, for example, in Lambda you need to go through a big YAML file to provide the different characteristics of the functions you want to use, and we have cases where applications are composed of more than five functions, so things start to get complicated,
and imagine even more if you need to do some joins. With a visual environment this is something we address: you can see on the right that it's much simpler to do the programming this way rather than the other way. We exploit reusable patterns — I'll come back to those — we increase the semantics, and we completely abstract away the packaging, the testing and the deployment. For users who don't have those skills, who are more expert in the actual coding, having this abstracted away is a useful feature.

These are the baseline technologies we have been using. I don't know if you are aware of Node-RED; it's an open source programming environment for event-driven applications. We have taken it as-is and enhanced it with a specific palette of annotations and extensions that help the user execute across the continuum — I'll talk about some specific extensions later. It can be used in this context as a workflow orchestrator, but also with function execution abilities, and it can act as a choreographer for specific functions: you can either create your workflow and build it as an image, which is then deployed, or you can enable the execution of the different PHYSICS functions directly from Node-RED. In parallel we use OpenWhisk, the open source FaaS platform, which lets us deploy onto Kubernetes clusters the images we prepare through Node-RED.

This is how the local and cloud environments are designed, in terms of the life cycle of the artifacts. Initially you get three containers deployed in your local environment. Node-RED comes as it is, and then there are the specific patterns and annotations added by PHYSICS — we do not break compatibility, we just add them to the palette. There is the control user interface, and the serverless function generator, which provides the bridge towards the cloud environment. In the cloud we have the backend, which takes the code from the repository, uploads the flow to a bucket, and triggers a build job with the flow, which in the end prepares the artifacts to be deployed by OpenWhisk. This is the user interface you get from PHYSICS: through it you have function creation, build management, testing, logging and so on. There is a video — if we have time we can watch it afterwards — but I'll go through some of the characteristics here. We have the Node-RED view with all the different annotations and patterns we can use; by default the language the patterns are programmed in within Node-RED is JavaScript, but since Docker images are supported we can actually use different languages there as well. Once you have built your flow in the Node-RED environment, you can check the build from the admin panel of the PHYSICS application, and from there you can invoke a particular execution, which is then sent to the appropriate cluster to be executed. Logging information is also available directly within the PHYSICS application environment.
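For a sense of what the individual building blocks look like underneath all this, here is a minimal OpenWhisk action in Python: OpenWhisk calls the main function with the invocation parameters as a dictionary and expects a dictionary back. The parameter name is just an example, not something specific to the PHYSICS flows:

# hello.py -- a minimal OpenWhisk Python action.
# OpenWhisk invokes main() with the action's parameters merged into one dict
# and serializes the returned dict as the action's JSON result.
def main(params):
    name = params.get("name", "world")  # example parameter
    return {"greeting": f"Hello, {name}!"}

Such an action would typically be registered with the wsk CLI (for example, wsk action create hello hello.py) and then composed into larger workflows from the visual environment.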
You also have the ability to connect different flows within a single application — for example, if you want particular parts of an application to be executed on the same cluster, that is something you can do. What is interesting with PHYSICS is that we managed to extend Node-RED with specific patterns that are commonly needed in complex applications — applications related to machine learning, or applications that need to run in parallel in a multi-cluster environment. These are things that were not handled by Node-RED out of the box, and we added them for reusability, manageability and abstracted functionality. As you can see here: parallelization, which I'll talk about in a moment with a split/join scenario; context management, where a specific action or container needs to share context with another container in the same flow, so this has been created in advance; retry scenarios, for example when you want to save data to an object store; request management; and branch/join. These have all been introduced into the PHYSICS palette because they are used and demanded in more complex scenarios.

The fork/join parallelization case is quite interesting. The concept is that we start with an array of inputs; we split the work across multiple functions; the FaaS invoker deploys a number of containers based on that split; and at the end a join function aggregates all the results. This concept is implemented by a fairly complex flow, which is then packaged as a single function, so when you want to use it you just drag and drop this particular box and parameterize it to get the behaviour we implemented. This is actually quite significant, especially in complex cases such as machine learning, or wherever you want to improve the performance of a particular execution. There are also different options here: you can create separate functions, or different threads within the same function, or different processes within the same container, depending on where you are executing.
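Conceptually, the split/fork-join pattern just described is the classic scatter-gather shape: split an input array, run the same function on each chunk in parallel, then aggregate. Here is a local, self-contained Python sketch of that shape; in the PHYSICS case the parallel part would be separate containers or OpenWhisk invocations rather than threads, and the work function is only a stand-in:

from concurrent.futures import ThreadPoolExecutor

def split(items, n_chunks):
    # Split the input array into roughly equal chunks.
    size = max(1, len(items) // n_chunks)
    return [items[i:i + size] for i in range(0, len(items), size)]

def work(chunk):
    # Stand-in for the per-chunk function (in PHYSICS: one container/invocation).
    return sum(x * x for x in chunk)

def join(partials):
    # Aggregate the partial results back into one answer.
    return sum(partials)

def fork_join(items, parallelism=4):
    chunks = split(items, parallelism)
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        partials = list(pool.map(work, chunks))
    return join(partials)

if __name__ == "__main__":
    print(fork_join(list(range(1000))))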
Other semantic nodes worth mentioning are things like affinity and anti-affinity, which you may use quite often when you want functions to share a node or, on the contrary, to avoid being placed on the same node; locality of functions, where you may prefer that a particular part of your application executes at the edge, for example for data privacy reasons, or in the cloud if it needs high performance; an importance level of high, medium or low for prioritization; an optimization goal, since you may have different demands in terms of performance or energy; function sizing, for demanding particular resources; and ways to link particular data services. These are all requirements the architecture handles.

Now, going into a bit of detail on how things work internally in the infrastructure layer: we use Kubernetes — OKD or OpenShift — we use Submariner for connecting the clusters, Open Cluster Management (OCM) for the setup, and we have evaluated and are using MicroShift for lower-footprint Kubernetes devices; of course there is monitoring as well. We use OpenWhisk for function-as-a-service — we have also started evaluating Knative, but it is mainly OpenWhisk at this point — and we use Kubernetes operators and webhooks to manage how the components run on top of Kubernetes. This schema shows how things are set up in our multi-cluster environment: you see the different clusters, the hub cluster has OCM, Submariner provides the connectivity, and then the semantics come in, giving the description of the resources of particular nodes.

If we look at how function registration takes place: we start from a workflow CRD, which receives the initial request for what we need to execute; we get the object from a managed cluster; we trigger the reconciliation loop in the workflow CRD operator; and this registers the function in OpenWhisk and eventually stores it in etcd on the local cluster and in OCM, so we can track it afterwards. Once this is done, for function execution: the execution of a function is triggered in OpenWhisk, and if there is already a hot container, a hot pod, we can use it; otherwise a new pod is needed, and we go through the webhook, which adds specific annotations for a particular scheduler — something we will come back to, because we have implemented schedulers that make optimizations going beyond the image-locality logic in Kubernetes, taking the layers of the containers into account. Once this is done, we have the annotations for the particular scheduler, plus collocation annotations — whether we prefer collocation with a specific function or not — and then the pod for that function is created and deployed on a specific node.

Looking at some performance aspects, we have run experiments, and some were interesting to show here. We have triggered workflows as OpenWhisk sequences of functions; we have triggered workflows with fork primitives and intra-container parallelization, where the different functions are triggered directly by Node-RED in the local environment; and finally full fork parallelization directly through the OpenWhisk interface, which can deploy different containers across the different clusters. In these scenarios we measured the delays in each of the three cases, and it is interesting to see that, of course, in the Node-RED environment the delay is minimal — so if you want to test something very quickly, but with limited parallelization only within the same container, that is very fast — whereas in the two OpenWhisk cases we are talking about roughly 100 milliseconds of delay to deploy the actual containers.
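The roughly 100 milliseconds mentioned above is the overhead of going through the OpenWhisk control plane rather than calling code in-process. For reference, a blocking invocation against OpenWhisk's REST API looks roughly like this in Python; the API host, the namespace and the auth key are deployment-specific placeholders, not values from the project:

import requests

APIHOST = "https://openwhisk.example.com"  # placeholder for your deployment
AUTH = ("user-uuid", "secret-key")         # the wsk auth key split at ':'

def invoke_blocking(action, params):
    # POST .../actions/<action>?blocking=true&result=true runs the action and
    # returns only its result once it has finished.
    url = f"{APIHOST}/api/v1/namespaces/_/actions/{action}"
    resp = requests.post(
        url,
        params={"blocking": "true", "result": "true"},
        json=params,
        auth=AUTH,
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(invoke_blocking("hello", {"name": "PHYSICS"}))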
Some other interesting measurements relate to the space-time continuum, for example network latency. We had three different clusters: one we consider an edge cluster, a small machine located in Greece, and two others that are more like cloud machines, one on AWS and one on Azure. Through the experiments we wanted to check whether the configurations we had made were correct. For example, in terms of latency, we can see that the execution of these particular functions — this was related to the artificial intelligence cases of our eHealth application — was of course much faster in the edge case, since it was executed in Greece and the calls were also made from Greece; and the latency to reach the Azure cluster in the Netherlands was smaller than the latency to reach the other cloud cluster. Another interesting finding was a long waiting time on the Azure cluster, and that was because we hadn't configured the memory for OpenWhisk well: the machine had 32 gigabytes of RAM available, but we had only configured 2 gigabytes to be handled by OpenWhisk, and that explained the difference we saw between AWS and Azure.

Let me also talk about some specific extensions we have made in terms of scheduling. We have been using two different schedulers, one at the global continuum level and one at the local level, implemented in those two different layers. The first is about multi-objective scheduling, taking into account performance and energy; the second is about improving and minimizing cold starts by taking into account layer locality, not only image locality. This implementation is currently ongoing, and we are in contact with the upstream Kubernetes community in order to push these changes into the upstream version.

Last slide: one of the nice things about this type of project is that we have produced reusable artifacts. There is a marketplace people can use, where you can find the catalog of available artifacts — operators, Node-RED flows, data sets, semantic models and so on — and people can actually try them directly. So thank you very much, and if you have any questions I would be happy to answer them. Thank you very much.