Hello, everyone. Welcome to Cloud Native Live, where we dive into the code behind cloud native. I am Annie Telvastro, a CNCF ambassador and Senior Product Marketing Manager at Camunda, and I will be your host tonight. Every week, we bring a new set of presenters to showcase how to work with cloud native technologies. They will build things, they will break things, and they will answer your questions. So join us every Wednesday to watch live. This week, we have a great session called Live Sifting Through the Top 25 Containers, so we're really looking forward to that. Another exciting thing happening this week is that the program for this year's KubeCon Europe will go live, so I highly recommend registering and getting your ticket this week, or later as well, of course. We're going to be in very lovely Spain this time around. And as always, this is an official live stream of the CNCF, and as such it is subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that would be in violation of that Code of Conduct; basically, please be respectful of all of your fellow participants and presenters. That said, we very much encourage questions, so if you have any, just put them in the chat and we will get to them throughout the program today. With that, I'll hand it over to Josh to kick off today's presentation. Awesome, thank you so much, Annie. I am so excited to be here; this is such a cool topic. So I'm going to start by sharing my screen, and I want to ask the audience to send questions the whole time. This is meant to be a completely interactive session. I have a bunch of things I can speak on, but I would much prefer to answer your questions. I'll start with this dashboard right here that I put together; it is the data behind everything we're going to do today. The title is Live Sifting Through the Top 25 Containers.
And if we look over here, it's actually 155 containers, because I figured some stuff out. Originally it was going to be 25, because I had manually typed in the top 25 containers, but then I figured out the Docker API. So I pulled in all of the Docker official images, which ends up being 155. Docker will tell you it's 177; there's a long story about why that is, and we can get to it. The reason it has "sifting" in the title is that I work for a company called Anchore, and we have two open source projects called Syft and Grype. Syft is an SBOM generator — I'm sure many of you have heard of SBOMs at this point, it's a very popular topic — and Grype is a vulnerability scanner. So what I did is I took Syft, wrote some scripts that downloaded the 155 containers, scanned them, and spit out SBOMs. Then I did the same thing with Grype, where I scanned the containers with Grype, and I put it all into Elasticsearch. I used to work at Elastic, which is the company behind Elasticsearch, so any time I see data, the first thing I think of is: what can I do with this data in Elasticsearch? This was an amazing opportunity to do that. So we'll cover a couple of the bits of data we're looking at here. 155 containers, and in those 155 containers we have a total of 14,000 packages, which is a lot of packages. This is part of the fun: when we start looking at this data in aggregate, in the big-picture, high-level views, we can start asking some weird questions. Ten types of packages — things like RPMs, debs, APKs, node packages, all that kind of stuff — which is kind of a small number, but that's okay, that's just how it works. I think it shows that an enormous number of these container images use very similar types of packages. 1.4 million files, which I love, because that feels like a huge number, and it is, but when you're working in Elasticsearch it's super small.
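For anyone who wants a feel for the plumbing just described, here is a minimal sketch in Python. The `anchore/syft` and `anchore/grype` Docker images are the projects' published containers, but the index names and the bulk-payload helper are my own illustration, not the repo's actual scripts:

```python
import json

def syft_command(image):
    """Build a docker invocation that emits a Syft SBOM as JSON."""
    return ["docker", "run", "--rm", "anchore/syft:latest", image, "-o", "json"]

def grype_command(image):
    """Build a docker invocation that emits a Grype vulnerability report as JSON."""
    return ["docker", "run", "--rm", "anchore/grype:latest", image, "-o", "json"]

def bulk_body(index, docs):
    """Render documents as the NDJSON body Elasticsearch's _bulk API expects."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # document line
    return "\n".join(lines) + "\n"
```

Each image's Syft output would then be fed through something like `bulk_body("sboms", packages)` and POSTed to the cluster, with the Grype results going to a second index the same way.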
Do you have any idea — this is all running in my basement on a moderately sized computer. This isn't some super huge cluster or something like that, which makes it super fun. You could do this on your laptop, which is amazing. 290,000 unique files: one of the things I did is I took all the files and said, only show me one instance of each, and it's only 290,000. The reason we go from 1.4 million to 290,000 is that a lot of files have the same name, like /etc/passwd — almost every container has an /etc/passwd. In fact, we can figure out how many of them do in a little bit if we want. And then the bottom line is we hit some vulnerabilities: 5,934 vulnerabilities, and that's a lot of vulnerabilities. That's unique vulnerabilities, again, because containers share pieces, so there are going to be similar packages all over. The aggregate total is 72,000 vulnerabilities, which sounds like a lot, because it is. And then the other thing I put here — let me hide that quick — is the total number of licenses: 793 licenses. And of course you're going to say, wait a minute, there are not 793 types of licenses, and you are correct, there are not. We can explain why that is. Okay, so as I said, as I go through this data, please ask questions. Anything that pops into your mind, write it in the chat and we'll get to it, because this isn't a canned demo — this is data, and we can do anything we want with it, which is what makes it so amazing and fun. All right, so let's start. Oh, you know what? Yeah, we'll start with some pictures, and then I'll talk about how I did all this in a little bit, because I think this is what you're here for, right — the eye candy. So this is the first graph I'd like to show. This is the number of packages in a container, and this is not a mistake: 3,000 packages in a container called Rocket.Chat. We have Silverpeas, and we can expand this out.
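The 1.4-million-to-290,000 collapse he describes is just a deduplication by path. A toy version, with invented sample records shaped like the per-container file lists:

```python
from collections import Counter

# Invented sample records: (container, file path) pairs pulled out of the SBOMs.
files = [
    {"container": "debian", "path": "/etc/passwd"},
    {"container": "ubuntu", "path": "/etc/passwd"},
    {"container": "alpine", "path": "/etc/passwd"},
    {"container": "rocket.chat", "path": "/app/server.js"},
]

# "Only show me one instance of each" is a set over paths.
unique_paths = {f["path"] for f in files}

# The follow-up he mentions -- how many containers ship /etc/passwd --
# is just a count over the same records.
path_popularity = Counter(f["path"] for f in files)
```

On this sample, four file entries collapse to two unique paths, and `/etc/passwd` shows up in three of the containers.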
And we have the first question — oh, two actually, already. So Mark asks, why do you prefer ELK over Prometheus? And then, yeah, let's get to the next question as you answer this one. Yes, okay, so I prefer ELK because I worked at Elastic, so I know it really, really well. I spent four years and change at Elastic. So it's just because I know the tool and I have remarkably little startup time. There's no reason you couldn't put this data into the big-data system of your preference; this is what I know, so it's what I use. And the second question: will you be sharing the scripts? Yes, absolutely. So I have a GitHub repo here, and this link will be pasted somewhere, I've been told. And there we are, right there. This is all public; everything you're looking at was generated out of this repo. These scripts are hard to use, so if you want to do this, hit me up on Twitter or something and I can help you out, because I don't expect anyone to take this and be able to use it without some assistance. This is, we'll say, heavily geared towards me, because it is what I know. So anyway, let's look at the packages, right? This is just the top ten, or top five. Let's make this big. Let's look at what the distribution of packages in containers looks like if we make it absolutely enormous. And here's where we end up with our great big ones, and you can see it's a pretty typical distribution, the kind you see anytime you're dealing with data like this. There was another question as well, a continuation of the first one. I might guess the answer already, but: which one do you prefer, CloudWatch or ELK? Well, I mean ELK, obviously. Again, it's what I know. And honestly, if someone else wants to take this data and do something with it, put it in another system and discover new things, let me know — I'm all for adding scripts and expanding this as much as we can. I found the data fascinating when I started doing this.
It's one of those projects where, when you take a bunch of data and put it into a system, you don't always know what you're going to find. So it's kind of an adventure: is it going to be interesting? Is it going to be boring? I spent probably a month or so just investigating this before I said, yes, this is interesting, this is definitely worth talking about. So, okay. So these are just the number of packages in containers, right? That by itself, I don't think, is particularly interesting, other than knowing there are some containers that are huge and some containers that are small. This should surprise nobody. So I have some slightly more interesting ones. I really like this one: this is where we look at the package types in each container. So we have our Rocket.Chat, right, which had 3,000-and-some-odd packages in it. But now we can look at what the packages are — what is the data in this? And we can see that npm is almost all of the content we're seeing in that container image. Again, for anyone who's done a lot of npm development, this is not a surprise in any way, because when you do an npm install, there are dependencies of dependencies of dependencies. So we can look at that. What is the next one? Silverpeas — a lot of Java. And we've got our colors over here, so we can kind of pick out what they are. But you can see that the top containers appear to be Debian containers — a lot of Debian packages, no shocker there. And npm: the things that use npm use a lot of npm. Again, I don't think that surprises anyone, but it's interesting to look at. And this is where we can take this, expand it out, and look at all the containers, right? Now, I also want to stress: when you see me change numbers and the data appears, this isn't contrived data I've pre-populated.
This is actually happening as I'm typing it, which is why it's so much fun to do this. When you can get instantaneous answers to your questions, data is fun to explore. When you have to ask a question and wait five minutes, the data is not fun to explore — and I've been in many of those situations before. So anyway, now we can get a look where we see there's some RPM in here, not a huge surprise. We've got PHP — which, holy cow, PHP. We've got Jenkins, Go modules, gems, APKs, Python. And again, we can see that Java makes up an enormous amount of these container images. And Debian — a lot of these are Debian, and I suspect that's because you've got Debian containers and Ubuntu containers, and they make up the base image of an enormous number of these container images. But I think there's also a piece of this to think about: when you're using something like Alpine to build your containers, Alpine images are generally small, so obviously it's not going to look as impressive on a graph, whereas some of these other base images are a little bigger. Things like Debian, Ubuntu, Red Hat, Fedora — those are considerably larger container images, which is fine. I'm not saying there's anything wrong with that; I'm a firm believer in do what works, and I don't like to claim one thing is better than another necessarily. And then of course npm — again, we see the top containers have a lot of node modules, no big surprise there. All right. So what are the most popular packages we see in the containers? That's the next one. I think if we look at this list, it should shock nobody: a tool like tar — lots of people use tar — bash as a shell, grep, sed, gzip. These are all pretty normal packages you're going to see. And again, we can expand this out. Ah, I forgot to close Signal. Let's make it a thousand, and then we'll update this. And we can see, again, we have a similar distribution.
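The "most popular packages" view amounts to counting how many containers each package name appears in. A minimal sketch with invented sample records:

```python
from collections import Counter

# Invented sample records, shaped loosely like the per-container SBOM data.
packages = [
    {"container": "debian", "name": "tar"},
    {"container": "ubuntu", "name": "tar"},
    {"container": "alpine", "name": "busybox"},
    {"container": "debian", "name": "bash"},
    {"container": "ubuntu", "name": "bash"},
]

# Deduplicate (container, name) pairs first, so a package that appears
# twice inside one container doesn't inflate its popularity.
pairs = {(p["container"], p["name"]) for p in packages}
popularity = Counter(name for _, name in pairs)
```

Sorting `popularity.most_common()` gives exactly the tar/bash/grep-style ranking shown on the graph.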
We can't read anything, of course, which is part of the fun, but we can take a peek at what some of these are. Up here we're in the thousands, and we drop down into the hundreds pretty quickly, and then every container kind of does its own thing. This one isn't, I think, particularly interesting by itself; it was just fun to look at, so I put it in there. Okay. Now, overall package types. I like this graph too, because instead of breaking the packages apart into each container, we just look at the raw numbers: what are we seeing here? And I believe this is — yeah, this is all of them, all of the packages we see. Now, you could say, oh, but I know there are way more packages in these containers than that. And there are a couple of pieces to this. Part of it is that the intention of containers is obviously reusability, and then there are also going to be certain package managers that Syft doesn't support yet — Syft being the open source tool. If there's something it doesn't support, by all means get involved: patches, submit bugs, all that sort of thing, because obviously the most popular ones are the ones that get the attention here. But we can see that Debian packages make up an enormous amount of what we see — again, just because the base images are a lot of Debian, this is a surprise to nobody. Java — you know, we saw all the Java archives, again not a big surprise. Node modules, RPM. I think anyone who looks at this is probably not shocked in any way at the results, right? And I think part of this as well is that some of these others — gems, APKs, Python, and even Go to a degree — are less dependency-heavy compared to some of the other languages. Node is very Unix-y like that: lots of very small modules. Whereas something like Python is more about feature-rich modules.
So you'll just naturally have less of that, I think, is the way to look at it. All right. License types. I like this graph a lot because it's weird. So again, I don't think anyone's going to be surprised by the top handful of licenses we see here; I won't go into what all the open source licenses might be. And so — questions when we — perfect. Yes. Ah, cool. Kurt asks, are there any differences between the dpkg-based vendors, for example, Ubuntu or Debian? I need an expansion on that. So actually, Kurt and I do a podcast together, so he's probably here to harass me, but that's fine. I mean, of course there are differences. I guess, what does that mean? What are you looking for in the differences? The sheer numbers, I guess. We can just explore, and Kurt can type in questions. Here, if we take package names, we can do things like — do I have base image? You know, I don't know. Let's just pick on container name is debian; let's just pick on the Debian containers. Why am I getting nothing? Oh, because it's debian:latest, isn't it? And this is part of the fun too: remembering what you did. So here, if we look at — yeah, we have a clarification from Kurt: are they on par, for example, with base images, or is one significantly better? Yeah, you're asking if one Linux distro is better than another. Nope, I'm not going there. But, for example, here's what you see in the Debian container: these are how many packages there are. And then we can switch this to Ubuntu, for example, and get a feel for what the base images look like there. And they're literally the same thing. I mean, I'm not that surprised. No, they're about the same size, Kurt — as we can see, it's functionally the same thing, no surprise there. I know there is a way to tell which base image something uses, which I haven't done.
We can try to do that later, but I don't want to go down that rat hole yet, because it's going to take me a few minutes to figure out how to find that data, so I'm not going to do it at the moment. Funny enough, it seems obvious in hindsight that I should have done this, but I didn't. It's all part of the fun. So I want to go back to license types, and I want to show something off that's really weird here. These are the top ten licenses. If I show all of the licenses — because there are 700-and-change — it's kind of a mess, but you see down here there are a lot of licenses that are barely used, right? We get into single-digit territory. An easier way to do this is to just reverse it: instead of showing the most used, we'll show the least used licenses. And you can see you end up with some really weird stuff. These are prose licenses people have put into the containers. In this case, I'm logging whatever the package says its license is; we're not doing any sort of normalization on this data, because I find it a very interesting data point in favor of normalizing the data. If you look at the SPDX SBOM format, they have a constrained list of licenses you're allowed to use. It's not an open field — you have to pick one out of their list. And this is why: if we look at this, you can see people put all kinds of wild stuff in these license fields, because that's just what people do, right? In some of these you can pick one license or another license, or — I don't know, I have no idea what "unlimited" means. Who knows, right? So this is just what happens with licenses: there are many, many open source licenses, and it's not always simple to pick one. I think we're doing a slightly better job in many instances today versus what we did in the past, but it still happens, right?
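The case for SPDX's constrained list can be sketched in a few lines: free-text license strings have to be normalized before they can be aggregated at all. The alias table below is a tiny illustrative sample of my own, not a complete SPDX mapping:

```python
# Hypothetical alias table: free-text strings seen in package metadata
# mapped onto SPDX-style identifiers. Real normalizers carry far more entries.
ALIASES = {
    "apache 2.0": "Apache-2.0",
    "apache-2.0": "Apache-2.0",
    "asl 2.0": "Apache-2.0",
    "gplv2": "GPL-2.0-only",
    "gpl-2": "GPL-2.0-only",
}

def normalize_license(raw):
    """Map a free-text license string onto an SPDX-style identifier,
    or return NOASSERTION (SPDX's "we can't say" value) if unrecognized."""
    key = raw.strip().lower()
    return ALIASES.get(key, "NOASSERTION")
```

Run the 793 raw strings through something like this and the count drops toward the real number of distinct licenses, with the "unlimited"-style oddballs flagged for human review.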
And so this is actually something the open source community probably needs to address at some point. I wouldn't call it license sprawl — it's license correctness. All right, so let's keep going. Here's another graph I like. I took all of the file sizes that the SBOM reported, and we can look at this and say, this is a ridiculous graph — and it absolutely is. I love that there is one file, or six files, that are 106 megabytes in size, which is strange, but it is what it is. But here's where we can actually do some magic with Elasticsearch and start looking back at the smaller bits of data. And here's where we start getting — we can see right now we're into the kilobyte range, and we can see there are 50,000 files that are zero bytes in these containers. 50,000 is a lot of files, but what we can do now is obviously pull that out. So we're going to say size is not zero, and that'll get rid of our zeros. So now we can look and — it's still there. Why did it — oh, it's a long story why it's still there, but I'm not going to obsess over it right now. I have to get rid of all this other stuff. Yeah, there's another question from the audience as well: Gillibur is asking, did you search for a specific file name to determine the license type? The license type is provided by the package managers. The way Syft works is it will look at — for example, if you use npm, it looks at your package-lock.json and then determines the license type from the installed packages. Same thing with the RPM database and all of that. So if you have a file on disk that's not part of the package management system, today that is not reported as part of these results. So, all right. And then there was another question, from Chris actually: is it picking up symlinks as zero? Yeah, yeah, it is. And there are just a lot of zero-byte files.
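The "size is not zero" filter plus histogram amounts to bucketing file sizes after dropping the zero-byte entries. A rough sketch — the bucket edges are arbitrary choices for illustration, not what the dashboard uses:

```python
def bucket(size):
    """Assign a file size in bytes to a coarse histogram bucket label."""
    if size < 1024:
        return "<1K"
    if size < 1024 * 1024:
        return "1K-1M"
    return ">=1M"

def histogram(sizes):
    """Count files per bucket, skipping zero-byte entries
    (the ~50,000 empty files and symlinks reported as size zero)."""
    counts = {}
    for s in sizes:
        if s == 0:
            continue
        label = bucket(s)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

On real data, almost everything lands in the small buckets, which is the point of the graph: containers are mostly full of small files.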
It's shocking how strange some of this data can be. And okay, hold on, let's fix this. Let's look at our filter — let's get rid of this. And this is, of course, part of the fun of giving live demos, because this worked correctly when I did it the other day. And now, okay, we look at one, two — let's say a megabyte. All right, what do we look at now? And you can still see there's a lot — it's still reporting those as zero, and they aren't actually zero. I remember now: it's because of the way the histogram is being reported by Elasticsearch. But you can see it's mostly small files; that's just the point of this graph. I don't want to obsess over it because it's not super interesting, but it amused me to look at. Okay. So that is what I put together around the packages and what we can find, and we can dig into some more stuff in a little bit. Before we move on to the vulnerabilities, I want to explain what I've done here and how this all works. So I've got a terminal — I usually don't like terminals, but this is the project. These are the scripts that were used. And in this top-containers JSON file are all the containers that are included in this dataset, right? Then we query Docker Hub, we pull them down, and then we do our business. What happens is we run this build-SBOM script, which just runs Syft over the containers. And I'll give you an example — I've got a little demo I can give here of Syft. I run it out of the container, just so it's easier to pick up the latest version, versus running the binary on my system. I like containers for this. If we just run it, you're going to see it will basically take a container — in this case I picked Alpine, because Alpine is really small — and it just spits out the stuff it found in the Alpine container. And if we pick a big container, obviously the list is enormous. And there are various ways to slice and dice this.
There's a nice output format Syft has: if we give it -o json, it spits out a bunch of JSON data in its own format, called Syft native. And this is what I took — I took this JSON and literally put it into Elasticsearch, and that's where all of that package information came from. And then likewise there is the other tool, Grype. Alpine is boring because — actually, here, I'll just run it. If we run Grype over Alpine, the Alpine data is boring because there are no vulnerabilities right now in the Alpine base image, which is cool, but not exciting for this demo. So let's pick on something like debian:latest. Perfect. And then there's a question again — which is, by the way, amazing; thank you so much, everyone, for asking so many questions. Tani asks, how do you decide the top 25 containers — based on the number of downloads or the number of stars on Docker Hub? Okay, so for anyone who joined late, I'll explain again a little bit. This isn't the top 25 containers — I'll pull up my dashboard, where we show my data has 155 containers in it. When I first did this, I just went to Docker Hub and — oh crap, this could make me log in. So when you go to Docker Hub and you log in, it shows you the container images in a list, and I literally went down the list. I have no idea what order Docker picks by default. That was when I first started doing this; I literally typed the container names in by hand. What I did now is — if I open my terminal here, I have a tool called get-containers. I figured out the API call to ask Docker Hub for all of the official images. So this creates a list of all of the Docker official images, which is actually in this top-containers file. It's 170-some, but some of the containers — like the scratch container is empty, you can't scan that, and there are a couple of containers that were s390x images that Syft didn't like.
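As an aside, the get-containers step he describes is, as I understand it, a paginated walk of Docker Hub's v2 API, where the official images live under the `library` namespace. Treat the URL and response shape below as assumptions to check against Hub's documentation, not the repo's actual code:

```python
import json
from urllib.request import urlopen

# Assumed endpoint for listing the official images ("library" namespace).
HUB_URL = "https://hub.docker.com/v2/repositories/library/?page_size=100"

def parse_page(payload):
    """Pull repository names and the next-page URL out of one decoded response."""
    names = [repo["name"] for repo in payload.get("results", [])]
    return names, payload.get("next")

def official_images(fetch, start=HUB_URL):
    """Walk the paginated listing. `fetch` takes a URL and returns decoded
    JSON, so it can be swapped out for testing without hitting the network."""
    url, names = start, []
    while url:
        page_names, url = parse_page(fetch(url))
        names.extend(page_names)
    return names

def fetch_live(url):
    """Real fetcher, for actually querying Docker Hub."""
    return json.load(urlopen(url))
```

Calling `official_images(fetch_live)` would walk every page and return the full list — the 170-some names that get filtered down to the scannable 155.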
And rather than deal with it, I just commented those out, so we end up with 155 when it's all said and done, instead of 170-some. But that is where these come from — like if you type docker pull debian, right? Normally you type docker pull with a namespace-slash-container-name, but for the Docker official images, you don't do that. And these are all of the Docker official images, which I found to be vastly more interesting than just 25, so I'm glad I figured that out. Okay. Oh, and Kurt asked another question — kind of a question, kind of not, but I'm loving all the interaction from Kurt. He says, I'm curious, because on Docker Hub, Debian was updated nine days ago and it was built six days ago, so in theory it shouldn't be that bad. Well, it kind of isn't. Here, let me run a Debian — I don't want to obsess over a scan result right now, because that's a whole other topic we could discuss for days and days on end. But what happens with a lot of these is they don't necessarily have fixes, right? The fix information you see in this data is coming from Debian's security advisory database — they have a fancy name for it that I don't remember. And in a lot of these cases they won't fix things, which we're all familiar with. Sometimes there just aren't fixes yet, for a variety of reasons, because obviously upgrading some of this is hard. And what you end up with is these severity ratings come from NVD, the National Vulnerability Database, and they're always wrong. So basically, if you look at this, there are no fixes available for these particular IDs. It looks worse than it is, but the vast majority of them — see this "negligible"? This is where Debian basically said, whatever, it's not really a problem, we'll fix it someday. And actually, I have a graph that shows this, in fact, now that you mention it.
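The unique-versus-aggregate vulnerability totals from the dashboard fall out of one small aggregation: the same CVE shows up in many containers, so every finding counts toward the aggregate number while each CVE counts once toward the unique number. A sketch over invented findings, shaped loosely like Grype matches:

```python
from collections import Counter

# Invented sample findings: the same CVE appears in two containers.
findings = [
    {"id": "CVE-2021-0001", "severity": "Medium", "container": "debian"},
    {"id": "CVE-2021-0001", "severity": "Medium", "container": "ubuntu"},
    {"id": "CVE-2021-0002", "severity": "Negligible", "container": "debian"},
]

# Aggregate total (the 72,000-style number): every finding counts.
aggregate = Counter(f["severity"] for f in findings)

# Unique total (the 5,934-style number): each CVE ID counted once.
unique = Counter({f["id"]: f["severity"] for f in findings}.values())
```

Here three findings collapse to two unique CVEs, which is exactly the 72,000-versus-5,934 gap in miniature.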
So, okay, I think I answered all the questions, but keep them coming — I'm loving this, this is super fun. So let me show that, okay? Let me open my visualize library again. And I actually have vulnerability severities. This is where we can pick on this, right? If you look at all the vulnerabilities across all of the container images, this is the breakdown. Now, this is the data that comes from NVD, functionally, which, like I said, has some issues — I don't want to get into why right now. But I think this is a pretty reasonable representation of what we're looking at, right? A lot of mediums — not a big shocker there, because any of us who deal with vulnerabilities on a regular basis know that low and medium findings are usually "we'll get to it, right? I'm busy putting out the fire over here right now." And then negligible is a rating Syft will — or, I'm sorry, it's a rating Grype will assign to a vulnerability when it knows that Debian, for example, has basically said, this is not a big deal. And so obviously there are a lot of those. And these numbers are obviously gigantic: 23,000, 21,000. A lot of highs — 11,000 — and there's a reason for that, which I'll explain in a little bit. And then obviously low, and it's good that critical is so small, because we don't want critical to be a huge number. And then there's a handful of unknowns — well, a "handful" of 2,000 is more than a handful. But again, I don't think this graph should shock anyone in terms of what the distribution looks like. Obviously we would love to see zero in all these slots; that's just not a realistic expectation. So if you have to see vulnerabilities, you don't want to see lots of criticals and highs — you want to see the low things, and that is what we see with this data. So now, why are some of these so big? And this is where this graph is horrifying. Right: the Rails container has 7,000 vulnerabilities in it.
When I saw this graph, my first thought was, well, that's not right, there must be a bug in something. And it turns out it actually is right, and I can show you why. If we look at the Rails official image, it's deprecated — they say use Ruby instead — and it hasn't been updated in about five years. I think we can see that in the tags here: yeah, latest Rails, five years ago. And that's where, by itself, it's easy to pick on the Rails people, or Docker, or whoever — but it's hard to deprecate things. If they turned this off, they'd probably break a lot of infrastructure. Hopefully it doesn't get a lot of updates or downloads — I don't know where that came from. But anyway, the point is that the value in this data is being able to get this holistic view of your container images and what's happening inside of them, and then you can make some decisions. You can say things like, okay, I'm using this Rails container, I didn't know it was end of life, I obviously need to do something about this. Whereas without any ability to scan or gain insight into what's happening in your infrastructure, you might be blind in this way. And you could think, oh, everything's great, I run docker pull latest for this image and it's always pulling in the latest one, so I don't have any trouble. But the reality is: you might. Yeah — there's another question once again; thank you so much for asking. They say: I find that a few high-severity ones are marked as won't fix. Who says won't fix — the vendor? Yes, the vendor. It's the vendor, and in that case it's Debian. So I'll give a quick little recap of why that is. Here, I can just pull it up: there's nvd.nist.gov, right? And they've got vulnerabilities they score — whatever, I'll just pick on one right here. Okay, so they take these vulnerabilities and then they assign a severity to them, and this one is 9.8 critical. It's a CVSS score.
And when NIST does this work, they take a worst-case view of the vulnerability. So, for example, one of the things you'll see a lot is in glibc — do I still have this up? I do. So there's glibc, right, which is the core C library of basically every Linux distribution. Every one of these has a libc, and most of them have the same one. You'll see there are some findings here that Debian said won't fix. So we can grab the CVE right here — let me open my web browser back up. I'm just going to search for it, because it's quicker than figuring out the NVD search box. Right: so NVD marks this as critical, and Debian said won't fix. And the reason for that is that whatever this bug is, it's not something that Debian uses, or I guess the library might not be exposed in a way that's vulnerable. And so Debian might say, this isn't a big deal, it's very hard to fix, we'll fix it someday when we upgrade — we're not going to treat it as a vulnerability by itself. And that happens on a regular basis. From the NVD perspective, they're making the assumption that this particular function is available over the network, because they don't know how anyone's using it, so they have to take the worst-case view possible, whereas Debian knows what they're doing with it. So Debian can say: I'm going to decide this is won't-fix. Now, that of course creates some contention, because I've had many organizations say, oh, we only trust NVD. And then it's like, well, there's nothing I can do, because Debian didn't update it. And again, I think this is one of those places where this is a very immature space still, and we have a lot of work to do to understand what all this means and how it moves forward. We don't have all the answers — this is part of the discussion. And I also think that part of the value of this kind of data is being able to look at some of it and ask questions, like: why is this happening?
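That NVD-versus-distro tension can be captured in one rule: prefer the distribution's assessment when one exists, because the distro knows how it actually builds and exposes the package. This mirrors my reading of how a scanner like Grype arrives at a "negligible" rating; treat it as an illustration of the idea, not Grype's actual code:

```python
def effective_severity(nvd_severity, distro_severity=None):
    """Pick the severity to report: the distribution's own assessment
    (e.g. Debian marking a CVE negligible or won't-fix for its build)
    overrides NVD's worst-case CVSS-derived rating when present."""
    return distro_severity if distro_severity else nvd_severity
```

So a glibc CVE that NVD scores 9.8 critical, but that Debian has triaged as negligible for its packages, surfaces as negligible — which is exactly where the big negligible slice on the severity graph comes from.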
What is going on here? And I think that's the value of it, right? It's easy to talk about data when you're looking at it; it's hard to talk about some of these concepts when they're just ephemeral things floating around that we don't totally understand. So, all right, that was a lot of explaining. And if anyone wants to discuss vulnerabilities — I love discussing them, but I don't want to obsess over it right now, so come find me on Twitter, hit me up somewhere else. I'm even happy to do another live stream just about vulnerabilities, because it is a super interesting topic. But, okay. So, right, we've got all these vulnerabilities now. Vulnerabilities per container — I'm going to explode this one again; we'll look at what it looks like as a whole. This one takes a little longer. Oh, it failed on me — that's funny. Let's make it a little smaller. All right, this is just the top 100; if I go all the way, it obviously doesn't like that. But you can see it's a pretty reasonable distribution, though a lot of these containers still have a lot of vulnerabilities in them, right? It's a significant number — we're talking hundreds for the most part. And why that is is partially what we just talked about: some of these vulnerabilities aren't that bad. And part of it is also just, how fresh is this stuff? I don't have freshness information in this; I don't know how old some of these packages might be. But I mean, we've all seen projects where you do your npm install once, it works, you never touch it again — and now you've got a bunch of vulnerabilities. So this is one of those instances where I hesitate to tell people to obsess over individual vulnerabilities; instead, use graphs like this as a way to gauge how things are going. If your vulnerability graph is constantly going up and to the right, you probably need to take some action, because at the very least you want it flat. Ideally you want it going down.
And again, this is where when we have the data we can start asking questions. One of the things I want to do next is, these are just the latest containers in most instances. I also want to pull in historical containers. And then we can do things like say, okay, what does the vulnerability graph look like for this container over the last two years? And then we can start seeing, is it going up? Is it going down? What's going on? And this is an instance where if you have a container image that's unmaintained... Did the sound go off? Is it just me? Yes, I think it's not just me. So we lost sound. No worries though, I think we're going to get it back hopefully soon. There usually is a way to fix these things quite fast. There we go. My audio device kind of, oh, I think my network shut off. I just got a whole bunch of messages. All right, all right, we're all good now. Yeah. It's a bit louder than before, but I guess that's fine. I'll turn it down just a bit. All right, that should be a little better. I love live streams. This is part of the fun, right? Okay, okay. I'm sorry about that, everybody. All right, where was I? Okay, we did container vulnerability. Okay, here's one that I like. This is where we can break apart the vulnerabilities by severity, right? We just talked about all of this severity information and where it came from and what's going on. And now again, we can split apart our graph. So I guess one other note I'll add: I prefer the horizontal graphs because they're easy to read as humans, we don't have to turn our heads to read them. But unfortunately I can't make these pretty graphs horizontal, I can only do them vertically. So we're kind of stuck with it. But we can see, right? If we look at critical findings, we know there's not a lot of critical vulnerabilities. And when we look at this graph, we see that, yes, there aren't that many.
In fact, even the Rails container, which is by far the worst one, you know, we're talking mostly medium and negligible. There's a lot of high, but you know what I mean? It's less terrible when you look at it in that context. It's still too much in my opinion. I would definitely ask questions. But again, this is where we can explode this out. I'll just do the top 100 so I don't break anything again. That's how demos work. But we again see, you know, we've got our top two containers. We can look at, you know, medium, negligible; the criticals are what we really care about. And critical, yeah, you can't even see it for the most part on this graph, which I think is really exciting. So, cool. I've also got, where's my other one? Vulnerability severity, vulnerability by package type. So this one I really like. This is where we take the type of package versus the number of vulnerabilities each package has. And so in this case, we can see that Debian packages account for the vast majority of the vulnerabilities we see. And part of that is because when we scan just the latest updated Debian image, we get a bunch of stuff in it, obviously. And then there's also those older images, like with Rails, that account for some of this. And then you can see, we can take out Debian packages, for example, when we look at some of these graphs, and that'll change what it looks like. And then obviously Java archive, RPM, no surprise. NPM is a lot lower than I expected. I really thought we would see far more NPM vulnerabilities on this graph. And so that was actually a pleasant surprise to me when I looked at that. And so now here's where we can add a filter. So we'll do package.keyword is not deb, oh, not package name, I need type. Type is not deb. Okay, so we do that. And now this is going to filter out all the Debian packages, right? Debian disappears from the graph.
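Under the hood, a Kibana filter like "type is not deb" corresponds roughly to an Elasticsearch bool query with a `must_not` term clause. A sketch of what that query body might look like; the index and field names (`sboms`, `package.type`) are assumptions for illustration, not necessarily what this dashboard uses:

```python
# Hypothetical Elasticsearch query body that excludes Debian packages,
# roughly equivalent to the Kibana filter "type is not deb".
# Field name "package.type" is an assumption for illustration.
query = {
    "query": {
        "bool": {
            "must_not": [
                {"term": {"package.type": "deb"}}
            ]
        }
    }
}

# With the official client you'd run something like:
#   es.search(index="sboms", body=query)
print(query)
```

"Pinning" the filter in Kibana, as shown next, just applies this same clause to every visualization on the dashboard.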
Now what's cool is we can do this thing we call pinning, where I just pin this filter. So now all of the visualizations we look at take out Debian packages. So now let's look at vulnerabilities by container with no Debian. Holy cow, it changes a lot, doesn't it? Don't get me wrong, 1,400 vulnerabilities in one thing is still a lot, but the Rails container disappeared. And that's because the vast majority of the findings are from the Debian packages. Now granted, that's five years of Debian security updates, which is an enormous amount of security updates, obviously, as we know. But this is where some of this data gets more fun. So let's go back and look at our vulnerabilities by package name. So what happens now? Oh, Jackson Databind. Anyone who does any Java knows Jackson Databind well. But once we take out, you know, we can disable this, temporarily disable it for a minute. And now we see we get all of the, I guess, Linux packages back, because we know they're coming from Debian in this case. That's interesting to me. Now I want to specify, this does not mean don't use Debian or Ubuntu as your base images. This simply means that in the data set I have, these are the results. You need to look at the containers you're using, as well as the vulnerabilities in them, to make your own determinations. One of the dangers of data is drawing incorrect conclusions from it. And so I want to be abundantly clear about that. I am in no way disparaging Debian. All of the containers I use have a Debian base image. I won't stop doing that even though I know they might have slightly more vulnerabilities than, say, Alpine. It's just that I'm comfortable with what they do. I trust the Debian security team. All right, let's re-enable that and look at some more graphs. Excuse me. So here's where we can look at the severities, right? We've got this. We see medium, high, critical, not a lot of lows anymore, which doesn't surprise me.
I think a lot of the low and negligible findings are coming from the distributions, and there are kind of two reasons for that. One is that the distributions are busy, and there are more security vulnerabilities than they can fix given the available time, right? So obviously things that are low and medium are going to get fixed less often. As they should; I don't want Debian fixing low things when there's a bunch of critical vulnerabilities. But then that negligible kind of goes away, because if we look here, there's no negligible in this list. And the reason for that is the negligible data is coming from Debian. When we take Debian out, now we don't have negligible data. So we don't have a way to report on, like, we don't know if an NPM package is negligible using that definition, because that information just isn't provided. And in the case of Grype, it is using publicly available data whenever possible to figure all this out. All right, let's see. What else do we have? Again, container vulnerability by package type. Wait, this isn't right. Oh, no, it is. So right, when we did this graph, I don't think I've shown this one yet. I apologize, this gets so exciting and I forget which of these I've shown because I keep going out of order. I had a nice order for myself. All right, so let's disable our filter, and we can see again, unsurprisingly, Debian packages make up the vast majority of these, right? Like Rails, 7,000 of the vulnerabilities are from Debian. So almost all of them, right? And again, this is just because some of these containers are using Debian images that haven't been updated in a long time. And that's why we're seeing this, excuse me. So now we yank out our Debian packages again, the deb packages. And now we can look at where these are coming from. And so let's expand out our top five to, let's say, let's try 500, does that work? It does work, great.
So now we can take a look at where the vulnerabilities are coming from once we start filtering out some of the things we know are slightly less important, or less interesting in our case. So we can see, what is the green? The green is Java, right? So that's right there. I mean, I'm not surprised. I think Java, and I guess Debian for that matter, suffer from a similar problem: when you've been a language for a really long time, there's a lot of history. And so there's a non-trivial number of projects that will pull extremely old versions of a package because they just work, and my job isn't to keep it updated, my job is to make it work. And so that's just one of the things we see happen in these instances. And I think that's not a surprise to anybody. What is the NPM? Again, I expected NPM to be way higher than it was, but it's not. So that's great. I have no idea what ghost is. RPM, not a big surprise. We could yank RPM out as well, and actually I can filter out RPM and APK, we can yank those out because those are, again, Linux distributions. So again, not a big surprise. I think a lot of Java archives, NPM, what else is in here? Jenkins, I don't know where that came from. Gem, Ruby is a similar situation where there's just a lot of Ruby stuff. Okay, okay. Let me undo those. Let's disable that for a minute. And let's see what some of my other... Okay, we did that one. Okay, here's one. This one I like a lot. So there is vulnerability information, there's severity information, and there is: is it fixed? One of the bits of data that Grype gives you is whether there's a fix available. In fact, if we look at our output over here from Grype, it says fixed-in. This is where, if it knows there's a fix, it will tell you what the fix is for a particular vulnerability. Where there's been, for example, a release from Debian, there's a Debian security advisory, a DSA, and Debian fixed something.
And then Grype will say, oh, hey, there's a fix available for this particular vulnerability. And so we look at this, excuse me, and we can see, in the case of Rails, what, 4,000 of those vulnerabilities are fixed, right? That's a lot. 2,000 are not fixed and 910 are won't-fix. And this again is from that Debian won't-fix situation. But you can see, that's an enormous amount of stuff. Now, is it all debs? I don't know, let's... Oh, wait, if we enable Debian packages. Or no, I'm sorry, we disable Debian. If we disable Debian, Rails isn't even on here. Let's see if I can pull Rails back in. There's Rails. So... Yeah, we'll do that. There was a question for CNCF in the chat as well. A viewer asks about how to get onto this program. I think you can email, for example, online-programs at cncf.io to get started with a discussion there. But it's amazing that our session today is such a hit that everyone wants to get on as well. Awesome, awesome. So, okay, okay. So here's where now, what is going on? Go away. Right, we've got Rails, and almost all of them disappear, right? Once we take out Debian and we search for Rails and... My Elastic-fu is weak. That should have looked different. But anyway. All right, let's disable that again, right? And we can look at all of the... Let's expand this out. Let's do 500 again, which should hit them all. And we can see, okay, so there are some unknowns. We get unknowns when Grype just doesn't know what to do, right? And that happens sometimes, where we don't know if there's a fix, we don't know if there's not a fix. It just happens, right? There's just incomplete data. And this is an instance where some people might be like, oh, unknown, why even talk about that? It is much more valuable to me to have an unknown state than it is to just skip it and pretend it doesn't exist. So I value that, right? A lot of won't-fix, not a surprise there. Not fixed, whoops, I disabled it.
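The fix-state breakdown being described (fixed, not fixed, won't fix, plus the unknown bucket that gets kept rather than dropped) is another simple tally over Grype-style matches. A sketch, with invented records and field names assumed from memory of Grype's JSON output:

```python
from collections import Counter

# Invented matches; the four states mirror the ones discussed above,
# with "unknown" kept as its own bucket instead of being discarded.
matches = [
    {"vulnerability": {"fix": {"state": "fixed"}}},
    {"vulnerability": {"fix": {"state": "fixed"}}},
    {"vulnerability": {"fix": {"state": "wont-fix"}}},
    {"vulnerability": {"fix": {"state": "not-fixed"}}},
    {"vulnerability": {"fix": {"state": "unknown"}}},
]
by_state = Counter(m["vulnerability"]["fix"]["state"] for m in matches)
# Tallies here: fixed=2, wont-fix=1, not-fixed=1, unknown=1
print(by_state)
```

The same grouping, run per container, gives the stacked fixed/not-fixed/won't-fix bars shown in the dashboard.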
There's a shockingly small number of things that aren't fixed, in my opinion, when we look at all this data. And again, there's a variety of reasons for that. But then the amount of fixed: these are vulnerabilities that you could make go away in many instances just by running apt-get update and apt-get upgrade, you know, in these container images. And so this is another good reason, when you're creating your container images, to install the updates during creation, because maybe the base image doesn't have them. Maybe you're not using a latest tag for a base image. You know, there are many arguments about pinning versus not pinning and all this stuff. So it's just something to keep in mind, right? When you look at data like this, you can see the effect that installing updates can have. And obviously, again, everyone needs to do their own thing, everyone needs to do their own research, because what works for me might not work for you. And I'm very happy to admit that. All right, okay, cool. So we got through most of this stuff. I've got a handful of other fun little bits of data I included down at the bottom. One of the ones I like is the CVSS score distribution. This is where we talked about these NVD vulnerabilities, and this is the CVSS scores. CVSS scores are basically zero to 10, 10 being the worst, zero being not a security vulnerability. And we have access to all the data. And so one of the things I did was I decided, what does it look like if we graph it? And it's a bell curve distribution. It's boring, it is what it is. But now, what if we take out our Debian packages? It doesn't really change. No one is surprised by this, I think. So let's see, what else do I have here? This one I liked a lot. So one of the things I did is I have the hash of all the files that are part of this data. And so I thought, what happens if I, how many hashes are the same, right?
And this is an instance where, remember those zero-byte files we talked about? So zero-byte files all have the same hash, right? Unsurprisingly, because they have the same data in them. And so these are a bunch of zero-byte files that are causing this weird hash, but then there are other things, like, what is this? There's a file where 700 of them are the same. And so this is actually where we can pin this again, and let's go look at what this is. And so there's this thing called Discover in Kibana. I'm in the wrong index, so we can look here. What is this? It looks like it's man pages, right? We're getting this hash from a bunch of man pages. Now, why are some of the man pages the same even though they have different names? I have no idea. I didn't look that closely at it. But this is some of the just weird data you can find when you look at all this. Why are they the same? I don't know. All right, let's get rid of this, get rid of this. Oh, and it should also be said, yeah, these are the file hashes. I filtered out empty, and empty just means no file hash. This is because Syft is pulling the file hash out of the packaging system, and if the packaging system doesn't tell it a file hash, it doesn't currently generate it. Thank goodness, because it would take forever to generate all this. I mean, I guess that's kind of an interesting data point in itself. So I've got my repo here, right? Where I generated all these SBOMs, and these are the SBOMs from Syft. And you can see some of them are of substantial size, like 54 megabytes, things like that. All told, it's 1.8 gigabytes of SBOM content. That's a lot of fricking SBOM content. And that's just 150-some container images. You can imagine if you're in an environment with thousands or millions of containers, this can become a substantial number. So it's just one of the interesting data points, I guess, that I found amusing.
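The zero-byte collision mentioned above is easy to demonstrate: a hash is a function of content only, so every empty file produces the same digest regardless of its name or where it lives. Syft records whatever digest the packaging system provides, so the exact algorithm varies; SHA-256 is used here just to show the behavior.

```python
import hashlib

# Every zero-byte file has the same content (none), so they all share
# one digest. This is the well-known SHA-256 of the empty input.
empty_digest = hashlib.sha256(b"").hexdigest()
print(empty_digest)
# e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

# Two "different files" with identical content collide on the same hash,
# which is exactly why those zero-byte files pile up in one bucket.
another_empty = hashlib.sha256(b"").hexdigest()
assert empty_digest == another_empty
```

The same reasoning explains the 700 matching man pages: identical bytes, identical hash, whatever the filenames say.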
And then let's see, the other one I liked was I took all the package versions and I looked at what they had in common. And this one here was another one of those, like, what the heck is this? And I don't remember anymore because I haven't looked at it in a while. So let's pin it, it's pinned. We can go over here into Discover and take a peek. And that's because it is an SBOM and it is something called Open Liberty. It's a bunch of Java stuff, it looks like, which suggests to me that there must be, what is the Java archive package name? It's probably one of those instances where one Java source archive ends up generating a huge number of jars on the other end, and they all have the same version. And so in fact, we can probably, I haven't done this so this might not work. If we look at our visualize library and we say, if we do packages in container, yeah, we can see there are two containers that look like they pick this up. It's WebSphere and Open Liberty, and if you add those two numbers up, it should be pretty close to what we had. So yeah, this is one of those fun times where we can just look at some strange data and say, what's going on? I have no idea. And I love it, it makes this so much fun sometimes. Am I out of graphs? I think I might be out of graphs. No, I'm not, no, we got that, we got that. Okay, right, that is everything I had. Does anyone have any questions? Is there anything you want to ask about? We've got a few more minutes still. We can dig into something else. That's a good time to, yeah, make essentially a final call. As you mentioned, we have time for a few questions, perfect, no worries. I think there hasn't been a question in a while, but there was a comment from Weston a while ago saying, I think it will currently end up being unknown any time Grype falls back to NVD for a match. That was the latest, yeah. Very cool, thank you, Weston.
Weston is, I work with Weston, and so he and I are very familiar with Grype. So thanks, man. Awesome. Even clarifications from the audience. That's always great to have. I love it. They're smarter than me, I'm not going to pretend. No, this is just a lot of fun. Like I said, I've got this repo here. If anyone wants to give it a go, by all means, ask me questions, hit me up. You can find me on Twitter. I don't know if there are any contact details in the notes for this, and if there aren't, I'm not hard to find. But yeah, I'm happy to help anyone out with this stuff. I think it's a lot of fun. I've got some other repos where I put the CVE data into Elasticsearch. I put GitHub data into Elasticsearch. I put everything I can find into Elasticsearch. It's awesome. It's like a drug; once you start doing it, you'll never stop. So. There are a few questions now. There's a comment saying Anchore CLI is also one of the products from Anchore, and it helps with CI/CD builds. Yep, yep. Yeah. That's right. That's right. Then Kurt asks, is Grype publishing the SBOM data on these containers publicly, so people don't have to grind through them all, and then says never mind, it's in that repo. Well, so actually, did I check them in? I think I did check them in. No, I did not. I did not check them in because it's a lot of data, and GitHub doesn't like a lot of data. So, no. I mean, in the context of that question, Grype is just a tool, and it's up to you to run Grype and then decide what to do with the output. In my case, I put it in Elasticsearch. It's up to the user, right? Run the tool, get the data, do something with the data. And I don't check the SBOMs in partially because they change a lot. Obviously, I ran this data, it was maybe Friday, I think, the last time I updated it. And so if I run it today, the results are going to be drastically different, because the SBOM content from Friday to now changes.
And obviously, when you have a ton of SBOM content and you keep checking it into GitHub, GitHub yells at you if you use too much space. So I don't want to do that. Yeah, and then there's a new question. Is there an easy way to diff two SBOMs from different points in time? I don't know of anything at the moment that makes that easy to do. I mean, that's one of the things I want to do with this particular project: being able to put in point-in-time SBOM content and then also being able to ascertain some differences. But yeah, that's a good question, and I don't think there's an amazing answer to that today. But again, this is a very immature industry. I mean, we talk about some of the difficulties with vulnerabilities, and we've been doing vulnerabilities for 20 years and we're still really bad at it. Whereas SBOMs, SBOMs are what, a couple of years old at this point? I mean, I would say it's been probably the last six to 12 months that they've been getting significant attention. So I think there are a lot of interesting things happening in this space. It's one of the reasons I was attracted to this particular project: this is really interesting data, and what are people doing with it? I mean, I would love to expand this to the top million container images or something like that, but for now I'll settle for 155. Perfect, I think that's about it for the time that we have today. We can have like 30 seconds of final words from you, if you want to shout out anything or so. I mean, I think the biggest shout-out is just, get involved, you know? Syft and Grype are open source, there's this little project, there's tons of open source happening around SBOMs and vulnerabilities, and there are so many ways to get plugged in. I assume if you're watching me do this, you have some interest in some way in this data or SBOMs or vulnerabilities or whatever.
And it's a ton of fun, and there's so much room in this space. It's so new, and it's so exciting to be working on something brand new like this. It's absolutely lovely. And I guess just thank you everyone for coming. This was so much fun to put together. I was ecstatic to have this opportunity, and I'm truly grateful for the questions and the attendees and the CNCF and everybody. It's been an absolute treat. Thank you so much. Perfect, perfect. It's been absolutely lovely to have you. And so let's wrap it up for today. Thank you everyone for joining the latest Cloud Native Live, as always. It was great to have the session here about live sifting through the top 25, or actually even more, as we found out during the session. A lot more. Yes, containers. We really also loved the audience interaction today. Thank you so much for all of the amazing questions and comments, and even for bringing some extra clarifications. That was really lovely to see. So we bring you the latest in Cloud Native code every Wednesday. Next week we will have a session on optimizing Istio with eBPF. So thanks for joining us today, and see you next week.