Good morning, everyone. I'm going to get started here in just a second and give you a little bit of an idea of what we're going to go through today. The presentation part of this session could conceivably take an hour and a half — I've actually done it that way before — but I don't want to bore you with ninety minutes of me rambling on. If all goes well, I will have a demo at the end to show how some of this actually plays out in the real world. If we lack the time to pull that off, then come down to either the Black Duck booth or, assuming I get it all built up, the Red Hat booth, and I'll be happy to show you how this works in the real world. They're in the middle of building that up.

So the company I work for now is named Black Duck Software. We've been around forever — 14, 15 years — and we're about open source security. And that's the last you're going to hear me talk about Black Duck; you'll see the logo throughout. This is a project I started in on about this time last year, to figure out a way in which we can collaboratively understand what trust means when you start talking about containers at scale.

A little bit about me. Right now I have the title of Senior Technology Evangelist, which means I get to grind code and figure out new technologies and new ways of solving problems. I've been doing open source for ages and ages. Most recently I was a community manager within the Xen family of products, and some of the cool things I did up there I'm happy to talk about at length, but they're not part of this talk. And of course you can find me at all the usual social places.

Since I never know exactly what the skill level in the audience is — technically this is an intermediate session, but that's your own subjective interpretation of it — how many people here are new to container orchestration systems? You've not really played around with them for much longer than, let's say, the last couple of months. Any hands? OK, cool. So the first part of this talk is going to be more of a grounding in what the problem space really looks like. The second part will cover some of the general use cases that are out there and what can happen if you get it horribly, horribly wrong.

So I'm going to do a bit of a review of what the container build process looks like. I'm going to assume that the organization you're working in has some form of Git repo where source code lives, and that there's some kind of approved binary repository. That could be from JFrog, that could be a Docker Trusted Registry, that could be Nexus, that could be something you've built — whatever. And for practical purposes, when you're assembling the containers that you're going to ultimately deploy, you're merging between these two things using a Docker build mechanism. Maybe you have a source code trigger feeding this, so that you're part of a CI loop where the ultimate artifact you're creating is a Docker container. Eventually you're going to test this on your machine and it's going to be pushed into some registry. When it gets pushed into that registry, it's going to be tagged. So I've got a web app here and I'm going to tag it as 1.0.1. It's going to have an implicit tag of latest, because that was the last thing that was pushed. And it's also going to have what's known as a pull spec — a SHA-256 digest, which I've shortened here to 123456.
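To make those three monikers concrete, here's a rough sketch of the build-and-push from the command line. The registry host and image name are placeholders, and the digest is the shortened 123456 stand-in from the slide — a real one is a full 64-character SHA-256 value.

    docker build -t registry.example.com/webapp:1.0.1 .
    docker push registry.example.com/webapp:1.0.1             # the push output reports the sha256 digest that was stored

    # three ways to refer to the exact same image afterwards:
    docker pull registry.example.com/webapp:1.0.1             # explicit tag (mutable)
    docker pull registry.example.com/webapp:latest            # implicit tag (whatever was pushed last)
    docker pull registry.example.com/webapp@sha256:123456...  # the pull spec, or digest (immutable, content-addressed)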
For practical purposes, these three monikers are how you can refer to the image that was just created. Depending on how you've grown up in the world, you may have some mix of these things in play, and that last one might be something that's a little bit new to you. So I want to walk through how this whole process comes into play.

This is a Dockerfile, which I'm assuming you've all seen before. It happens to be for one of the container images we ship as part of our OpsSight solution. There's the base image it's coming from — a centos7 tag out of the CentOS project. There's a little bit of software hygiene in here: I'm running a minimal update of all the packages that are in there. Then I'm copying in a bunch of stuff that's relevant to my application itself, and I've got an entry point that's going to launch it. Your Dockerfiles are going to look very similar to all of this — some variation, of course, based on what's going on, but a template is formed.

If I look at the Docker images in my environment, I will see that I have a CentOS image with a tag of centos7, I'll have this hub_ose_scanner image that the Dockerfile builds, and each will have an image ID. I have a 196E — that's the image ID for the CentOS image, and it was pulled in six weeks ago. This was actually on my development machine at the point in time that I created these slides, a couple of months ago. And 16 hours prior, I had built this hub_ose_scanner, given it a tag of 4.2.2, and it ended up with an image ID of 395. That six weeks is something to be aware of, because effectively what's happened is that the local Docker engine has cached the base image. Everything that's happened in the outside world is not reflected here on my build system — any patches, any anything. So keep that in the back of your mind.

If I look at the Docker image history on this hub_ose_scanner tagged as 4.2.2, I see, reading from the top down, my entry point — that's 16 hours ago. Two days ago, I have these file copies. That's kind of interesting. That lovely yum update I put in there to make certain I have the software hygiene? Well, that's actually seven days old. And then six weeks ago, I see the pieces coming out of CentOS 7. So that gets me into a bit of a weird security state, because I believe I'm doing the right thing from a hygiene perspective by having this yum update, but it's actually baking in a layer that is a little bit on the old and crusty side. Lord only knows what's happened in the last seven days. That becomes a key problem when we start looking at what the security of our systems is and where the trust comes from.

When I first started building this image, I thought I was going to execute that yum update exactly the same way I would have inside of, say, a VM. But that's not the way the Dockerfile processed it, and it increased my risk and decreased my trust. Now, again, if I take a look at the image IDs in here, I had that 196E — that's the one from six weeks ago. I pull that back and I see it's 196 meg. I then go and take a look at the individual layers in there, and I can see all of the subordinate layers that are pulled in, each with its own timestamp and cumulative size.
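If you want to check this on your own build machine, these are roughly the commands behind those slides. The image name and tag mirror the slide and your output will differ, and the last line is one way — not the only way — to force a fresh base rather than trusting the cache:

    docker images                                               # image IDs, plus how long ago each image was created
    docker history hub_ose_scanner:4.2.2                        # per-layer timestamps: spot the week-old "yum update" layer
    docker build --pull --no-cache -t hub_ose_scanner:4.2.3 .   # re-pull the base image and re-run every step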
So what's effectively happening here is that the Docker layered file system is coming into play and saying: I'm going to cache these layers, because you haven't told me that they need to change. And the way Docker decides they need to change is if the text in the Dockerfile actually changes. So I have this lovely command and I'm expecting a behavior, and that's not what I'm getting.

Now, moving this into an orchestration system gives me another scenario. Start with a cluster of one node — nice and simple, the world's a happy place. I'm going to pull and run the latest web app; it's this kind of orangey color. I'm going to scale that up to two replicas. What's happened is that the Docker engine on that node has this thing cached, so it just says: give me another one of these "latest", I already know what that is. I'm going to pull and run the 1.0.1 — he's got kind of a pinkish tag — and scale him up. I'm going to go and pull by that pull spec, the SHA-256, and scale him up too. Then, in my registry, I'm going to go and delete that 1.0.1 — and I can still scale 1.0.1 to three replicas. That works because the cache is on that node; it doesn't have to go back to the registry. I could even have renamed it, because tags themselves are human-definable, mutable things. If I wanted to push something else and re-tag it as 1.0.1, I could change the behavior under the covers.

Now let's add a second node in, and then let's kill that first node. The orchestration solution gives the images survivability, but some amount of time has passed. So that "latest" is whatever got pulled down as latest at the time the second node started up and enforced survivability. It could very easily be different from what you started out with. The pull spec — sha256:123456 — is going to be the same, assuming it hasn't been deleted, of course. And so you end up in a scenario where, as you scale your infrastructure, the trust of your nodes and the behavior of your applications can change. That's not quite what you expect. So the question is: how do I know exactly what I've got?

Deployments and triggers play into this, and this is a deployment config. The objective here is to abstract away replication controllers and pods and replica sets and daemon sets and so forth, such that you end up with consistent behavior. The goal is to define a set of behaviors for when to have a change in state. If I say I want to act when an image changes, then underneath the covers I have a mechanism that can say: the tag on this — really, the pull spec behind it — is different from what I knew it to be at the time, and I will pull that image down. Similarly, I can have a config change trigger, so a modification to a config map rolls out a change. This lets us define how we're going to patch systems, scale systems, and roll back systems, and for practical purposes it defines when the whole system is ready. That's the base paradigm we have in place. And of course we've got rollbacks, because no roll forward is complete without a rollback.

Now, I've done the majority of the work on this project with OpenShift. And I started out with OpenShift for no other reason than it was a nice, prescriptive environment where I knew things were going to be locked down, so I could go and experience all the security pain up front as opposed to scrambling when I'm talking to a customer.
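Going back to the tag mutability problem for a second: one way to take the guesswork out is to resolve a tag to its digest and deploy by the pull spec instead. A minimal sketch, with hypothetical registry and deployment names, and the shortened digest standing in for a real 64-character value:

    docker inspect --format '{{index .RepoDigests 0}}' registry.example.com/webapp:1.0.1
    # -> registry.example.com/webapp@sha256:123456...   (the immutable pull spec)
    kubectl set image deployment/webapp webapp=registry.example.com/webapp@sha256:123456...
    kubectl rollout status deployment/webapp            # every node now pulls exactly that content, whatever the tag does later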
And one of the things that's interesting in an OpenShift environment is this concept of build pipelines. It looks very much like the CI/CD environment we're all familiar with, where we've got, say, Jenkins or Travis or Circle or what have you as the build mechanism — it just happens to be an embedded Jenkins environment. It has a source control trigger, such that when I go and define a Git repository, it will go and create this container image, move it into an internal registry, run some number of pipeline steps that I've defined — security scans, staging tests, and so forth — maybe promoting into a different registry. And then, once that promotion occurs, a deployment trigger goes and says: gee whiz, but I want four of these. It's pretty neat. It takes a little bit of getting your head around, but it gives me a mechanism by which I can have a view of the state of what I should be deploying.

Now, if I'm going to do this with an A/B test: let's say I have two containers running — that's on the right-hand side — and there's now a new vulnerability associated with the container image underneath those containers. Well, I'll go and patch it — say, in source code, put a new dependency in place, pull the patch from upstream, what have you. The source code trigger automatically creates the patched image and pushes it into the registry, the deployment trigger kicks off, I define that I'm doing an A/B scenario, put the patched thing in place, do my testing, and replace it. And now I've got a very effective patch management model. But that doesn't necessarily gain me anything with respect to trust.

So I have to take a little bit of a step back. You can tell by the blond in my hair that I've been around for a little bit. I've been through a bunch of good ideas, and I've been around a bunch of bad ideas. For practical purposes, the one I embrace right now goes by varying names, but call it security-driven development. And I'm going to put a set of assertions up there: developers consume security information and create tests from it; release policies around security are in place; trusted components are used throughout; you've got security testing as part of your CI loop; binary artifacts are only created when release policies are met; the binaries themselves are digitally signed so you know where they come from; container images are built from trusted images and are only deployed from trusted registries; and so on, and so on.

And what can possibly go wrong? This is kind of the bulk of the talk, because what can possibly go wrong is a question of scale. My first customer for this — their definition of scale for their acceptance testing was 40 nodes changing images at a rate of about 100 a day. If your definition of scale is more than that, please raise your hand. That's about half the room. Cool. Now, the problem is that everyone has a slightly different definition of scale, and it really is one of those things you know when you get there. It's typically a scenario where you've got some form of release management process and configuration management in place. There are some government regs. Service delivery includes terms like high availability, disaster recovery, multiple data centers, service level agreements, and so forth. That's scale. Proverbially, that's when it hits the fan.
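Before I get into what that looks like when it goes wrong, one aside to make the pipeline idea from a couple of slides back concrete: very roughly, the same flow driven by hand with oc looks something like this. The project names, Git URL, and tags are hypothetical, and in practice the webhook trigger and the embedded Jenkins pipeline do these steps for you.

    oc new-app https://git.example.com/acme/webapp.git --name=webapp    # build from source into the internal registry
    oc start-build webapp --follow                                      # or let the Git webhook trigger it on every push
    oc tag myproject/webapp:latest staging/webapp:promoted              # "promotion" once scans and staging tests pass
    oc set triggers dc/webapp --from-image=staging/webapp:promoted -c webapp   # image-change trigger on the deployment config
    oc scale dc/webapp --replicas=4                                     # gee whiz, but I want four of these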
And so I love this prescient article. It came out in March and said that the easiest way to get fired in 2017 was to have a data breach. Little did we know that we were going to have a rather substantial one this year. But before that one was disclosed, we had a lovely report out of IBM and the Ponemon Institute that put some stakes in the sand: the average cost of a data breach being over $7 million, the lost business associated with it being $4 million, and, importantly, 206 days — that's over half a year — for people to recognize that they had the breach and contain it. That's the industry average. That's not so good. And then along comes Equifax. We know a little bit of the timeline, and I'm actually going to dive into it, but it can get slightly worse because, well, a couple of days later they had a 36% decrease in their stock price. Ouch. And bad things happened.

So the idea is that we should always be questioning everything. There are new data regulations on the horizon. If you've not heard about something called GDPR, it's an EU regulation that takes effect in May. We're in the US, so there's a certain level of "well, that's a European problem, we don't need to deal with it." No — we all need to deal with it. At some level we're touching a European person, and the penalty is 4% of our organization's revenue, or 20 million euro, whichever is greater. There have already been determinations that some organizations, because they touch data, could be implicated in all of this. If you're not already aware of it, make certain your organizations are planning around it.

So question everything. Where does your base image actually come from? Who owns it, and who updates it? What is the health of that image? Why should I be trusting it? If I'm building it in my build environment, should I be trusting my build servers? Is there any way a foreign container can start in my orchestration platform? Who has the rights to modify these things? What happens if that base image's registry goes away and I need to patch? What happens if an update mirror goes down? When a security disclosure comes out, what happens to my patches, and what happens to my system? Who deploys patches, and what is the procedure under which they're deployed? And if you come out of this saying, good Lord, man, my brain hurts, this is an awful lot of work — yeah, it is.

So let me explain a little bit about how the world works for software development today. We're at an open source conference, but the majority of organizations still have processes built around proprietary commercial software. They expect that the procurement manager has some level of relationship with the vendor; that when something goes end of life, they're going to be notified — gee whiz, here's an upgrade, it's going to cost you a boatload of money, but what the hey? Or, here's a set of patches, and we know you're kind of in the sunset of this version, but why not? Defined security processes — all part of how the proprietary software world works.

Open source is very different. With open source, as most of you know, if you're not engaged with the community, you know nothing. The community could decide that the version you're using was the worst possible implementation of that functionality on the planet. And they collectively went off to the beach, got themselves a bunch of margaritas, had a great old time, came back and said: yeah, we're not going to make those same mistakes again — and created a version two.
And if you didn't know that that happened, and you want to update version one, you're effectively owning a fork. Security disclosures don't look the same way either. Throughout this talk I'm going to have a bunch of things highlighted, and for the people in the back of the room, I'm going to actually read out what I've got highlighted. This is a security patch release note from MediaWiki from a few years ago, and it's pretty typical of how these things play out. The first bit of yellow says: this is a maintenance release; various special pages resulted in fatal errors. That's their security disclosure, which is incredibly helpful if you're a MediaWiki admin. They also put a note about end of life in there that said: hey, please note that 1.24.6 marks the end of support for the 1.24 series of releases. Technically this ended a few weeks ago with the release of 1.26.0; however, 1.24.5 had issues, along with other versions, so it seemed unfair not to fix them. That's a really cool thing, and that's kind of how open source works — we're all trying to make certain the world is a good place for our consumers and do the right thing. And you'll note that this is a combined 1.26, 1.25, 1.24, and 1.23 release. That's not how most organizations expect to consume security notes. They don't expect end-of-life notices to be embedded in them. They expect some level of awareness. So the onus is on us, as the consumers, to do the right thing.

Now, I mentioned Equifax, and one of the things I like to do is take a given vulnerability and decompose it: who knew what, when, and what was the real problem? For the whole Equifax scenario, if I go back to August of 2012, that's when the code that ultimately became this bug was introduced. Oops. And that's what it looked like — those are the lines. Specifically, there's a bunch of checks up front in this buildErrorMessage routine, and then there's a return of a localized piece of text. That's it. That's the piece of code that ultimately became the problem. In between, Struts 2.3 was released that November. Then in 2016, Struts 2.5 was released. All off the same branch of code — 2.3 effectively became a maintenance fork. Then we have the disclosure and the patches available. Those of you who are familiar with Git formatting will note that there is one line changed. That was the patch. And that one line was that the localized find-text call could actually return null, so we needed to test for that and make certain we were returning the right thing. If you go back and look through the history, the protection code that was originally there kind of devolved over time, so it wasn't actually protection code any longer. And that's a scenario where code that works can become problematic when you end up with a review that's a "looks good to me" — because unless you've got the tests in place, you don't know that what was working has in fact changed into what's broken.

So that's the 6th of March: the disclosure is published, as in the media is aware of this. A week after that, the National Vulnerability Database — where more people get their security information than from Hacker News, and that's a whole separate talk — finally has information on this. Up until that point in time, the National Vulnerability Database entry for CVE-2017-5638 was a placeholder. No information whatsoever. Unless you were actively engaged with the community, you knew there was a something, but you didn't know what the something was. So at this point, trust has been broken. That container no longer has a reason to exist.
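As an aside, the kind of archaeology I just did on that buildErrorMessage routine is something you can do on any component you depend on — a couple of Git commands will show you how a piece of protection code devolves over time. The file path and line range here are stand-ins, not the actual Struts source layout:

    git log -L 100,140:src/main/java/com/example/SomeInterceptor.java   # full commit history of just those lines
    git blame -L 100,140 src/main/java/com/example/SomeInterceptor.java # who last touched each line, and when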
Now, the first time I gave this particular talk, I happened to be doing it for a US federal event, and it just so happened that Richard Smith, by that point the former CEO of Equifax, was giving his prepared testimony before the US Congress. That's something you never want your employer to be doing, just so you know. And I'm going to again read the highlighted elements. On March 9th, Equifax disseminated the US-CERT notification — a.k.a. the notification from the NVD — internally by email, requesting that applicable personnel responsible for Apache Struts installations upgrade their software. We now know that the vulnerable version of Apache Struts within Equifax was not identified or patched. It might have been — it could have been regressed and rolled back. You never know. On March 15th, Equifax's information security department also ran scans that should have identified any systems that were vulnerable. Unfortunately, however, the scans did not identify the Apache Struts vulnerability. Prepared testimony, October 3rd.

So we've got a bit of a timeline here. On May 13th, some attacks were truly successful, and it took them until the 29th of July to figure that all out. And for those of you who want to take a picture of this, this is a good time to take a picture. We had almost 1,700 days from the time the code was introduced to the time the patch was available, and various morphs of that code along the way. Seven days of delay in the National Vulnerability Database. And 144 days from the time the attack occurred to the time it was discovered and mitigated. Equifax did a better than average job of figuring this out. That's kind of a problem.

So how can we actually do a little bit better? We need to be establishing trust, and we need to be doing it at scale, because problems are going to creep in and problems are going to keep moving through. We have to first accept that security is a layered process, and it starts with understanding the role that security tools play in defining trust. If I have the realm of all possible vulnerabilities, the first scenario is: I have some static analysis tools, I have some fuzzing tools, I'm doing some injection analysis. For practical purposes, this is all going to focus on code that your organization creates. You're going to have one hell of a time going to your CTO, VP of engineering, or VP of ops and saying: I want to go and point this really expensive scan tool at the Linux kernel, OpenSSL, Tomcat, what have you. He's going to say — she's going to say — that's somebody else's problem. And they're right. That's looking at your code, not upstream.

Vulnerability analysis, or dependency analysis, or software composition analysis — depending on which term a given analyst uses — is looking at what is in fact in your environment. So: 3,000 disclosures in 2015 found by researchers. Over 4,000 in 2016. As of the 1st of November, over 13,000 disclosures this year. That's about 35 a day. In your organization, how would you go and determine whether or not your infrastructure today is vulnerable to a given CVE? You'd have tooling for that. But unless you're continually updating that tooling and running it on a continuous basis, looking back across the history of time, you might miss a few things. And that's where things get really problematic.
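To give a sense of why that tooling matters: the naive, by-hand version of "am I exposed to this CVE?" looks something like the commands below. The paths and package names are only illustrative, and this only catches the things you already know to grep for — which is exactly the gap that continuous composition analysis is meant to close.

    grep -rn --include=pom.xml 'struts2-core' .     # crude first pass across checked-out source trees
    mvn -q dependency:tree | grep -i struts          # within a single Maven project, catches transitive dependencies too
    rpm -qa | grep -i openssl                        # on a host, or inside a running container, for what's actually installed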
So I've got a couple of other decompositions. How many people have heard of Mirai? This was the 620-gigabit-per-second attack on Krebs on Security last year. Oh — not too many people. Well, this was an attack that came through a bunch of Internet of Things devices: doorbells, nanny cams, thermostats, fridges, internet-enabled toothbrushes, Barbie playhouses, all kinds of ways in. It was a vulnerability that was originally disclosed in 2004. And if you read the security disclosure for it, it does not describe anything like what the world looks like today. In fact, it talks about OpenSSH allowing TCP forwarding. If you go one step further and look at the man page, it has the most ominous statement you'll ever see — and if you see one of these, run, and run fast — a statement that says, of this flag: this is not a security vulnerability. Let that one sink in for a little bit. This would only be exploitable if you were connected to a public network and used a well-known default password. Does admin/admin work for you? Admin/password? That's what was used on a lot of these devices. Now, the interesting thing in all of this is that when you read the original disclosure, you don't necessarily apply it to what the world looks like today, because the descriptions don't match up. And that's true for a lot of these older issues.

The same vulnerability that befell Equifax befell the Canada Revenue Agency, Canada's version of the IRS. The difference in behavior is that they detected it pretty quickly. They detected that it was in their e-file system, and they went on the evening news and said: we have turned off our e-file system in the middle of tax season, because we have found a vulnerability that could disclose personal information for every single citizen such that your identity could be compromised, and we're going to fix it. We'll let you know when we've turned it back on. And if we need to, we can extend tax season, because, well, that's kind of our job. That's the difference in behavior between the two: how do we have responsible disclosure?

And then the last scenario: why, please God, are there still 200,000 publicly accessible websites that are vulnerable to Heartbleed in 2017? Should we not have learned something by now? The moral of this story isn't so much about Heartbleed; it's about the long tail of systems that we manage. We all have legacy infrastructure. It's all part of our trust scenario. Bad things can happen. We need to be continually understanding what it is that we own and reassessing that trust.

And for me, that means focusing on factors that truly impact risk. I've talked about vulnerable open source components. Forks, and forks of forks — code ends up in your environment from a multitude of sources. There are something like 1,200 forks of OpenSSL. Which version are you running? I've got five minutes left, so this is going to be fun, blasting through a few more slides. The impact of point-in-time decisions: I mentioned going to the beach and having a great old time. But what if that beach experience meant version 1.2 of a given component was no longer going to be developed, and so there's a massive change set in your future to go and consume version two? How would you know? What kind of impact is in that scenario? Is there going to be a security patch for something you've got? What does the commit velocity look like? We don't patch containers, we re-spin them — but we still have to look at how patches work. So again, looking at the same Equifax patch: if I move one version further along, it fixes the vulnerability that befell Equifax, but it introduces new issues. So is the patched version worse or not? That's subjective.
And there were three other scenarios where the patched version is actually worse than where you are, at least in terms of the quantity of issues.

I mentioned the product we've been working on; I'm going to put this into an OpenShift context. For those of you who don't know, OpenShift is a fairly opinionated distribution of Kubernetes. It supplies a bunch of the things you would otherwise have to decide for yourself when composing your environment. One of the things it adds is an integrated registry and a mechanism known as an image stream. For practical purposes, an image stream is nothing more than a mapping between that registry representation and a bunch of tags for where an image comes from — just think of it that way. Of course, you can have external registries that are managed by this, and some of the representations will end up inside the internal registry with the image stream mappings.

What we did was put a piece of glue in the middle, and we monitor for every single scenario under which an image can be utilized within a cluster: a new image created through a build mechanism, imported into the system, used within a pod as a container or an init container — it really doesn't matter. When it comes into existence, we figure out whether we need to scan it, we do all of the analytics that we do — I can talk about that at the booth — and we assess our policy. The important thing is that we throw some annotations back onto the objects, so that you can do something like a kubectl describe pod, look for that label, and then say: wait a minute, I'm running, say, 50,000 pods right now — which of them are vulnerable to this new thing, right now? A simple kubectl command, a simple oc command — oc being OpenShift's version of kubectl. Get that all in place, then add notifications, so that we can get to a magic state. And the magic state for me is: within, say, an hour or two of a new vulnerability coming out — regardless of how many nodes are in your cluster, regardless of how many pods are running, regardless of how they're composed — you can answer your boss's question, "are we impacted by this?", and get that answer that quickly. That's the goal.

Fundamental to this is sources of truth: identifying where things are, and maintaining a security source of truth so that you know what's going into your environment. We're not the only source of truth, and what I'm asserting is that other tools — those static analysis tools, those fuzzing tools — need to become sources of truth as well. That might be some of what we see from Grafeas over time; today it's still very much more of a spec than anything else. Layer the container security pieces in place. In an OpenShift environment you've got a very locked-down host running a minimal Linux. You've got something called OpenSCAP that can assess the policy and security state of that host. With us, you can take a look at every single container image and the open source risk associated with it, and bring that all back so that it's automatic. There's no human interaction required, because everything here requires end-to-end visibility — otherwise things fall through the gaps. During his testimony, the former CEO of Equifax said: this guy, this team, was responsible. Well, maybe they just regressed and rolled back something that hadn't been patched because the patch broke something. This happens. You want to be able to know, so that you can assess that risk.
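To make that concrete, here's roughly what the day-two workflow looks like from the command line. The pod name and the annotation and label keys are placeholders — the real keys depend on what your scanner writes back — but the shape of the query is the point:

    oc describe pod webapp-1-abcde | grep -i vulnerab                        # eyeball the scan annotations on a single pod
    kubectl get pods --all-namespaces -l example.com/policy-violation=true   # every flagged pod, cluster-wide, in one command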
So I probably have time for just one question with one minute left, and I apologize for not getting to the demo — but like I said, I can do this in about 90 minutes. Any questions? [Audience question.] Yes, correct. The solution is focused on containers. That doesn't mean it wouldn't work for, say, virtual machines within an OpenStack or CloudStack environment, or AMIs within AWS. It's just that that's not what we designed it for. Cool — thank you, everyone.