Okay, thank you for bearing with us once more. Good afternoon, welcome to Kubernetes AI Security Playbook. We will introduce ourselves momentarily, but thank you very much to the organizers for putting on another wildly successful Open Source Summit in a distant corner of the world for us from London; we feel very welcome here. So thank you for having us. I am Andy from Control Plane. We're a cloud native security engineering consultancy, and I'm proud to be involved with some other illustrious organizations. I work as co-chair in CNCF TAG Security, where we look to help assure open source and CNCF projects as they rise through incubation and graduation; we'll talk about some of the work that we do there later. I'm also CISO at OpenUK, which is a charity that works to advise the UK government and try to avoid bad regulation. I'm very lucky to work with some other people who are here today, and there are various other things that I've been involved with over the years. And my illustrious colleague, Jack.

Hello, yes, I'm Jack. I've been at Control Plane for a while now. I've got an interest in engineering and security, and I've done lots of threat modeling at Control Plane. I did a talk yesterday at OpenSSF Day related to software and the kind of pipeline for that; today we'll be looking more at the pipeline for data and ML. I'm a package maintainer, and I'm on the in-toto steering committee, helping out with that. So that's me.

Wonderful. And for posterity, this is what Control Plane does, and these are some of the places we're very proud to be involved with. And with that, what are we going to talk about today? We'll run through a brief introduction to the AI ecosystem, briefly talk about the levels and types of AI and different use cases, and then move to focus on the tooling and the mechanisms of the AI and ML lifecycle: the delivery of models and, ultimately, inference. Based on that, we'll try to gain an understanding of the problem space of securing such a lifecycle, which, as we'll see, is a non-trivial task with a lot of opportunities for malicious users to inject problems, issues, poison, all sorts of things. Then we'll get into the most interesting technical parts, where we will form a threat model of a hypothetical satellite communications AI system. We'll use a threat modeling methodology, which is something very dear to my heart, navigate through the ML security problem space, and identify risks and security controls to mitigate the most common, and some less common, threats affecting AI and ML.

So, let's start with an introduction to the various different types of AI. The current and most basic level is what experts call artificial narrow intelligence, or weak AI. It's developed and designed to carry out a finite number of tasks, such as reactively answering human questions, and can operate only on a predetermined set of data. Examples are Siri, Google Assistant, Alexa, and the student's best friend, ChatGPT. The intermediate level of AI is artificial general intelligence, or strong AI, which is an aspiration for the next decade and remains theoretical at this stage. Examples could be HAL 9000, Data from Star Trek, or Skynet in Terminator. The major differences here are that this is a human-like AI and, more importantly, can adapt to new data rather than operate on predetermined sets. And then we get to the truly hypothetical and academic level with artificial superintelligence.
This is possessing superhuman, transcendental properties, and it's characterized by a continuous sense of self-improvement. Examples are Ultron from the Marvel universe, or Giskard from Asimov's novels.

So it's a fairly wide problem space, the definitions are a little bit muddy, and there are a number of branches. In order to clarify exactly what we're discussing and provide an overview, we'll talk about where ML sits in this landscape. Machine learning, and this is the use case we'll focus on, is the specific branch of AI focusing on the mathematical formulae and statistical models that computer systems employ to carry out operations without being explicitly programmed: delegated intelligence. Instead, they use machine learning to take what they've learned and, from the present context, generalize, generating new programs and new understandings automatically, essentially modifying their execution. To subdivide even further, within the ML branch we have three different types, and here you can see the typical techniques and use cases for each. We have unsupervised, reinforcement, and supervised learning. Broadly, unsupervised learning is analyzing data without predefined labels, looking for patterns and clusters in unstructured data; dimensionality reduction is an academic way of putting it. Reinforcement learning, which is how we've seen DeepMind do a lot of things with learning to play games, is where an application observes its environment and has some feedback mechanisms, rewards and punishments, for performing specific behaviors; game playing and robotic control are the typical use cases. And finally supervised learning, where humans actually label the data. There is some intersection between them because, for example, ChatGPT is supervised reinforcement learning, where humans intervene in the feedback loops to tune the model. So yes, they intersect somewhat. We've got classification, regression; at this point, we don't necessarily care so much about the differentiations between them. So that is the quick background to AI. Thank you very much.

So yeah, if we're gonna look at an AI platform, an AI-based system, we need to kind of look at the entire lifecycle, so sorry for that. Yeah, this will be how you get from your data, you train it, create the applications, deploy them, kind of everything in this space. Click. There we go. So yeah, we're gonna take a look at that process. This is a cycle that many of you will be familiar with. If you're not, this is the kind of DevOps lifecycle, something we've had to look at many times over the last decade. It's the approach that people typically take to build applications, deploy them, keep them running, improve them. It's kind of your continuous improvement process. And when people went to AI, they were thinking, well, maybe we can get something similar to this. So here we've got the data ML loop, which is obviously a very similar structure: we're gonna be collecting the data to start off, we're gonna curate it and process it, and then we're gonna develop our ML. So the next level is combining the two, because we don't just want a model, we want a model to be useful. So we're gonna integrate it into various applications. You don't just stare at a model; for, say, ChatGPT, you interact with it through a chat. That's kind of a key part. If you can't use it, it's not of benefit to you.
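To make that integration point concrete, here is a minimal sketch of wrapping a trained model in an HTTP endpoint so an application can actually query it. The model artifact, path, and feature format are all hypothetical, and this assumes a scikit-learn-style serialized model:

```python
# Minimal sketch: exposing a trained model through an HTTP API so an
# application (a dashboard, a chat front end) can use it.
# The model path and input shape are hypothetical.
from flask import Flask, jsonify, request
import joblib  # assumes a scikit-learn-style serialized model

app = Flask(__name__)
model = joblib.load("models/anomaly_detector.joblib")  # hypothetical artifact

@app.post("/predict")
def predict():
    features = request.get_json()["features"]  # e.g. one telemetry vector
    prediction = model.predict([features])[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(port=8080)
```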
So you need to combine these processes in order to build your model, improve it, deploy what you're gonna be using to interact with it, and continue that process. Throughout the last couple of years there's been a focus on security to do with DevOps, so many of you have heard the term DevSecOps. That's the process of taking this section of the graph and making sure the processes work well and are secure. But there's a big chunk of this graph that isn't covered by that. So that's kind of what we're gonna get into today: securing the whole process of dealing with data. It's a bit different, especially with massive data sets; you can't treat it exactly the same way as software. So there we go. Back over.

Okay, so to understand how we're going to perform these security enhancements, we need to understand the problem space that we're dealing with. As usual with newer and cooler stuff, we have more complexity, more layers that we need to assess and secure by design, secure by default. These are words now coming into ordinances and edicts from the Biden administration. We have the good old infrastructure layer that we know so well. AI is still running on tin or virtualized machines. There may be custom GPUs or tensor processing units, but ultimately these are all quantifiable processes we can reason about, be it in standard or in vector form; it's just infrastructure. There's no explicit magic there. But then we move up to the orchestration layer, which is used to actually define and execute the operations on the data and on the model itself. And it's equally important, and this is what we'll get into, to secure the data layer and associated operations. Data plays such an important role in MLOps: it is, of course, the lifeblood of any organization, but even more so when we pump all our IP into one place and then put an API on the front of it. So it's crucial to secure these operations. And finally, at the top layer, any application leveraging the models is an entry point for an exposure of the model, which we need to secure. Of course, if an attacker is able to get in front of that, they can also falsify the results from the model, but we'll try and focus on those core bits in the middle. We'll assume we have a benevolent application developer with a reasonable expectation of security, and that we're running on a shared responsibility model with shared infrastructure in a cloud provider of some description. The exploding complexity of this is reasonably overwhelming, so we won't get into it so much now, but this is a high-level view of the end-to-end process from data to end user. And of course, with every interaction between components on the entity model, there is an opportunity for hostile interaction. So the main challenges around securing MLOps have to do with the complexity, the dependencies, and the data flows. It's a multi-dimensional problem that encompasses, amongst many others, data security, model security, and access control. Of course, there is a supply chain involved in all of these things, and we solve this with threat modeling. Thank you.

So yeah, threat modeling is something we like. And in this case, we've got kind of a massive problem. When you start threat modeling, you really need to work out what you're gonna be doing. So we'll take a look at the hypothetical, and you should start to get the gist of what we're focusing on. So in this example, there is a company. They have several spacecraft, various different satellites.
And these satellites are sending massive amounts of data through to base stations. As you can see on the graph, this will be insane amounts of data, thousands of data points a second. And then this gets sent through to a central location, or maybe a handful of locations on Earth, so that you can process it, and you can see a little diagram there, a little dashboard. So the idea is they're using this data to produce dashboards about the status of these craft in space. You can't exactly look at them to see if they're doing all right, so you need all of this data and you need a way to go through it and find out if something's wrong, because if something's wrong with your satellite and it fails, that's a big problem.

So on this slide we've got like a little tree. This is essentially the kind of legacy process of trying to work out if there are any issues. You can manually write a program or a policy that has various different ifs and elses to process the data and work out: is there an issue? Is there something out of spec? (There's a small sketch of this after this section.) But this is a reactive process, it's very manual, and there are gonna be things that we miss here. The company in this case wants to do something that's less reactive, more predictive, try and see trends: the fancy machine learning stuff. But I'm more focused on the security part of these things.

So this is the stack that they've got for what they're proposing to do for the machine learning. We've got an infrastructure layer; you're gonna be running your compute on computers. We'll have an operating system, in this case Linux. And then we've got Kubernetes on top of that to orchestrate everything and distribute it. It's a common thing with machine learning that you need a lot of compute, and in order to distribute your jobs, you'll want something to do that. So we've got Kubernetes. And then on top of that, we've got the machine learning part, because out of the box, Kubernetes doesn't really know what to do with ML. So Kubeflow is a great tool that you can use to distribute your TensorFlow jobs and get your machine learning and training done. And then we've got the front-end apps. That's the part where you're going to be using the models that you've created, and then hopefully seeing that there's a problem and being able to address it before the satellite explodes.

And then, yeah, with threat modeling, you cannot look at every single individual detail of the entire universe. We have finite lives, so we need to do finite tasks. That's why we've got a couple of assumptions and a couple of things we're gonna just completely cut out. An assumption here is that we are dealing with pretty serious adversaries; this could be nation states, because we're dealing with a serious topic such as space and satellites. And then we've got the fact that the machine learning in this case will not be directly talking to the satellites. It won't have any control, so that's not something we'll need to think about in this exercise; it'll just be for the anomaly detection and the reporting. Since it's not gonna be talking to the satellites, we're gonna have the data traveling on the Earth through these orange paths. And yeah, most of the lower-level stuff will be out of scope.

So this is another diagram. It shows the architecture and the data: where it lives and where it'll go. We can see at the bottom the kind of operating system; we're not gonna think about it today.
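Here is that small sketch of the legacy rule-based approach mentioned above: explicit if/else checks over telemetry. All field names and limits here are hypothetical.

```python
# Sketch of the legacy, hand-written approach: explicit if/else checks
# over incoming telemetry. Field names and thresholds are hypothetical.
def check_telemetry(reading: dict) -> list[str]:
    alerts = []
    if reading["battery_temp_c"] > 45:
        alerts.append("battery temperature out of spec")
    if not 27.5 <= reading["bus_voltage_v"] <= 29.5:
        alerts.append("bus voltage out of spec")
    if reading["signal_strength_db"] < -120:
        alerts.append("weak downlink signal")
    # Reactive by design: this only catches conditions someone
    # already thought to encode, which is exactly the limitation
    # the ML approach is trying to address.
    return alerts
```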
We're gonna try and keep it to the higher-level stuff, the stuff more relevant to machine learning. You could do, say, a separate threat model just for Kubernetes; it would apply to most things. We're gonna look at the parts that are important just to ML. The infra admin: there are gonna be some aspects in this threat modeling that go over parts of their work. And then the important part is the person interacting with Kubeflow; that's gonna be the people doing some of the access to the data. And above that, we've got the data scientists that are gonna be trying to train the data. Train the model, sorry, from the data. You can see we've got some little platters; that's where the data lives. We're going to ingest the data from the orange line there, we need to then prep the data, and we need to then train: these are pretty common aspects of machine learning. And then once we've got our model, we need to deploy the application that we're gonna be using, and monitor it.

And here we go. This goes a bit more into the data scientist's job, what the data scientist is gonna get up to, and also where the data lives. If we think about the process, we've got the ingestion, the cleaning, the normalizing, all that stuff. And what are the dangerous bits, what are the issues that we can think of, what are the dangerous attacks that people could do? If we take a look at this, we've got the data coming in and being stored in our initial storage. At this stage, people could potentially mess with the data. They could add erroneous data, which could potentially ruin a model. They could modify some of the data to make things that potentially look bad look good, and just generally reduce the quality of our data. And if you put rubbish in, you'll probably get rubbish out. Then at the next stage, we've got the cleanup process. Obviously, if our data is in a good state, cleaning it up should be a bit easier, but if that process isn't undertaken correctly, then you won't have correctly prepared data. At the labeling stage, if people are able to put incorrect labels in, that's also going to cause issues. And then towards the bottom, again, we've got a similar thing: both of the storage points have the same risks.

And then here we've got controls. Whenever you have attacks and threats and risks, you think about how you can mitigate or control them. Things we can do to protect against these parts: we can sign some data. We can't sign everything, because of just the sheer amount. You might be able to sign every piece of software you're building, because it's relatively small, but machine learning data sets are massive. So you can sign a subset of that data, and then you can audit a subset of it. If you see some dodgy stuff, then that should tip you off. But yeah, you can't do it for every single piece of data. You want to make sure that access to that data is least privileged. In theory, you don't want any humans to have any access; you just want the systems to have the right access. Systems that write should only write, and systems that read should not be able to write to those data sets. And then networking: there are some parallels to do with storage, especially when you've got something like Kubernetes where you maybe aren't physically mounting the data; you're probably in a data center with centralized storage that's on the network.
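As a sketch of that sample-and-sign idea: you hash a random subset of dataset shards and sign the resulting manifest as evidence, so a later audit can check the sampled shards still match. This assumes an Ed25519 key that would really live in a KMS or HSM; the shard paths and sample rate are hypothetical.

```python
# Sketch: sign a random sample of dataset files, since signing every
# record in a massive data set is impractical.
# Paths and sample rate are hypothetical.
import hashlib, json, random
from pathlib import Path
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def sample_and_sign(data_dir: str, sample_rate: float, key: Ed25519PrivateKey) -> dict:
    files = sorted(Path(data_dir).glob("*.parquet"))            # dataset shards
    sample = random.sample(files, max(1, int(len(files) * sample_rate)))
    manifest = {
        str(f): hashlib.sha256(f.read_bytes()).hexdigest() for f in sample
    }
    payload = json.dumps(manifest, sort_keys=True).encode()
    return {"manifest": manifest, "signature": key.sign(payload).hex()}

key = Ed25519PrivateKey.generate()   # in practice, loaded from a KMS/HSM
evidence = sample_and_sign("datasets/telemetry", 0.05, key)
```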
So similarly, the least-privilege stuff applies to networking with data. And then, going back to the cleanup stage: we want to make sure that the data scientists, the humans interacting with this data, have the work that they're doing tracked. If they're doing specific work, you could keep track of that and then have someone sign that they did that work. You can look at the permissions on the tools they're using; in this case, it's Kubeflow. So you can make sure that Kubeflow only has access to the right things, and you can make sure that your users only have access to the right parts of Kubernetes. You can also do spot-check reviews, or manual reviews, of the processes that these data scientists are going through, to make sure things are accurate. And then, deeper into the storage, on the pipeline storage, we've got some extra policies to make sure things are correct.

And then we've got the training and the tuning. This is slightly further down the lifecycle. Again, if we think about problems that are possible, we've got various different tampering again. You'll see tampering a lot as a risk when you're dealing with something like data, because you want to keep it to a high grade of quality. As I mentioned before, you can just put complete rubbish in and poison the data set. And then, yeah, if you have this kind of pipeline that is doing these automated, not builds, but kind of training sessions, you want to make sure that those are provisioned correctly.

If we go to the controls, we can make sure, for the scientists that are interacting with the platform, that they are authenticated, and not just authenticated with, say, passwords; you want to have MFA. Physical keys are pretty good, and passkeys are kind of a new concept to replace passwords, which is something to look into. At this point, you might be actually writing down various different weights and things to apply to your model as configuration in a code base. So if you are committing code, the same things apply as with a developer: you should look to sign commits, and you should have the two-keys principle, which is essentially that you have peer review, so no one person can merge a change to the weighting. And then, yep, we've got the same kind of policy stuff around the storage and the access to that. And then I'll pass it back over to you.

Thank you very much. Yeah, as Jack says, with such a high volume of data, you can just map it to the CIA triad. The confidentiality of the data is often very important when there's a wealth of intellectual property that's potentially in these models. The integrity: as soon as something is poisoned, it's not as simple as an Office Space-esque putting of fractional pennies in someone else's account; it's far more insidious and surreptitious. You often can't extract training data from the front of the model, so you can't quantify why it's made a decision, which means that that poisoning becomes almost impossible to reason about. And then the availability: you can probably tell if the model's offline, but those two initial pieces are repeated problems across this problem space.

When we get to deployment, we have been through the process of taking that core data, training, tuning. We've got something that we believe is operational for whatever level of test integration we've performed, probably a manual step. It is difficult to get typical deterministic testing out of non-deterministic models.
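Since deterministic pass/fail assertions don't transfer cleanly to a model, one approach is a statistical acceptance gate: score the candidate on a held-out sample and only promote it if it clears a threshold. A minimal sketch, with the threshold, data, and model interface entirely hypothetical:

```python
# Sketch: statistical acceptance test for a non-deterministic model.
# The threshold and dataset are hypothetical; `model` is any object
# exposing a .predict() method.
def acceptance_gate(model, held_out, labels, threshold=0.95):
    predictions = [model.predict(x) for x in held_out]
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    if accuracy < threshold:
        raise RuntimeError(
            f"model rejected: accuracy {accuracy:.3f} < {threshold}"
        )
    return accuracy  # can be recorded and signed as test evidence downstream
```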
So it is more about sampling. You can see how difficult this is by the advancement of ChatGPT between 3.5 and 4; it looks like sometimes it regresses in quality of output. Same thing with Google's Bard; those are the two big ones right now. So testing just means, well, okay, let's just ship what we've got right now; hopefully it's improved in quality.

And we get to this question of serialization. We have a model that exists in a training set, or in a training environment, and we essentially turn it into something that can be deployed into a different environment. This trained model is converted into files which represent the model, and those are then loaded into an environment or an application that allows them to be queried and used, and the inferences questioned and gained. This is a point of compromise for the entire system. I mean, as with any pipeline, it is cumulative and builds upon previous stages. So at this point, this is kind of like a packaging step for an open source application: you compile something, but that is distinct from packaging it and pushing it to npm or Maven Central or whatever it looks like. A lot of these things are analogous to the way we build software; we're just building a much bigger thing, without the universal possibility of actually compiling it ourselves because of the prohibitive costs of compute. So we care deeply about mitigating these threats in particular.

What else do we have? If we start from the serialization point, again, there's this tampering or poisoning. Tampering might be surreptitiously altering the data; poisoning might be, as Jack says, just putting some junk in. Maybe that reduces the quality of output; maybe, from a sampling perspective, it reduces the percentage of times you get the correct output. They're essentially the same thing for our purposes here, although certainly different. We actually want to test the model somehow or other. We could have, for example, a rogue model that is able to answer the test suite because it's been pre-trained on the actual test data, but then, in wider usage, returns malicious data. From our satellite perspective, that could be pretending that all the telemetry is good when the craft has actually been taken under control by a tractor beam or some equivalent spacey thing. We're looking down at the actual image registry, again, for tampering and poisoning.

The serving of the model then, hold on, let me just work through this properly: we've got another storage question with tampering. Then we get to serving, which is where the application that the user is querying to interact with the model is responding. As I mentioned at the beginning, if I was an attacker and I was able to get into the request path, be that intercepting an HTTP request, sitting behind a terminating TLS middleware box, or just intercepting traffic and rewriting it, that invalidates the whole thing. But that's standard; that's no different from anything else. It's the question of trust that is put into these models, to actuate or make decisions for people, that makes it a little bit more, again, non-deterministic and difficult to detect from the front end. And that model abuse via an insecure HTTP endpoint assumes that at some point we're behind the firewall that's performing our termination; we're assuming that the transport layer security is fine in front of that. And finally, there's the serving-to-front-end point, where the model is misleading via data input.
Again, that comes from back in the model, as the cumulative pipeline builds. So the point of these is that we get back to things that we recognize from supply chain security. A lot of what we've spoken about we can look at as a kind of MLOps with SLSA, as in the supply chain levels for software artifacts. The supply chain security pieces we do are related to signing. Cryptographic signing says that the person in charge of that key at that time sufficiently liked the thing in question to give it a stamp of approval of some description. The quantity of data that we're talking about here makes it very difficult to do this for each and every piece, as Jack's pointed out, but we can perform this as interstitial signings: we can sign evidence about training and log data, and we can sign the hash of the model files, for example. So we know that as the data is passed along the chain from producer to consumer, and on along that continuing chain, the provenance of the thing being passed along is sufficiently validatable that no one can intercept it and get in the middle there.

We have, yes, digital, oh, hold on, let's go to the left-hand side. So, digital signing and evaluation of attestations: in Kubernetes terms, this is admission control. We care about: is this thing signed, do we trust the person that signed it, and is the signature reasonably fresh? Is it within an expiry window that we trust? Because, of course, if we look at other classic supply chain frameworks, the update framework (TUF) and in-toto, which are similar in the way they deal with these things, there's the freshness and recency attack: a replay of a piece of software that, at the point it was released, was signed and validates everything; but if it's two years old and someone gets in the middle and replays it, it might pass all of these checks if we're not also checking for freshness and recency. Often we see keys that don't expire for non-trivial numbers of years.

Through into testing: again, the digital signing of every component that we pass through just gives us that one extra layer of confidence that the thing that we've received is the thing that someone else meant to give us. Admission control, of course, and test results attestation. This is something that is often not necessarily bypassed, but not given due care and attention. The first real supply chain consideration framework, let's say, for cloud native was in the form of Grafeas, which came out of Google, based on their Binary Authorization bits and bobs, and it took them about two years to argue that they needed a way to sign abstract test evidence. Because many times, especially in highly regulated industries and environments, somebody is probably doing a visual check or taking screenshots of things, and there has to be a mechanism to recognize that in our fully automated pipelines, because, unfortunately, not everything is automated and built with an API.

We get to, yeah, of course, request throttling and HTTP traffic sanitization. While denial of service for a model, it's not really that serious: I mean, if it's critical, it's not gonna be on the public internet. And HTTP traffic sanitization kind of depends where you are.
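To make the signing-plus-freshness idea concrete, here is a sketch of an admission-style check that verifies a model artifact's signature and rejects anything signed outside a trust window, so a stale but validly signed model can't simply be replayed. The window length, payload format, and key distribution are all hypothetical.

```python
# Sketch: verify a model artifact's signature and freshness before serving.
# Window length, file paths, and key distribution are hypothetical.
import hashlib, time
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

MAX_AGE_SECONDS = 7 * 24 * 3600  # reject signatures older than a week

def admit_model(model_path: str, signed_at: float, signature: bytes,
                public_key: Ed25519PublicKey) -> None:
    if time.time() - signed_at > MAX_AGE_SECONDS:
        raise RuntimeError("stale signature: possible replay / freshness attack")
    digest = hashlib.sha256(open(model_path, "rb").read()).hexdigest()
    payload = f"{digest}:{signed_at}".encode()  # hash and timestamp signed together
    try:
        public_key.verify(signature, payload)
    except InvalidSignature:
        raise RuntimeError("signature mismatch: model may have been tampered with")
```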
You'd also want, of course, intrusion detection as the last line of defense, along with preventative and detective controls, some automatic remediation on top of what Kubernetes can do by default with cloud native application platforms, and, probably the most difficult thing in the entire talk, input validation and sanitization. It looks like, because AI models, or large language models, are built to emulate humans, and humans can be conned by con artists, we're in this catch-22 where we might not be able to actually prevent people from messing with prompt injection at all. So, the TBD in the top right-hand corner: I would scope that outside of the threat model, for capacity reasons.

Right, so that is the high level of how things are terrifying, and there are ways to fix bits and bobs. Kubeflow is a reference implementation of an AI pipeline tool belt. It's multifaceted; it can do things other than TensorFlow. It's a significant project, and so warrants a threat model itself, so we need to scope it down, or even better, do it at a higher level of abstraction. This is a data flow diagram, as we saw earlier, for a Kubeflow TensorFlow reference architecture. We won't go into too much more detail, but the pipeline components here are the self-contained sets of code that perform one step in an ML workflow, such as pre-processing data, training the model, et cetera. We must build the component implementations that describe the specifications; the component implementation includes the executable code and the container image it runs in. There are also interactive environments for writing and running the code, creating visualizations, and narrative formats, widely used for data analysis, model development, and research.

And the call to action: if you are interested in any of this stuff, TAG Security is threat modeling Kubeflow as one of the next pieces of work we do in the community. We're currently doing the vSphere CSI driver; we've recently done Flux and Argo CD. One of the greatest pieces of work we have is the SPIFFE/SPIRE threat model that we've done before. This is what we do for the CNCF, in terms of giving the Technical Oversight Committee a security and assurance based view on the projects that come through. This is a reasonably significant piece of work that we'll be doing. Obviously we've had the immense pleasure of doing some preparatory threat modeling, and Kubeflow have just stood up a security team to interact with us as well. So yes, exciting times, if this kind of thing floats your boat.

And there is a non-exhaustive list of further threats that the system may be vulnerable to. This is what we do for a living; if you would like to talk to us about this, please do find us afterwards, or on all the various things that exist. So: threat model everything. We can't quantify our security controls if we don't understand what we're doing, and unquantified security controls lead to systems that end up as constrictive straitjackets, where people in maintenance mode are unable to make a single change to the system, because some disembodied security team before them implemented controls without a paper trail, without quantified, qualified controls. Threat modeling is a glorious way of preventing that particular organizational anti-pattern. Common threats still apply to AI/ML; there's not very much here that you wouldn't expect. There are some slightly wider pieces about the massive scale that these things work at, but ultimately they're still computers, and it's still supply chain security.
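As a rough sketch of what those self-contained pipeline components look like in code, assuming the Kubeflow Pipelines (KFP) v2 SDK, with placeholder step bodies and a hypothetical bucket path:

```python
# Sketch of Kubeflow Pipelines components: each step is a self-contained,
# containerized unit of an ML workflow. Uses KFP v2 lightweight Python
# components; step bodies are placeholders and the bucket is hypothetical.
from kfp import dsl

@dsl.component
def preprocess(raw_data: str) -> str:
    # clean / normalize / label the telemetry here
    return raw_data.strip()

@dsl.component
def train(prepared_data: str) -> str:
    # fit the model and return a reference to the serialized artifact
    return "models/anomaly-detector-v1"

@dsl.pipeline(name="satellite-telemetry-training")
def training_pipeline(raw_data: str = "gs://telemetry/raw"):
    prepared = preprocess(raw_data=raw_data)
    train(prepared_data=prepared.output)
```

Each step runs in its own container, which is exactly why the signing and admission-control ideas above apply per component as well as to the pipeline as a whole.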
Configuration is part of the supply chain, and configuration of the model is very much part of the outputs and the efficacy and usefulness of the thing. These threats can be complex; these are some big threat models. We threat modeled Kubernetes back in the day, and this feels slightly more cognitively challenging to just jump into. Of course, collaboration is always required, and keep up with regulations if you want to stay relevant, because we're not really sure what they are yet. And with that, there's a free book there. Thank you very much for your attention.