Welcome, everybody, and welcome to "Kubernetes MLSec: Securing AI in Space". James and I will introduce ourselves momentarily; for now, we would very much like to thank the CNCF for another spectacular KubeCon this year, here in Paris.

Right, I'm Francesco, head of technical solutions for ControlPlane. I started my career in 2010-11 as a security engineer for the European government, deployed in London, looking at system security, network security and data security. Then in 2015 I joined Inmarsat, the satellite mobile provider, where I was chief security engineer for the satellite control center. I then changed position to head of security operations engineering, so my team looked after the technology stack enabling 24/7 security operations for the company. After that I moved to head of security engineering, a bigger team; my people would also consult for the internal business units on their crazy projects. And then I didn't want to see a data center ever again in my life, so I joined ControlPlane. We are a cloud native security consultancy, as we will explain later. James, over to you.

Good morning, everyone. My name is James Callahan, and I'm a principal consultant at ControlPlane. I have an unusual path into tech: I started my career as a theoretical particle physicist. However, about ten years ago I got interested in cyber security; I saw how much bad code was out there, especially code written by physicists, myself included, and that drove me into cyber security. I worked for ten years with government as a cyber security consultant and engineer, and I moved to ControlPlane when I became interested in Kubernetes and container security. I am the author of a couple of courses for the Linux Foundation on zero trust, and I do some training for O'Reilly, but my day-to-day is as a consultant.

Very well. So, ControlPlane, quickly: we are, as I said, a cloud native security consultancy, established in 2017, based in London, but operating globally. We are really specialists in cloud security, Kubernetes, and container security in general. Our customer base includes governments and financial services, and in general we like to work with regulated industries, for reasons that will become clear later in the talk. We have over 40-50 people across the continents, as I said. We do everything to do with Kubernetes, container and cloud security: zero trust architectures, DevOps and DevSecOps, infrastructure and application delivery. We also fill the gap between cloud infrastructure, cloud architectures and security operations, ensuring visibility and response capabilities for organizations; we harden SDLCs and supply chains; and then we pen test all of the above. Just a quick shout-out to our community contributions: of course, we are a member of the Linux Foundation.
We are part of TAG Security, the security Technical Advisory Group, and our CEO, Andy Martin, is a co-chair for it. We are also a silver member of the CNCF, and we are silver members of FINOS, the Fintech Open Source Foundation. Our CEO Andy Martin is also the pro bono CISO for OpenUK. Over to you.

Okay, so let's talk about our topic for today. We're going to start with a general introduction to the AI ecosystem and talk briefly about different levels and types of artificial intelligence. We'll dive into a few use cases and then focus on the topic for today's talk, which will be machine learning, and in particular the machine learning life cycle. Based on this we will try to gain an understanding of the problem space of securing such a life cycle, which is, as we will see, a non-trivial task, and we'll see lots of parallels with supply chain security. We'll then get to the most interesting part of the talk, where we'll actually present a threat model for a real-life use case in the SATCOM domain, and walk through a threat modeling methodology to navigate the MLSec problem space and identify risks and security countermeasures to mitigate the most common issues that arise.

All right, so let's start with a very high-level introduction to the different types and levels of AI. The current and most basic level is what the experts call artificial narrow intelligence, or weak AI. It's developed and designed to carry out a finite number of tasks, such as reactively answering humans' questions, and can only operate on a predetermined set of data. Examples would be Siri, Google Assistant and the infamous ChatGPT, a student's best friend, of course. The intermediate level of AI would be artificial general intelligence, or strong AI. Now, this is aspirational; this is for the next decades, and at this point it is purely theoretical. Fictional examples would be HAL 9000, Data in Star Trek, or Skynet in Terminator. The major difference here is that this is human-like AI and, more importantly, it can adapt to new situations rather than operating, as before, on a predetermined set of data. Finally, we have the truly hypothetical, academic, dreamy stuff: artificial super intelligence. This is where the AI possesses superhuman capabilities, and it would be characterized by continuous self-improvement. Examples here would be Ultron from the Marvel universe or Giskard from Asimov's novels.

Okay, so AI, as we can see and as we probably know already, is a large domain with a number of branches. Here is an overview, a kind of spider diagram, of these domains, and we will show you where today's talk sits within it: we're looking at machine learning. Machine learning is a branch of artificial intelligence that focuses on the mathematical formulae and statistical models that computer systems employ to carry out operations without being explicitly programmed. Instead, they can use ML to take what they have learned from the present context and generalize it to new tasks, modifying their programs automatically. Within the ML branch we have three different types, and here you can see a kind of use case diagram with these three types and the use cases around the edge. First of all we have unsupervised learning. This analyzes data without predefined labels, identifying inherent patterns, clusters or structures within the data.
It's used for tasks like clustering or dimensionality reduction. Reinforcement learning, at the bottom of the diagram, involves an agent that learns by interacting with an environment, receiving rewards or punishments for its actions. The agent's objective is to maximize cumulative rewards, making it suitable for things like learning to play games, or robotic control. Supervised learning utilizes labeled data (this is what we'll be focusing on later) to train a model to make predictions or classifications based on input features. The model learns to map inputs to outputs, making it useful for tasks like regression or classification, and Francesco will go into some more detail on this later on.

So let's focus on these two use cases within the supervised learning branch. First of all we have classification: the process of assigning a category to an input data sample. Example uses would be predicting whether a person is ill, detecting fraudulent transactions given a transaction history, face classification, things like that. Regression is the process of predicting a continuous numerical value for an input data sample. Example usages here would be assessing house prices given market conditions, forecasting grocery store demand, temperature forecasting, things like this. In supervised learning the model optimizes internal parameters based on the input labels to discover patterns and relationships in the training data. These parameters control the transformation and mapping of the input data to the output labels, and once trained the model can use fresh input data to predict or categorize objects based on previously discovered patterns.
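As a quick illustration of those two supervised-learning use cases, here is a minimal sketch using scikit-learn; the data is randomly generated and purely illustrative.

```python
# pip install scikit-learn numpy
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # 200 samples, 4 input features

# Classification: assign a category (e.g. fraudulent vs. legitimate transaction).
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)
print(clf.predict(X[:2]))                # predicted categories for two samples

# Regression: predict a continuous value (e.g. a house price).
y_reg = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
reg = LinearRegression().fit(X, y_reg)
print(reg.predict(X[:2]))                # predicted numerical values
```

In both cases the model fits internal parameters to labeled examples and then maps fresh inputs to outputs, which is exactly the behavior an attacker will later try to degrade or subvert.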
With this, over to Francesco for a bit more detail. Very well. So now we have a high-level understanding of AI in general, but to really appreciate what could go wrong (and that is what we do at ControlPlane: we threat model everything), what can go wrong when you build an AI/ML platform? It's important we discuss the overall life cycle associated with such platforms, from the data being ingested, through the model training, all the way down to producing applications for downstream customers or other applications to consume the trained model. To do that, we will draw an interesting parallel with something that most of you are hopefully familiar with.

Let's talk about the AI/MLOps life cycle, and to begin we'll start with something absolutely groundbreaking, something no one has ever heard about: DevOps. So let's start with the good old DevOps loop, part of the SDLC. I'm pretty sure everyone is familiar with the DevOps concepts and the DevOps loop, which consists of a continuous process that integrates software development and, guess what, IT ops, to streamline and automate software delivery. I'm not going to go through each phase, but one of the key elements of DevOps is really to foster collaboration: to make sure people are on the same page and different teams work together in a cohesive manner towards an objective, which is to deploy applications fast and in a streamlined way. It has this human element to it, which is very important. Why? Because exactly the same principles were applied to another loop, the data/ML loop. Along the same lines, this loop was introduced to define the continuous, end-to-end process from gathering data to trained ML models, bringing data scientists, data engineers and ML engineers together to actually work with each other. Here we have the exact same loop, but defined for data and ML: from DevOps to data/ML, a continuous process to integrate data operations with machine learning development, to streamline the deployment and management of machine learning models.

However, this only covers data and model training. How do we then produce applications based on that? Well, guess what: someone put the two loops together and came up with this, the data/ML/DevOps life cycle, aka the eternal knot. This really converges the two loops into one, and it is fundamentally a paradigm for building reliable and scalable MLOps, AI-based systems and applications. It shows you the end to end: from data being ingested, to models being trained, to applications being developed that then consume these models, for downstream customers and, as I said, other applications as well. However, here comes the issue. We spent a long, long time, and we are still spending a lot of time, securing DevOps: shift left, introducing DevSecOps, all that good stuff. But what do we do for the other loop? Now they are joined together, and the security of one can influence the security of the other. So what do we have to do? Effectively the exact same thing: we have to focus on securing that loop too, to achieve end-to-end security for the eternal knot, aka introducing the concept of MLSec or, as some other people say, MLSecOps. And this is the challenge.

All right, so we need to understand how we're going to build security into every stage of this set of processes that Francesco is alluding to; we need to understand the full problem space we are dealing with. As usual with new, cool technologies we have more complexity and more layers, so let's break things down and try to understand at a conceptual level what these layers are. At the bottom, obviously, we need infrastructure; there have been tens, hundreds of talks at this conference on securing infrastructure.
So we're not going to focus too much on that today. We of course have an orchestration layer on top, and we actually need it to define and execute all of the operations on data and on the ML model itself. Data is at the heart of ML, obviously, so the data layer is crucial when you think about collection, I/O, labeling and the data sets themselves; securing the data layer is foundational and fundamental. Even more important is the model itself: it's crucial here to secure the operations when training the model. And then finally we're going to have apps leveraging our model, and these will be entry points, so we clearly need to define our security countermeasures for these apps as well. However, we're not going to focus much on the application side here; application infrastructure we will slightly de-scope and only touch on tangentially.

Okay, so we won't go through this quite busy diagram end to end, because we'll cover these things more slowly as we go through the next sections of the talk, but this is a detailed view of the MLSec problem space and its different stages. The important thing to know (the diagram is here and you can download the slides) is that it's complex, so let's start breaking it down into more digestible chunks. Overall, the main challenges when securing MLOps have to do with complexity, dependencies and data flows. It's a multi-dimensional problem which encompasses, among many other things, data security, model security, access control and supply chain security. Okay, with that, enough of the theory; over to Francesco.

Yes. So, as we always advocate at ControlPlane (apologies, this keeps dropping every time), when we face a large problem space there is only one technique to navigate it, which is threat modeling. So let's jump into the interesting stuff, and we'll do it through a use case, so you can enjoy some pictures. To show effective threat modeling, we start from this use case. In this picture you can see a number of spacecraft. They are very expensive pieces of hardware, very large as well, flying up there in the sky, and they send down a constant telemetry stream. Telemetry comes with thousands of parameters, sensor temperatures, all sorts of stuff. These telemetry streams reach earth, they are converted from RF into IP communication, and the data flows to centralized locations where it is collected for analysis. Why this troublesome approach? Because telemetry is very important for one thing: it tells you when things break. So you have people keeping an eye on systems who can tell you if something breaks up there, and then, based on what breaks, you can take some actions to make sure the whole thing doesn't explode. So the initial approach was a very reactive approach to anomaly prediction: based on the historical data, based on what we know, and based on a very long list of if-then-else statements, they tried to understand whether something was going to break and then take appropriate actions. A very reactive and ad hoc approach.
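To make that reactive approach concrete, it boils down to hand-maintained checks of roughly this shape; a minimal sketch, where every parameter name and threshold is hypothetical rather than a real spacecraft limit.

```python
# Reactive, rule-based telemetry checking: a long list of if-then-else statements
# that only fire once a value has already gone out of bounds.
def check_telemetry(sample: dict) -> list[str]:
    alerts = []
    if sample.get("battery_temp_c", 0) > 60:          # hypothetical threshold
        alerts.append("battery temperature above limit")
    if sample.get("reaction_wheel_rpm", 0) > 6000:     # hypothetical threshold
        alerts.append("reaction wheel overspeed")
    if sample.get("bus_voltage_v", 28.0) < 24.0:       # hypothetical threshold
        alerts.append("bus undervoltage")
    # ...hundreds more hand-written rules, maintained over the years
    return alerts

print(check_telemetry({"battery_temp_c": 63, "bus_voltage_v": 27.5}))
```

Rules like these can only describe failures that someone has already anticipated, which is exactly the limitation the proactive, ML-based approach is meant to address.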
Company X, the team behind this whole thing, asked: can we do something better? And the answer was yes, of course, and they kicked off the AI project. The project consisted of moving from a reactive approach to a more proactive one: we have all these petabytes of data, so can we make better use of them? This is the stack they came up with. The infra layer doesn't really matter at this stage; then a proper operating system; then on top, as part of the modernization, they chose a workload orchestrator you can probably guess, this being KubeCon; on top of that they built the MLOps layer on Kubeflow, extended with TensorFlow; and on top of that they built the apps. To do what? Again, to predict when something could go wrong, based on the data and on the current situation.

Of course, Company X is very, very security conscious, the reason being that they have very sensitive customers and they have to be able to demonstrate security in a quantitative way. So Company X chose to run a threat model of this entire infrastructure and of the way they were doing MLOps on that infrastructure as well. The assumptions for the threat model: we are dealing with a sophisticated threat actor with significant capabilities, so not the famous script kiddie with a hoodie (no disrespect, I was one of them). ML was used only to inform about what could potentially go wrong, not to actually fix anything; humans were still involved there. And we de-scoped the over-the-air communications, the ground segment backbone, the data center infrastructure, the operating system layer, and Kubernetes to a degree.

Now, the next picture is going to be very busy, but we will walk you through it. This is the reference architecture of the system put together. Again, we are at a fairly high level of abstraction in threat modeling; we start at level zero, and this is the data flow diagram. What you can see here are effectively the layers. The scope of the threat model is within the red dotted line, and as you can see there are humans accessing the infrastructure and data being moved around, and then we have different stages. The MLOps layer is where you define pipelines to design the way data will be analyzed and how. Then, on the data, model and lifecycle layer, you have three stages. After the data ingest you have the data preparation stage: that's when you clean data, normalize data, transform the data as needed, and then label the data for the training process itself. Then the train and tune stage: data loading, weighting (assigning different weights to different data within the data set), tuning the hyperparameters for the training process, the training process itself, and then evaluation. And then the last stage, deploy and monitor: now you have a trained model, you want to serialize it, aka convert it into files, so that it can then run on a runtime, in this particular case Kubernetes and Kubeflow. Everything you see here is actually containerized workloads, spoiler alert. Then the model is tested, it is deployed into production, and it serves applications and downstream customers through a front end, and this whole thing is monitored. Now let's deep dive into the first stage, data ingest and preparation, going one level down in abstraction.
We are now at level one of threat modeling. The data flow diagram is here: as you can see, you still have humans, you have different steps, and you have data stores, in this case an ingestion storage and a pipeline storage. This is how it goes: data sources can be anything; data ingest is the process of collecting this data and storing it in the ingestion storage, which sits outside of the pipeline. Then a containerized workload, in this case the clean job, loads data from the ingestion storage, does what it is supposed to do, and stores the data in the pipeline storage. In these three initial steps you can already see how many entities are involved and how much data is moving around; this is already determining our attack surface, by the way. Then the normalize step: again, loading from and storing to the pipeline storage. Transform: same thing. And then label: same thing. We also have trust boundaries here, and where data goes across trust boundaries is where we have to focus.

So, in threat modeling, what can go wrong? Based on this, we start enumerating threats. And again (this is also one of the key takeaways of the talk) this is not necessarily about AI security; these are things you need to do for everything, because threat modeling applies to everything, and the threats here are also non-AI-specific. Like data injection, like poisoning of data, like compromising the data quality in the ingestion storage. Or a malicious actor can target the clean step: they can compromise the clean job itself, and maybe clean out data that is critical for the training process; and we are talking about petabytes of data, so it's hard to find out whether they actually targeted specific data and what data they removed. The same injection and poisoning of data can happen at the pipeline storage level, and then, in the label task, malicious actors could potentially tamper with the labeling process itself, causing data to be labeled wrongly.

Threat modeling: what can go wrong, and then the next step, what do we do about it? Based on the threats we enumerated, and remembering that we are in the Kubernetes and Kubeflow world, we can identify controls. Controls such as digitally signing data (again, selectively: it's petabytes, so we have to be careful about what we actually want to sign), introducing RBAC on storage operations (these are very common controls; the implementation may differ, but they are simply good old controls), and network segmentation for the storage. Then, back to the individual containerized workloads: verify the signatures of every image being loaded, digitally sign the data being sent to the pipeline storage, apply RBAC on the tools and on the infrastructure itself, and eventually manual reviews. For the pipeline storage it's pretty much the same, but the pipeline storage is actually within Kubernetes, so you can introduce stringent role-based access control on Kubernetes storage operations, you can introduce secured manifests, and you can also introduce admission control or OPA to prevent unauthorized volume mounts.
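As an illustration of the "digitally sign data crossing a trust boundary" control, here is a minimal sketch using a keyed HMAC over each file: the ingest side publishes a manifest of authenticated digests, and the clean job verifies it before touching anything. The file layout and key handling are hypothetical; a real deployment would fetch the key from a secrets manager, and would likely use asymmetric signatures so the verifying side never holds the signing key.

```python
import hashlib, hmac, json, pathlib

KEY = b"replace-with-a-key-from-your-secrets-manager"  # hypothetical, for the sketch only

def sign_dataset(data_dir: str, manifest_path: str) -> None:
    """Ingest side: record an HMAC-SHA256 per data file in a manifest."""
    digests = {}
    for f in sorted(pathlib.Path(data_dir).glob("*.csv")):
        digests[f.name] = hmac.new(KEY, f.read_bytes(), hashlib.sha256).hexdigest()
    pathlib.Path(manifest_path).write_text(json.dumps(digests, indent=2))

def verify_dataset(data_dir: str, manifest_path: str) -> None:
    """Clean job: refuse to process any file whose HMAC does not match the manifest."""
    digests = json.loads(pathlib.Path(manifest_path).read_text())
    for name, expected in digests.items():
        data = (pathlib.Path(data_dir) / name).read_bytes()
        actual = hmac.new(KEY, data, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(actual, expected):
            raise RuntimeError(f"integrity check failed for {name}")
```

The same pattern repeats at each hop of the pipeline: whoever writes data across a trust boundary signs it, and whoever reads it verifies before use.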
Next stage: training and tuning. This is slightly different, in the sense that we also have a bunch of artifacts in version control repositories. These are things like the training jobs (the way the training job is described) or the tuning jobs, and they are files manipulated by data scientists, so they live in a version control system. Otherwise it's pretty much the same: each step runs as a containerized workload, and each containerized workload loads and stores data from and to storage. Again: weighting, training, hyperparameter tuning and evaluation.

Threat modeling: what can go wrong? The threat actor could target the version control system, so they could effectively be stealing keys or whatever; they can manipulate the training jobs themselves; they can manipulate the tuning jobs; they can compromise the model itself by tampering with the gradients; they can cause model degradation due to poisoned hyperparameters (I had to learn all of that, by the way; it's not straightforward); they can bypass the pipeline and inject specific workloads at runtime; they can again cause model degradation because of poisoned training data; plus the usual threats to storage, as well as model compromise via parameter tampering.

Threat modeling, next step: after we understand what can go wrong and enumerate the threats, what do we do about it? Identify controls. And again, these are non-AI-specific: this is about securing AI end to end, but the controls themselves are not AI-specific. Multi-factor authentication on the version control systems, signed commits, two-person approval principles for impactful changes to files; then signing of images, verifying image signatures, and signing the outputs. At the storage level, as before: RBAC on storage operations, secured manifests, admission control to prevent unauthorized volume mounts, and then access control on the parameter server, which was something quite interesting to do. Over to you.

Okay, so now we have a trained model; let's start thinking about deploying and monitoring it. Serialization is the crucial first step here, where we convert a trained model into files, pack them into a container image, and load it whenever we need to run it. So we have this serialization step and we then load the model; we have an image registry where images are stored; we load the model for testing and deployment; and we then serve it to be consumed by front ends. We're not going to see much new here; we'll take the same approach Francesco has been taking throughout the first parts of this threat modeling exercise, asking what can go wrong, and we'll see some common themes. We could have something going wrong in the serialization step, where we actually tamper with the model itself. We could tamper directly with the image registry. Now, tampering in storage is one thing, but loading it is another; storing a malicious image and then loading it are two separate attack vectors, and we have to think about protecting against both of them.
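One way to treat those as genuinely separate vectors is to verify the serialized model artifact again at load time, regardless of how much the registry is trusted. Here is a minimal sketch of that idea with an Ed25519 detached signature, assuming the third-party cryptography package; in practice you would more likely sign the container image itself with a tool such as cosign and enforce verification at admission, but the principle is the same.

```python
# pip install cryptography
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Build/serialization side: sign the serialized model bytes with a protected CI key.
private_key = Ed25519PrivateKey.generate()       # stands in for a key held by the build system
public_key = private_key.public_key()            # distributed to the serving runtime

model_bytes = b"...serialized model file contents..."   # in reality, read from disk
signature = private_key.sign(model_bytes)

# Serving side: refuse to deserialize a model whose signature does not verify.
def load_model_verified(blob: bytes, sig: bytes) -> bytes:
    try:
        public_key.verify(sig, blob)
    except InvalidSignature:
        raise RuntimeError("model artifact failed signature verification")
    return blob   # only now hand the bytes to the actual deserializer

load_model_verified(model_bytes, signature)
```

The signing step covers tampering in the registry; the verification step covers tampering anywhere on the path between the registry and whatever actually loads the model.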
In terms of test data, attackers can tamper with it in order to try and cover their tracks, and when the model is being served they can try to feed it malicious input. When it comes to the front end we have our usual OWASP Top 10-type threats, which are ever present. So again, nothing groundbreaking here.

Moving on to the next step, what can we do about these things that can go wrong? We will again not see anything groundbreaking; we're going to talk about signing artifacts a lot. Now, signing an artifact doesn't mean we're perfectly secure. All it means is that an entity with an identity that is trusted liked this artifact at some point in time, for some definition of liking that artifact. So throughout the talk, you're seeing we're taking a zero trust approach: we're shrinking trust boundaries down and trying to define explicit trust relationships. I always think zero trust is a little bit of a misnomer; it really means don't implicitly trust things. It does mean explicitly trust things and make informed authorization decisions based on those explicit trust relationships. So signing data is of course going to be key, but signing alone is not enough: we have to enforce it. We've done one step, we've got some signed data, and now we need to enforce that it is actually that thing that gets run, so here we will have admission control at deployment time, for example. Again, we're not seeing any groundbreaking new controls; all we're doing is mapping threats to controls, for compliance purposes as well. On the front-end side we have the usual input validation and so on, and we have request throttling. And when it comes to test data, we want to sign that as well, so that we can ensure an attacker cannot tamper with it.
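For those front-end controls, here is a minimal sketch of what input validation plus request throttling can look like in front of a model-serving endpoint; the field names, limits and rates are all hypothetical.

```python
import time

ALLOWED_FIELDS = {"satellite_id": str, "telemetry_window": list}   # hypothetical request schema

def validate_request(payload: dict) -> dict:
    """Reject anything that does not match the expected shape before it reaches the model."""
    if set(payload) != set(ALLOWED_FIELDS):
        raise ValueError("unexpected or missing fields")
    for field, expected_type in ALLOWED_FIELDS.items():
        if not isinstance(payload[field], expected_type):
            raise ValueError(f"{field} has the wrong type")
    if len(payload["telemetry_window"]) > 10_000:
        raise ValueError("telemetry window too large")
    return payload

class TokenBucket:
    """Per-client throttle: on average `rate` requests per second, bursts up to `capacity`."""
    def __init__(self, rate: float = 5.0, capacity: float = 10.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Neither check is AI-specific, which is the point: the serving front end is just another exposed API and inherits all the usual web application controls.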
Alright, so everything we've looked at so far has been quite generic and applicable across many different implementations. Kubeflow itself warrants a threat model, and we need to do this and scope it. Here is a spider diagram of some of the concepts within Kubeflow; we won't go into it in great detail because I know we're a bit short on time. It's important to note that we are actually threat modeling Kubeflow (and by "we" I mean Andy Martin and friends in TAG Security): Kubeflow is currently on the agenda for a threat model, so we welcome contributions. You can have a look at the GitHub issue here and track how things are going. There is a Google Doc which is publicly available; you'll be able to see it and see what's going on. So that is actively in progress at the moment, and if you're interested please come and talk to us at the end and we'll tell you how you can get involved.

Alright, so, again being a bit short on time, here is an example data flow diagram at level zero for Kubeflow. We're not going to go through this in massive detail, but basically pipeline components are containerized, self-contained sets of code that perform one step in your ML workflow, such as pre-processing data or training a model. To create a component you must build the component's implementation and define the component specification; the implementation includes the component's executable code and the container image it will be contained within. Interactive environments for writing and running code, documenting work and creating visualizations in a narrative format are also things we need to take into account; they are widely used for data analysis, model development and research. So again, please have a look into Kubeflow and join us in threat modeling it.
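To give a feel for what those pipeline components look like in practice, here is a minimal sketch written against the Kubeflow Pipelines SDK (kfp v2); the component names and bodies are purely illustrative placeholders.

```python
# pip install kfp   (Kubeflow Pipelines SDK v2; a sketch only, the component logic is illustrative)
from kfp import dsl, compiler

@dsl.component(base_image="python:3.11")
def clean_data(raw_rows: int) -> int:
    # Placeholder for the real cleaning/normalization step.
    return max(raw_rows - 10, 0)

@dsl.component(base_image="python:3.11")
def train_model(clean_rows: int) -> str:
    # Placeholder for the real training step.
    return f"model trained on {clean_rows} rows"

@dsl.pipeline(name="telemetry-anomaly-pipeline")
def pipeline(raw_rows: int = 1000):
    cleaned = clean_data(raw_rows=raw_rows)
    train_model(clean_rows=cleaned.output)

# Compile to a pipeline definition that a Kubeflow Pipelines instance can run.
compiler.Compiler().compile(pipeline, package_path="pipeline.yaml")
```

Each decorated function becomes one containerized step, which is exactly the unit the threat model reasons about: the image it runs in, the data it loads and stores, and the identity it runs under.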
And with that, that's it from me. Alright, thanks James. As we head towards the end: those were fairly generic threats and fairly generic controls that everyone can understand, but there are also AI-specific threats, and we are bootstrapping a piece of work with King's College to actually do some red teaming for AI. Anyway, this is a breakdown of the categories of the more AI-specific threats. We have model and system threats; I'll just pick one, like model inversion, which is about techniques to extract the data the model was trained on, just by having the opportunity to interact with the model itself. Output manipulation threats, like executable code injection, which consists of manipulating the model to output executable code for malicious purposes. Access and privacy, which is getting bigger and bigger: for example data privacy violation, the inappropriate use or storage of personal or sensitive data, with all the GDPR implications; in this case it's about compliance and governance. Random and unusual attacks, like model subversion: injecting a bias or other undesired traits into the model, undetectably; that's pretty advanced stuff. People and organization: governance lapses, inadequate oversight and procedures for maintaining security, back to the more general threats. And then monitoring and response: inadequate anomaly detection, failing to detect abnormal model behavior or security incidents in real time; we are still struggling to understand how to detect and react to threats in cloud native, let alone in the AI space.

Key takeaways. When you build something, threat model everything. Adapt to and integrate the different teams; again, threat modeling, like DevOps, is a practice to bring everyone to the same table: the experts, data scientists, ML engineers and security folks. Common threats, as you can see in the key takeaways, still apply to AI/ML-based applications, while AI-specific threats, on the other hand, can be quite complex, and as usual a collaborative approach is required. Join and contribute to TAG Security if you can, and to threat modeling Kubeflow. And a word of advice: keep up with the regulatory landscape, because it's ever evolving and it's not going to go away anytime soon. That said, if you wish, we have a free copy of the book Hacking Kubernetes here, from our CEO Andy Martin; you can download it by scanning the QR code. Thank you very much. We don't have time for questions, but we are going to stay around for some time, so if you want to come and ask anything at all, please do. Thank you again.