Awesome, awesome. Yeah, thank you so much, Ed. Really great to kick off the conference, and it is a huge pleasure to be doing the opening keynote for Kubernetes AI Day. I think there's going to be a really great, really exciting lineup coming up. This talk in particular has a lot of content to go through, and it's not just a lot of content, it's very dense content. A lot of the single slides we're going to be covering today come from other talks that I have given, which, as you would imagine, are one-hour-plus content. So what I do encourage everyone in this room to do, if you're interested in any of the areas we're covering today, is to check out the further resources and find out a bit more. More importantly, a lot of the things we're going to be covering today are not set-in-stone, ultimate knowledge; they're still being explored, and these really are areas in the machine learning and MLOps space that need more brains to get together and try to figure things out. That will be, as you will see, one of the recurring call-outs throughout the presentation.

So, a little bit about myself. As Ed mentioned in the introduction, I am an Engineering Director at Seldon Technologies, a company that focuses on machine learning deployment and monitoring; we maintain one of the most popular Kubernetes-based deployment tools for machine learning. I'm Chief Scientist at the Institute for Ethical AI, a research centre based in the UK that focuses on developing frameworks to ensure the responsible development and operation of machine learning systems. I'm also a Member-at-Large of the governing council of the ACM, the Association for Computing Machinery.

Today we're going to be covering a few areas surrounding the state of cloud-native production machine learning and MLOps. We're going to talk about motivations, why we should care, and some of the challenges that exist in the space. We're going to delve into some trends across the industry and the domain, things we have actually seen converge in different areas. We're going to talk about some technological trends that are quite relevant to both the Kubernetes and the MLOps space. And then we're going to talk a little bit more about the organizational trends: what have we seen in terms of organizations when they look into building MLOps capabilities at scale, how do they bring that into their organization, what are the roles involved, and what are the ratios between those roles? And finally we're going to give some wrap-up words.

So let's get started with the motivation and challenges. One thing I don't have to repeat anymore, because I think everybody here will agree, is that contrary to previously popular belief, the lifecycle of a machine learning model does not end once it's trained. If anything, it begins once it's finally trained and put into production. Once a model is in production, that is when value starts getting extracted from it, and that is also when it ends up facing real-world challenges: things like data divergence, concept drift, and potential requirements for addressing domain-specific challenges, whether those are ethics-related or risk-related.
So there are a lot of considerations that need to be put in place, not just for the ability to achieve a robust productionization of machine learning, but for continuous capabilities that are automated, not for a single model but at scale, with hundreds or thousands of machine learning models, each of them potentially with advanced monitoring requirements.

So why is production machine learning so challenging? From a technical perspective, we have seen a lot of considerations come up, things that go beyond traditional microservice architectures. Some machine learning services may require specialized hardware, things like GPUs, TPUs, and so on. There are complex data flows: you don't deploy a single component, you have multiple components, and if something fails, it affects things downstream and has implications upstream. Those complex data flows have to be considered alongside versioning and reproducibility constraints. If something goes wrong and you want to verify something that happened perhaps last week or last year, you want to be able to reproduce that exact experiment, for diagnostic purposes but also for portability. And finally, there are compliance requirements when it comes to the use of machine learning, because you have a close interaction with the domain you are acting upon. Often the effects of deploying and rolling out machine learning technology can even be generational: the impact an incorrect prediction could have on someone's life can affect not just that individual, but that individual across generations.

That leads into the second part, not just the technological challenges but also the higher-level challenges. With machine learning we run into considerations you may have heard of, such as algorithmic bias and misuse of personal data, but also challenges from the traditional software world, like outages. What if you are running a machine learning service that powers critical infrastructure and that service goes down? What are the considerations for that? Similarly, there is the challenge of security that you face in the traditional software space, which now has to be brought into the machine learning space. And one thing we have to remember is that the impact of a bad solution can be even worse than no solution at all. If you're curious about that, there are other resources that go a little deeper into that area.

There are also considerations around hiring. Often, when organizations are starting their journey into machine learning and MLOps, they try to hire for these capabilities with a job description for a unicorn: somebody with a double PhD, ten years of Kubernetes experience, domain expertise, and tons of experience in software development, for the salary of an intern. Where are you going to find those unicorns? We're now seeing a convergence towards the realization that you cannot get a single person who is capable of doing everything; instead you have a segregation of roles, which we're going to cover at the end. And that goes beyond the technical domain.
It's not just about the technology, it's about the domain, the use case, and the abstraction of the domain capabilities, making sure that, similar to the trends we're seeing, you're addressing each use case in a way that is proportionate to the risk involved. You don't need to bring in the same level of expertise if you're building a prototype for a group of stakeholders as you do if you're rolling out a large-scale machine learning service that will affect hundreds or thousands of individuals with high risk. So those are the high-level motivational challenges.

Now let's start delving into some of the trends, the things we're actually seeing in the field. The first one is what we can call the consolidation of practical AI ethics. One of the realizations is that people have come to the conclusion that we can have all of the roundtables we want, we can sit and agree that discrimination is bad and that doing harm to humans is bad, but if the underlying infrastructure is not built by design to integrate and deliver those higher-level principles, you're going to end up not being able to achieve those responsible AI requirements. So what we're now seeing in industry is a consolidation: of course, principles are important to provide that north star, but it is not enough to have every single tech company publishing their principles for AI ethics. What is now required are the lower-level pieces, like industry standards and regulatory frameworks, and, even more critically at the lowest level, the software frameworks and platforms, whether they are CNCF tools, Linux Foundation tools, open-source or closed-source tools. Those have to be built with the principles by design, and they have to encompass those higher-level requirements in order to be enforceable and scalable, not just for a single use case. So that's an interesting area, and we did an exploration of this that was presented at the NeurIPS conference, so if you want to dig deeper you can have a look there.

Similarly, it's not just about the tools. One thing people are realizing is that it's also about accountability structures, and we're going to delve a little bit deeper into that. But to grasp the idea: high-impact, large ethical challenges cannot fall on the shoulders of a single data scientist or a single developer. It is important to make sure you have the right human touchpoints throughout the design, development, and operation of machine learning systems at scale, similar to the trend in the general software space where controls were introduced at an organizational level in the form of the software development lifecycle: what are the steps that you have to carry out? We're going to talk a little bit about how we can extrapolate this into a machine learning development lifecycle. Here you can see that, of course, it's important for an individual practitioner to adopt best practices, to use the most relevant tools, and to have competencies in the field. But one thing to remember is that an ethical individual does not make an ethical company or an ethical product. Think of the high-profile articles you have seen in the news where technology has been rolled out and has had significant negative impact.
It wasn't that every single data scientist was thinking, "oh yes, I'm going to be evil today and I'm going to build this discriminatory, bad machine learning model," right? Ultimately these are people who may have the right intentions but end up with undesired effects. So that's where you have to go up a level: the team and the delivery process have to be in place and proportionate to the risks, and then a level higher still, the department and the organizational structure need to be in place. Of course, if we go up yet another level, we're talking about regulation. One thing that has been quite interesting is that you often hear that regulation is playing catch-up, but recently we have actually seen tech companies playing catch-up to regulation, and we have seen some really interesting regulatory policy documents published in the European Union to bring in best practices at scale, at a national level. And that's the purpose of regulation, right? Similar to how we have it in other industries, it's there to make sure we put things in place to avoid the entire market consuming itself through capitalistic short-term thinking, and to instead prioritize longevity by enforcing best practices. We saw that in the EU's AI regulation proposal that came out last year; it was really interesting how it tackled this by proposing that best practices be leveraged proportionate to the risk involved. And this year we're seeing another proposal that tries to achieve a similar thing in cyber security: how to ensure that the things we're discussing at this conference, things like supply chain security and common vulnerabilities and exposures registries, are actually in place at the industry level and at the national level. So there are some interesting things to check out there. The other thing to emphasize is that it is not just the policy makers creating these frameworks and proposals; the people in this room, practitioners, and to a certain extent open-source community members and contributors, have a lot of potentially meaningful contributions to bring to the table, because ultimately it will be the frameworks we're discussing at this AI Day that will be impacting society for years to come. It is important to realize that there is now, to a certain extent, a concept of programmatic governance: machine learning and software frameworks will be limited in their reach if they don't adopt principles like visibility, security, compliance, and transparency by design, because without those they cannot be enforced at the higher level. So that's something interesting to consider.

Now let's go to the technological trends. Just to put it into perspective, we all remember the early days of machine learning, how it started, and this is how it's going right now: we have tools coming out every single day, every single week, a brand new shiny tool for MLOps, a new game-changing tool for machine learning monitoring. So the question, as practitioners, is: how do we navigate this highly complex, ever-growing ecosystem?
The interesting thing we're seeing is a convergence towards what we can call an architectural blueprint. Irrespective of which logos you choose to bring into your organization, there is a convergence on what MLOps looks like, anatomically, across every organization. This is a very simple way to put it, and there are other resources we'll link that cover it in more detail; this is more from an intuition perspective. If we look at the data-related areas, we have in this case training data, artifact storage, and inference data. This is oversimplifying massively, but the way to think about it is that the initial step can be seen as experimentation, where the purpose is to convert this training data into useful artifacts, in this context trained models. These trained models are what you want to start getting business value from, so you use programmatic interfaces, through CI/CD or ETL-type systems, to ensure you can continuously deliver those artifacts into a homogeneous interface. Once you have this heterogeneous set of machine learning training frameworks that your data scientists use for their hyperparameter tuning, training, and evaluation, you want a standardized environment where you have control of your production capabilities. In that environment you have machine learning deployment, for real-time or batch serving, and on top of that you can add the observability layer: your advanced monitoring capabilities, your outlier detection, drift detection, and explainability. All of the inputs and outputs flowing through your models become your inference data, and it's important to consider that this inference data differs from your training data in potentially many ways: it could be in different formats, and it may not be labelled, so you won't have the actual labels. You still want a way to communicate that data back into your training data, to create a continuous value-generation capability in your machine learning stack. We will see that there are some hard requirements for creating those links, but this is an intuitive way to see the anatomy of production machine learning.
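To make that anatomy a little more concrete, here is a minimal, purely illustrative sketch of the loop just described; it is not tied to any particular framework, and the artifact path and data here are toy placeholders rather than anything from the talk.

```python
# Toy, self-contained illustration of the production ML anatomy:
# experimentation -> artifact storage -> standardized serving -> inference data -> feedback.
# In a real setup these steps would go through a registry, CI/CD, and a serving platform.
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# 1. Experimentation: turn training data into a trained model artifact.
X_train = np.random.rand(200, 4)
y_train = (X_train[:, 0] > 0.5).astype(int)
model = LogisticRegression().fit(X_train, y_train)

# 2. Artifact storage: persist the trained model (stand-in for a model registry).
joblib.dump(model, "model.joblib")

# 3. Standardized serving environment: load the artifact behind a single interface.
served_model = joblib.load("model.joblib")

# 4. Inference data: capture the inputs and outputs of the live model.
X_live = np.random.rand(50, 4)
predictions = served_model.predict(X_live)
inference_log = list(zip(X_live.tolist(), predictions.tolist()))

# 5. Feedback loop: once labels arrive, inference data can flow back into training data.
#    In practice this step involves schemas, labelling, and data pipelines.
print(f"Captured {len(inference_log)} inference records for potential re-labelling.")
```

The point is not the specific libraries, but that each stage (experimentation, artifact storage, serving, inference capture, feedback) appears as a distinct, repeatable step rather than something bespoke per model.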
Now, of course, there is a key consideration in the context of metadata. With so many of these tools lying across every single part of the machine learning lifecycle, there is a need to be able to ask: what do I have out there? What are the digital assets that are adding risk within my organization, and that are also potentially adding value? How do I discover those things, and how do I search across them? There is an opportunity in the industry there. But again, the key thing is that these architectural footprints are emerging, and organizations are looking to create them so that they can define things from the perspective of: I don't care which logos you put here, as long as we have those controls, those mechanisms, those requirements. Similarly, there has been a conversation about best-in-class, or best-of-breed, as opposed to end-to-end single platforms. Is there going to be a single canonical stack that we can all agree on, consisting of serving framework X, experimentation framework Y, model artifact framework Z? What we are seeing is that it's not a single canonical stack, it's a set of canonical stacks, and there are some really interesting tools out there where you can pick and choose for each of these stages depending on what your needs are. That is what we're seeing: instead of converging on a single canonical stack, you have a set to pick and choose from based on your requirements, although a subset of tools is still becoming more popular than others for particular areas.

We are also seeing a maturing in machine learning monitoring. In the cloud-native space we are used to operational monitoring, monitoring of services: what is my latency, what are my requests per second, what is my number of 500 errors, what are my operational metrics? We are now seeing this extrapolated into the machine learning space, into what we can call machine-learning-specific metrics. These are, of course, in addition to the performance metrics you see in microservices; you can also look at machine learning performance: what is my accuracy, what is my precision, what is my recall, what is my RMSE? These are machine-learning-specific metrics that you want to abstract at scale. If you have heard of the pets-versus-cattle concept from the DevOps space, you don't want each machine learning model to have its own specialized way of being monitored; you want a standardized way to capture those metrics and display them to the relevant domain experts. Similarly, we're seeing drift detection and outlier detection introduced at scale, and the consideration here is that drift and outlier detection tools are themselves machine learning models. So while you are introducing a way to reduce risk when you add an outlier detector, you are also introducing risk, because you're putting another machine learning model in place. Are you going to add an outlier detector for your outlier detector? That is a question you have to consider. And then finally, explainability, not just as a set of techniques but as an infrastructural component: how do you roll out explainability as an architectural paradigm? Those are some considerations to take into account.

That leads into the next piece, observability by design. Now that you have this machine learning monitoring component, how do you move from dashboards to actionable insights that happen without you having to go and look at a dashboard? This is about bringing the best practices of observability from the microservices space into the MLOps space: things like alerting, introducing machine-learning-specific SLOs (service level objectives) and SLIs (what are my indicators?), bringing in concepts like progressive rollouts (do I want to deploy this model as a shadow or a canary, and do I want to automatically roll it out if it achieves this SLO?), and finally being able to drill down into the metrics you're collecting, so that it's not just noise but actually meaningful to your data scientists and data analysts.
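To make the monitoring point a bit more concrete, here is a minimal sketch of computing machine-learning-specific metrics and running a simple drift check over inference data. It is purely illustrative: the threshold, the choice of a KS test, and the toy data are assumptions, and a real setup would push these values into your standard monitoring and alerting stack rather than print them.

```python
# Minimal sketch: ML-specific metrics plus a simple univariate drift check.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Pretend these came from your serving layer's inference logs (toy data).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])

# ML-specific metrics, captured the same way for every model (cattle, not pets).
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
}
print(metrics)

# Simple drift check: compare one feature's training distribution vs its live distribution.
train_feature = np.random.normal(loc=0.0, scale=1.0, size=1000)
live_feature = np.random.normal(loc=0.5, scale=1.0, size=1000)  # deliberately shifted
statistic, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # the alerting threshold is a design choice, shown only as an example
    print(f"Drift suspected (KS statistic={statistic:.3f}, p={p_value:.4f}); raise an alert or trigger retraining.")
```

Note that the drift detector itself is a statistical model with its own failure modes, which is exactly the "detector for your detector" consideration above.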
We're also seeing a trend where we're moving away from model-centric concepts towards data-centric ones. Instead of thinking "how do I deploy this machine learning model", we're now moving into the concept of "how do I deploy this machine learning system". This ties into the data-flow consideration we talked about before: we're no longer seeing a machine learning model as an isolated component, we're seeing it as part of a network, where if something goes wrong it may affect things upstream and downstream. As an example, you can look at how Facebook architecturally defines their search: there are multiple components with multiple machine learning artifacts, and different data flows across each area. You have the indexing side, where they create their indexes from their documents, and then the query-processing side, with retrieval and ranking to produce a result. There is an extremely non-trivial interaction across machine learning components, and if you don't have the things we discussed, observability, monitoring, drift detection, embedded as an infrastructural concern, you're going to really struggle, because you'll be building each of these things as an individual pet, going back to the pets-versus-cattle analogy. So the shift is from model-centric to data-centric, considering the data flows, the databases, and the schemas you're interacting with in each of the different areas.

We're also seeing an intersection with what you may have heard called the data mesh. The data mesh is more popular in the data world, at, say, Spark-type conferences, and it proposes moving away from the central data lake, the central data platform, and the central data team, and instead moving into domain-specific squads: how are you able to provide platform capabilities at a domain level, empower your data analysts, and make sure you have squads of data engineers, machine learning engineers, and MLOps engineers acting on that domain-specific vertical, as opposed to purely centralized capabilities? So there's an interesting collaboration happening between MLOps and the DataOps or data mesh perspectives.

Then, we talked a little bit about metadata. Metadata is important because some really interesting solutions have already come out of the general database side. When it comes to data lakes, you always have to ask: what are the tables I have out there, what are their schemas, how do I search and discover all the tables my data analysts can query, and how do I keep track of all the different views of tables that I create? We're now seeing the same questions in the machine learning space, where you ask: okay, I've trained a machine learning artifact and deployed it, and now I have a one-to-many relationship, because I can instantiate an artifact multiple times across multiple environments. What are the models I have running? How do I discover what I can consume? How do I know what value I can get from each of these models? And going beyond the machine learning model metadata, you're also asking: what are the interfaces, the inputs and outputs, of that machine learning model? What do I have to do to consume it? So now you're moving into the concept of data metadata.
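As a rough illustration of what that model and data metadata might capture, here is a minimal sketch of a registry entry. The class, field names, and artifact location are hypothetical placeholders, not the schema of any particular metadata store.

```python
# Minimal sketch of model/data metadata for discoverability (hypothetical schema).
from dataclasses import dataclass, field

@dataclass
class ModelMetadata:
    name: str
    version: str
    artifact_uri: str
    input_schema: dict            # data metadata: what the model expects
    output_schema: dict           # data metadata: what the model returns
    deployments: list = field(default_factory=list)  # one artifact, many running instances

registry: list[ModelMetadata] = []

registry.append(ModelMetadata(
    name="churn-classifier",
    version="1.3.0",
    artifact_uri="s3://models/churn/1.3.0/model.joblib",   # hypothetical location
    input_schema={"features": "array[float], length 12"},
    output_schema={"churn_probability": "float in [0, 1]"},
    deployments=["staging/eu-west-1", "prod/eu-west-1"],
))

# Discoverability: what models do I have running, and how do I consume them?
for entry in registry:
    print(entry.name, entry.version, entry.deployments, entry.input_schema)
```

The point of recording input and output schemas alongside the artifact is exactly the one-to-many and "how do I consume this" questions raised above.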
The reason data metadata is important, and maybe this is starting to get a bit confusing, is because of the link we talked about from inference data back to training data: how do you take the data you're capturing on the inference side, the production side, and bring it into your training data lake? You have to have an understanding of the schema and the shape of that data so you can convert it, use it, and process it. It's quite interesting, because you're effectively asking: how do you go from inference data all the way back to what you see at the early stages of the machine learning lifecycle, data labelling? Can you bring some of that inference data back into data labelling so that the annotation practitioners can re-label the new data and retrain models? So it's interesting to see how the metadata and interoperability considerations are extending not just into the deployment side, but are becoming completely end-to-end.

Going back to that end-to-end lifecycle of the machine learning model, one thing we're also seeing considered is the question of security. What are the potential vulnerabilities that can be found throughout the end-to-end lifecycle of a machine learning model? When you look at the traditional machine learning lifecycle, with the data and training, the model building, and then the deployment and monitoring, you can ask: what are the different areas of the ML lifecycle where you can find security vulnerabilities? The interesting thing is that the answer is all of the red areas, which means that every single part of your machine learning model lifecycle has potential security vulnerabilities. So security has become quite a key area of research in the MLOps space, because now we need to ask how we achieve safe, cyber-secure machine learning systems that enforce controls at every stage of the machine learning lifecycle. There's actually going to be a talk later today covering some of these security vulnerabilities; as machine learning practitioners, we all know we love pickles, so we need to guard ourselves against the potential risks involved in that.
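As a small, hedged illustration of why pickles deserve that caution, the snippet below shows how unpickling untrusted bytes can execute arbitrary code. The mitigation comment at the end is just one common option (verifying artifact integrity, or preferring formats that don't execute code on load), not a complete security recipe.

```python
# Why untrusted pickles are risky: unpickling can execute arbitrary code.
# Illustrative only; never unpickle artifacts you don't trust and can't verify.
import hashlib
import os
import pickle

class NotAModel:
    def __reduce__(self):
        # When unpickled, this tells pickle to call os.system with this argument.
        return (os.system, ("echo arbitrary code executed during unpickling",))

payload = pickle.dumps(NotAModel())
# pickle.loads(payload)  # uncommenting this line would run the command above

# One common mitigation: verify the artifact's integrity before loading.
print("Artifact SHA-256:", hashlib.sha256(payload).hexdigest())
# A real pipeline would compare this digest against a value published with the artifact,
# and would prefer model formats that do not execute code on load where possible.
```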
So that was some of the technological trends, very much a whistle-stop tour. Now let's talk a little bit about the organizational trends, asking what we're seeing at the team level. Now that organizations are really building machine learning platforms and ML platform engineering teams, what does that look like? What are the trends we're seeing from organizations at that level? The first one we touched on briefly: we've started to see organizations define something analogous to an SDLC, a software development lifecycle, the step-by-step process that should be carried out within the organization to make sure best practices are followed in software development. In that context, that means things like CI/CD and using version control with Git, things we now do in our sleep; in the software development space this is a no-brainer. There have been attempts to simply take software development lifecycle frameworks and adapt them to the machine learning space, but what people have realized is that that just doesn't work. What organizations have been doing instead is creating, from scratch, a new set of control mechanisms in the form of a machine learning development lifecycle. There's actually a really interesting paper that shows the different stages and the different roles and personas involved throughout each stage of the ML lifecycle and the ML platform. This is something we have seen organizations start to define, especially now that there are organizations with tens of thousands or hundreds of thousands of developers and data analysts who need to follow these best practices to ensure compliance at scale.

We have also started to see an important shift in mindset. Traditionally, machine learning and data science projects have been very project-driven: I have a question, find the answer; what is going to be the X revenue or the Y click-through rate in this area; let's find the answer from the data. The challenge, as organizations have seen, is that by delivering projects you end up with a lot of repetition of the same capabilities. So we're starting to see that mindset shift from project to product: treating machine learning capabilities as data products. By data products, what you should think about is this: if I'm developing a classification model to predict risk for an insurance company, is the product the app that the end users would use? The answer is no, that's not the data product. The data product is what you develop for your data science stakeholders: the tools, the platforms, the decisions where you say, hey, let's roll out Airflow and actually build an interface for it; let's roll out this metadata management system with discoverability and search capabilities so that teams can see what other teams have done in the past and build upon it. Treating these as products, and taking a consultative approach to product development.

That then leads to how this looks from an organizational vision and strategy perspective. You still have to have those short-term, business-project value-delivery capabilities, where you're delivering value and answering questions with data. However, you now also have to make sure you have different cadences of roadmaps, where you have iterative tooling development, with perhaps a machine learning engineering team developing pipelines and tools, then an MLOps team developing tools and products, and then perhaps even platforms powering the capabilities of the entire organization, which is what we see as that canonical MLOps stack: how should every individual within the organization work across their ML lifecycle? They spin up a Jupyter notebook, and then they get a Kubernetes namespace and a repo where they can interact and deploy. This is something we're seeing organizations do to make sure it's not just somebody going and delivering the exact same thing multiple times. This also requires bringing in some of the best practices that started in the software space; many of you may have heard of the Spotify squad model.
This squad model is a set of cross-functional capabilities for product development and project delivery, and it's a methodology that can also be brought into the data and machine learning space: the ability to have cross-functional squads, with the concepts of what they refer to as tribes and guilds, at an organizational level. How do you make sure you adopt these capabilities and take a product approach to your data, your machine learning, and your infrastructure? That's actually quite key.

Now, moving to the final few points, let's talk about the team members involved and, more specifically, what we see being the organizational ratios. One thing we've started to see, and again this is still a discourse being explored, is a simplified view of the roles: the MLOps engineer, the ML engineer, the data scientist, the data analyst, and then other roles like backend engineers and data engineers, which we don't cover here. The way we've started to see the ratios is that you tend to have one MLOps engineer who manages parts of the platform, which contains what we can refer to as multiple use cases or use-case pipelines. One machine learning engineer is able to build multiple of these programmatic use-case pipelines, however you want to think of them: ETL, CI/CD, and so on. These are machine learning engineers building continuous delivery of machine learning capabilities. Each of these use-case pipelines has a many-to-many relationship with a group of data scientists, so a group of data scientists may be producing models for one use case and also producing models for another. One of those use-case pipelines may simply be a programmatic continuous-delivery capability for a machine learning model that gets retrained every month, or something like that. That's what we're starting to see, along with the concept of data products that can also programmatically interact with those use-case pipelines; perhaps you have something automated where, whenever drift is detected, a new retraining run gets pushed. And then, of course, you build abstractions so that data analysts and domain experts can bring their domain expertise without having to become Kubernetes experts, making sure you segregate the domain knowledge so that not everybody has to deal with Kubernetes manifests. So that's an interesting thing we've seen with organizational ratios.

Another thing to consider is that this picture assumes scale. It assumes you have maybe dozens or hundreds of machine learning models, a pretty robust capability for continuous delivery of data analysis across the organization, data-driven decision-making at organizational scale, and a pretty large machine learning platform and machine learning engineering team. But you don't have to start with that; you don't need the full might of Kubernetes when you have just one machine learning model. When you have a small set of models, what we tend to see is just data scientists, in some cases unicorn-ish data scientists, who have the full lifecycle of the machine learning model as their responsibility. These are people who are perhaps using AWS SageMaker, or perhaps using FastAPI to wrap their own machine learning models and serve them.
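As a quick sketch of what that FastAPI-style wrapping often looks like, here is a minimal example under assumed names: the model.joblib artifact and the /predict route are hypothetical illustrations, not anything prescribed in the talk.

```python
# Minimal sketch: a data scientist wrapping a model with FastAPI.
# Assumes a scikit-learn-style model saved to "model.joblib" (hypothetical path).
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical artifact produced during training

class PredictionRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictionRequest):
    # scikit-learn models expect a 2D array: one row per instance.
    prediction = model.predict([req.features])
    return {"prediction": prediction.tolist()}
```

Assuming the file is called main.py, this would typically be run with something like `uvicorn main:app`, which is exactly the kind of hand-rolled serving that works fine for one or two models but stops scaling once the model count grows.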
In that case, you have these data scientists managing the full lifecycle, and it can be managed because the number of models is very small. This goes back to the pets-versus-cattle analogy: a single model is something they can look after; they can say, okay, this machine learning model is something I know how to monitor and how to manage. Once there is a larger number of machine learning models, that's when the data scientists start getting distracted by doing DevOps: they're building CI/CD pipelines, managing the operational metrics, getting called in the middle of the night because something crashed. That is when you start seeing the introduction of the machine learning engineer role, focused on building those repeatable use-case pipelines and managing the initial set of infrastructure, which may not be as large as an end-to-end platform but still segregates roles and responsibilities. And then, once you start having production environments and platforms that require SLAs to be managed, that's when you bring in the DevOps or MLOps engineers who specialize in maintaining that platform. So those were some of the organizational trends we're seeing; again, this is still more of a discourse to explore.

Just to give some closing remarks for the wrap-up: we do have to remember that not everything can be solved with AI. When you're running around with a hammer, everything looks like a nail, so we have to remember as machine learning practitioners that not everything is solved with AI; sometimes you just need people, sometimes you just need processes to address those things. And of course, it's important to keep up the call for further discourse, because these are things that are still being explored, and it's important that we continue exploring these topics. We have to remember that critical infrastructure increasingly depends on machine learning systems, and regardless of how many layers of software abstraction there are, the impact will always be human, and we have a responsibility to ensure these systems follow best practices and the high-level principles we discussed at the start. With that, I want to thank everybody for staying awake throughout the full presentation. You can find the slides in the corner, at bit.ly state MLOps, and I hope everybody enjoys the rest of the conference. Thank you very much.