I'm David Duncan. I'm a partner solutions architecture manager, which means I'm on an alliance team, and I've been working on Red Hat solutions on top of AWS for the last eight years. Before that, I worked with Red Hat as a software development engineer. Surprise.

I want to talk a little bit about machine learning ops. Not because I want to spend a whole lot of time teaching you about ops; there's already a fair amount of understanding of machine learning and operations at Red Hat, and almost every engineer here is participating in the operations space, doing testing and build-and-test methodologies almost every day. But I think there are key understandings in design that are important for success. Like I said, I am an architect, and that's going to become painfully clear in the discussion points here. I put that on the slides to show where I fit: I put a lot of time and effort into determining where the target requirements are, and then figuring out how we're going to build products and technology to match them. But I do that in the context of the cloud, and I think the cloud is a very important part of machine learning.

We've all been inundated with conversations around generative AI, or at least I have; my team is sick of hearing the words "generative AI" twice in one day. So I want to talk a lot more about designing for success: what the key components are, how to isolate what needs to be done, and maybe that will help you as engineers, data scientists, machine learning engineers, or practitioners find a way to structure your ops, structure your configuration, and then look at some of the tools available to you. I'll get into this in a little bit, but my experience has been one in which design has led to product designation, so I'm going to spend a lot of time talking about where those products landed and a little bit about the development cycles and what made them go.

At Amazon there's a practice we call Well-Architected, and I was part of the team that designed it. But it's a big team; this is clearly not a one-person job. In the early days we were just trying to figure out how we would do the pillars: we had four pillars when we started, we got to five, and now there are six. Super exciting, it continues to grow. I'm constantly looking at where things focus in terms of operational excellence, cost, and sustainability, and how all of that effort comes together. One thing we do at Amazon that I really like, and I'll get to some of the things I really like about Red Hat's design principles too, especially in the open source community around open infrastructure, which has some people represented here at the front, is what we call the lens. A lens is a way of taking the challenges you face in any one particular component of infrastructure development, or in your line of business, and focusing down on the core decision practices that make that happen.
On the Red Hat side, what I really enjoy, and what I've talked about a lot with other people involved in this, some of whom were on stage this morning for the keynote, like Steph and Brad, are things like architectural decision records: creating infrastructure as code, or decisions as code, with a certain amount of historicity. That's an important component of what you do in your operations. If you don't have it, you're going to repeat those same decisions, and that is a complex and complicated problem.

So wherever you land in the industry, whether you're in gaming, running large-scale databases, doing point of sale, wherever you're collecting information from and whatever you want to do with it, whether that's a really basic recommendation engine or some highly structured investigation of unstructured data that may take you in whatever direction, there are lots of places to start. I personally started on the very far side of that, in the HPC lens. The HPC lens was a place where we created a structure that was very siloed: you have a front-end node, that front-end node communicates simple chunks of data out to a large group of nodes that are there just to crunch that data and return it to the front-end node. We don't do that in machine learning. Machine learning is typically done the way we do Hadoop, or in a way that is opportunistic now, and those opportunistic modes are some of the things I find really interesting; they're what make me excited enough to want to talk about this outside of my team.

As you build that structure, that pathway, each one of these areas ends up having a lens, a way to do this. In Red Hat, you have a solutions collection, and each one of those solution collections is pretty easy to review. There are people here doing that research, and people doing research at universities who are contributing to those in an open source way, and I highly recommend you spend some time reading them. There's also open infrastructure that you can participate in today to experiment, learn, and build out a structure of practice similar to what was being spoken about this morning in the keynote: open source software as a service.

Machine learning itself has a lot of component parts that need to be extremely well documented and well understood. Part of that is tools, part of that is what we capture in documentation, and some of it is just the paperwork to get things done. But ultimately it comes together in an operations model. This is the crux of what I think is really important: we have to have a data processing model, and then we need a way to build a continuous cycle around it. I guess I could have drawn this as a circle, but I didn't. Really, the focus for machine learning ops starts at collecting data.
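Going back to the decision-records idea for a moment: to make "decisions as code" concrete, here is a minimal Python sketch of how a team might capture an architectural decision record as a structured object that gets versioned alongside the infrastructure code. The field names and the `adr/` directory are my own assumptions for illustration, not a Red Hat or AWS convention.

```python
from dataclasses import dataclass, asdict
from datetime import date
import json
import pathlib


@dataclass
class DecisionRecord:
    """A minimal architectural decision record kept in version control."""
    title: str
    status: str        # e.g. "proposed", "accepted", "superseded"
    context: str       # the problem and constraints at decision time
    decision: str      # what was chosen
    consequences: str  # trade-offs accepted by choosing it


def save(record: DecisionRecord, index: int, directory: str = "adr") -> pathlib.Path:
    """Write the record as JSON so its history lives in the same repo as the code."""
    path = pathlib.Path(directory)
    path.mkdir(exist_ok=True)
    out = path / f"{index:04d}-{record.title.lower().replace(' ', '-')}.json"
    out.write_text(json.dumps({"date": date.today().isoformat(), **asdict(record)}, indent=2))
    return out


if __name__ == "__main__":
    adr = DecisionRecord(
        title="Use Ansible for node configuration",
        status="accepted",
        context="We need repeatable, agentless configuration across cloud and edge nodes.",
        decision="Standardize on Ansible collections in our execution environment.",
        consequences="Playbooks must be tested in CI; other configuration tools are phased out.",
    )
    print(save(adr, index=1))
```

The point is the historicity: each record is immutable once accepted, and later records reference earlier ones, so the reasoning behind the infrastructure travels with the code.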
Those other two components on the slide are nice to have, but they don't fall straight into the operational model; they're usually what you get handed as an engineer and are expected to carry forward. So I left them out. We're not going to talk about business; I can't do it. But I can talk about what I think the continuous cycle is. I guess I did make a circle after all; I just put it on this slide instead of the one before it.

Each of the steps in this process has to be clearly defined, and that's what I think is really important: finding the parts of this structure in each phase of the business so that you can clearly define them. I noted here that we're talking about this all the way to the edge. I actually thought about it backwards, from visualization all the way back to the edge, where the edge means finding ways to tune back into whatever it is you're evaluating, whether that's the hum of a giant tank of oil or the position of a series of ships. It doesn't really matter; each of these component parts is important. Anomaly detection is one of the first things I worked on, and I think it's a really important part of that edge computing process: how do I get this data over here, which data do I transfer, how important is it, what is the clear path for doing that?

Thinking about things from this perspective, and listening to customers talk to me about it, I spent a lot of time trying to determine what the first component of this was going to be. A lot of that came from realizing that most people were just looking for a model. They weren't thinking, "I'm going to go make a model"; they had one and wanted to implement it. If you don't know what it does and you just need something that works, you've got to figure that out in this process. This is messy. This is not where you're doing ops. This is where you're trying to figure out whether you have a structure at all; it's essentially the first part of your machine learning experience, not where you're thinking about how to properly tune the configuration.

The good news is that once you've done this a couple of times, you'll start to have an operational phase. This part will start to look a lot like that little black dot in the middle, and you'll understand how to create all the essential silos in which you can build your infrastructure model, your CI/CD pipelines, and how that configuration comes together. Then this first phase, the pilot phase, becomes a lot less messy. For each of the next iterations, you now have this machine learning ops model: you already have the decision records, you already have all these things in place, and you can start to push them into play. That's the great part about this conversation, because when you look at how much of this effort actually goes into production, these are small numbers, but they're consistent. I did a little background checking, and yes, these numbers are still good.
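To make "each step clearly defined" a little more concrete, here is a minimal Python sketch of the continuous cycle I'm describing. The stage names and stub functions are my own simplification, not a specific Red Hat or AWS pipeline; in practice each stage would be a separately owned, documented, and tested step.

```python
"""A minimal sketch of a continuous ML ops cycle with placeholder stages."""


def collect_data():      return {"raw": [1, 2, 3]}                    # e.g. pull edge data into S3
def prepare_data(raw):   return {"features": raw["raw"]}              # cleaning, labeling, features
def train_model(data):   return {"weights": sum(data["features"])}    # training job
def evaluate(model):     return {"accuracy": 0.9}                     # hold-out metrics, explainability
def deploy(model):       print("deployed", model)                     # push artifact to serving cluster
def monitor():           return {"drift": 0.02}                       # drift / anomaly detection


def run_cycle(max_iterations: int = 3, drift_threshold: float = 0.1) -> None:
    """Loop through the stages until monitoring reports the model is healthy."""
    for i in range(max_iterations):
        data = prepare_data(collect_data())
        model = train_model(data)
        print(f"iteration {i}: {evaluate(model)}")
        deploy(model)
        if monitor()["drift"] < drift_threshold:
            break  # healthy; keep monitoring until drift forces another pass


if __name__ == "__main__":
    run_cycle()
```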
You're still not making a whole lot of final decisions; you're doing a lot of transitioning from one model to the next to determine exactly what you're going to get out of it. I have lots of fun stories about making those decisions, changing models, retraining, and then determining whether the aspects of that determination were in fact valid. And explainability is one of those things where it is almost guaranteed that you will have a lot of trouble if you do not already have a process and pipeline that makes it work.

I'm highly opinionated, so take this with a grain of salt. I'm truly biased when it comes to what these operations look like in real life, how they actually get structured, because I started from the simple fact that I work on a team where the center of my universe is OpenShift. A slide like this represents seven years of my life. Making OpenShift a product and turning it into a service on AWS started with creating incremental changes that people could use in the context of previous versions of OpenShift. This didn't appear for us until OpenShift 4, but on OpenShift 3 I wrote a scale-out model, because the problem with building out a cloud structure for OpenShift was that no one had thought about how scale would work. Scaling out was super easy: you just buy another node, and that was a fun part of the experience. But the goal, I thought, was to scale in: I'm not using it today, so how far down can I scale? My favorite support question from that era, the emergency support call you get on GitHub, was from a customer who said, "I've scaled down to one node and I don't seem to be able to scale back up." That is true; etcd is not going to let you do that anymore, and I'm sorry. But the good news was that you had a CloudFormation template that would deploy it all over again, and you didn't have to recreate the auto-scaling groups that allow you to scale in and scale out.

The design we came up with was an auto-scaling lifecycle hook configuration: the hooks would do all of the node setup, and when you wanted to scale down, they allowed you to tear the node down in a proper way in the old OpenShift Container Platform model (there's a rough sketch of that hook setup below). That was the start of getting the people making the business decisions very excited about the fact that they had an incorporated model. As a foundational layer, that was the thing that created the excitement behind two business decisions, one on the Red Hat side and one on the AWS side, that brought us a service that would in fact support doing a lot of that work. Everywhere you go now, that product design has become a really important part of how we build this concept of two platforms talking to each other and, in fact, integrating together. Being integrated means you have to solve other problems on top of the problems of just creating a container-based platform. There are many different ways to deploy OpenShift on AWS now, so suddenly you've got a big choice to make about how this is going to happen, who's going to manage it, what's going to be going on. And then there are the vanilla options you can do yourself: what am I getting myself into?
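Picking up the auto-scaling hook configuration mentioned above: here is a hedged boto3 sketch that registers lifecycle hooks so a node can run setup on launch and be drained properly before termination. The group name, hook names, and ARNs are placeholders I've made up, and the OpenShift-specific drain logic is only indicated by comments.

```python
"""Sketch: lifecycle hooks on an Auto Scaling group so nodes are set up on
launch and drained cleanly before termination. All names are placeholders."""
import boto3

autoscaling = boto3.client("autoscaling")

# Pause instances entering the group until node setup (for example an Ansible
# play that joins the node to the cluster) signals completion.
autoscaling.put_lifecycle_hook(
    AutoScalingGroupName="openshift-compute-nodes",
    LifecycleHookName="node-setup-on-launch",
    LifecycleTransition="autoscaling:EC2_INSTANCE_LAUNCHING",
    NotificationTargetARN="arn:aws:sns:us-east-1:123456789012:node-lifecycle",
    RoleARN="arn:aws:iam::123456789012:role/asg-lifecycle-role",
    HeartbeatTimeout=900,
    DefaultResult="ABANDON",
)

# Pause instances leaving the group so pods can be drained and the node
# removed from the cluster before EC2 actually terminates it.
autoscaling.put_lifecycle_hook(
    AutoScalingGroupName="openshift-compute-nodes",
    LifecycleHookName="node-drain-on-terminate",
    LifecycleTransition="autoscaling:EC2_INSTANCE_TERMINATING",
    NotificationTargetARN="arn:aws:sns:us-east-1:123456789012:node-lifecycle",
    RoleARN="arn:aws:iam::123456789012:role/asg-lifecycle-role",
    HeartbeatTimeout=900,
    DefaultResult="CONTINUE",
)

# Whatever worker handles the termination notification drains the node, then
# completes the lifecycle action so the scale-in can proceed:
# autoscaling.complete_lifecycle_action(
#     AutoScalingGroupName="openshift-compute-nodes",
#     LifecycleHookName="node-drain-on-terminate",
#     LifecycleActionToken=token,
#     LifecycleActionResult="CONTINUE",
# )
```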
Oh my gosh, now I've got a practice that has to make all these decisions. And that's where the decision records come in. You can go back through them, make a decision, and then make another one based on it. In fact, there are blueprints in the open infrastructure community that talk about some of those failures, the failed decisions, or the advantage of one option over another, like a decision about one open-source project or platform component over another. Then you relate those back to best practice; that's what those Well-Architected lenses are for. The components you find in there act as a forcing function to ask questions like: is this sustainable? Does this apply operational excellence? Am I getting the scalability I expected to get? Using those techniques and the technologies associated with that kind of question-and-answer experience is a great way to inform that audience.

So take those six phases and the operations, look at how you're going to put that together, and you'll start to see that there are going to be some big holes. You need an inventory, an operational inventory. Once I looked in and said, we've got ROSA on the way, what else are we going to need? Well, it turns out you need a way to build out this infrastructure in a way that's consistent with your expectations, and I chose Ansible. An obvious choice, right? Again, I work on the alliance side, and I know that puts food on my table, but Ansible is also my personal favorite; I can make that decision outside of my own experience. There are some things that I think are important, and they're supported by creating enthusiasm for that same kind of technology. Identifying that this was really important made me keep talking to our CloudFormation team and say, hey, I really like what I can do with Ansible, and I don't like writing this giant template and then not being able to iterate very well or find ways to do more minimal tests. That team decided that they didn't like having just one way to do it either, and that they wanted to incorporate that strategy into what they were doing, and that's how the Cloud Control API was born, out of that iteration, out of us talking back and forth. That didn't happen just because of Ansible; it happened because tools were changing. Chef is going away, Puppet is going away, nobody wants to use those anymore. The server model doesn't work for us, and it certainly doesn't work for machine learning, and I'll get more into that. We were looking for an easy way to do the discovery for the resources, to build out those best practices in an iterative model, and in a way that was easily consumable. So Ansible becomes an easy place for us all to create opinionated decisions. On the other hand, it travels well. This is just one aspect of it, and yes, this is David's opinionated inventory. When you get out to the edge you're going to use different collections, it's going to be a different experience, but it's all going to be part of your execution environment, and the support infrastructure is covered.
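To show the kind of uniform resource interface the Cloud Control API gives you (the thing that came out of those conversations), here is a hedged boto3 sketch that creates and reads a resource through a single generic API instead of service-specific calls. The bucket name is a placeholder, and this is my illustration, not the internals of any Ansible collection.

```python
"""Sketch: manage AWS resources through the Cloud Control API's single
create/get/update/delete surface. The bucket name is a placeholder."""
import json
import time
import boto3

cloudcontrol = boto3.client("cloudcontrol")

# Declare the desired state of an S3 bucket; the same call shape works for
# any supported CloudFormation resource type.
response = cloudcontrol.create_resource(
    TypeName="AWS::S3::Bucket",
    DesiredState=json.dumps({"BucketName": "example-mlops-artifacts-bucket"}),
)
token = response["ProgressEvent"]["RequestToken"]

# Resource operations are asynchronous; poll the request until it settles.
while True:
    status = cloudcontrol.get_resource_request_status(RequestToken=token)
    state = status["ProgressEvent"]["OperationStatus"]
    if state in ("SUCCESS", "FAILED", "CANCEL_COMPLETE"):
        break
    time.sleep(5)

# Read the resource back through the same generic interface.
resource = cloudcontrol.get_resource(
    TypeName="AWS::S3::Bucket",
    Identifier="example-mlops-artifacts-bucket",
)
print(resource["ResourceDescription"]["Properties"])  # JSON string of properties
```

That uniform shape is what makes it easy for a tool like Ansible to discover and manage resources iteratively instead of rendering one giant template.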
Originally there was nothing on AWS except Red Hat Enterprise Linux. In fact, that was the only thing that ran on Amazon originally: in 2006 you could get a Red Hat box, and that was it. Once we started to look at what we were doing with data, there were a lot of places to go, and some other things became obvious: we still need high availability. High-availability configurations were absolutely necessary, and we still need a way to manage data at the edge. There are lots of times when you're collecting a lot of information and you need to put it into a large-scale data lake, and we found that Microsoft SQL gave us some options for collecting that data at the edge: it made it very easy to have simple communications over VPN, collect small amounts of data, and then pass that data into buckets associated with long-term storage. That has driven a lot of variation in the way we build those.

Looking at an example of how we build a pipeline: we start with code, including the infrastructure as code, and OpenShift stays in the middle. In the early days of machine learning, for the data scientists and the people actually crunching the algorithms and trying to determine whether they were fully functional, there was a tool called the Deep Learning AMI, and the Deep Learning AMI was exactly what I didn't want it to be. But it works; it seems to work, and everybody loves it. It's a collection of basically every scientific library you can think of for Python, Ruby, and R, dumped into one machine image with all of the NVIDIA CUDA drivers, and you can leverage that for component parts. But that doesn't make a pipeline. So in this case, outside of the pilot phase, these are the things you end up doing to get structured data, and this represents, for me, a data practitioner.

If you look at the way this works, you'll see in the far right corner the concept of using Amazon Athena, which is a way of running SQL-style analysis over the contents of an S3 bucket; if you have structured or mildly structured data, you can create a configuration and get something back. AWS Glue is a way to create tokens and tag data so that it goes from an unstructured to a more structured model (a rough sketch of that Athena and Glue step follows below). You can take the artifacts and the trained model from that experience and push them into production. Once you have an approval process to transfer that over, you can move it into your production cluster; the artifacts come with you, your trained model comes with you. You can pull batch inference production data from there and then use that model in a way that's consistent with your business requirements, your recommendation engine or whatever it is. That whole process is exactly what I think of when we bring this out in practice, and that's why there are tons of tools for doing this. There's the single-node Outpost, an edge solution we can use, and we can use that in the context of Greengrass for the AWS experience, which gives us a small way of running functions at the edge, so you can collect that information and put it into S3.
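Here is a hedged boto3 sketch of the Athena-plus-Glue piece of that pipeline: catalog an S3 prefix with a Glue crawler, then query it with Athena. The bucket, database, crawler, and table names are placeholders I've made up for illustration.

```python
"""Sketch: catalog raw S3 data with a Glue crawler, then query it with Athena.
All names (bucket, database, crawler, table) are placeholders."""
import time
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# 1. Run a Glue crawler (assumed to already exist) that scans the raw bucket
#    and writes table definitions into the "mlops_demo" Glue database.
glue.start_crawler(Name="example-raw-data-crawler")

# 2. Run a SQL query over the cataloged data with Athena; results land in a
#    separate results bucket.
query = athena.start_query_execution(
    QueryString="SELECT label, COUNT(*) AS n FROM sensor_readings GROUP BY label",
    QueryExecutionContext={"Database": "mlops_demo"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
execution_id = query["QueryExecutionId"]

# 3. Poll until the query finishes, then read back the rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)
    status = state["QueryExecution"]["Status"]["State"]
    if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if status == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)
    for row in rows["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```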
This is the simplified version. As the data engineer who works with this more and more, you may find that the basic tools no longer work, and you'll create a more bespoke model consistent with building out on the ROSA experience, where the Open Data Hub tools will take you through that whole process. That's effectively everything I have to talk about today, other than to say that this is a great opportunity to talk about machine learning ops, what makes it work, and to see a little bit of an example, from an architecture point of view, of what it looks like. Does anybody have any questions?

You briefly mentioned monitoring and model drift. In your experience, how difficult was it to set up both the monitoring part and dealing with model drift and retraining? How difficult was the whole cycle for you?

Well, for me, model drift is less of a concern, because obviously I'm more focused on operations than on the models themselves. But what I see from my friends is that the training happens consistently once you put this into operation: you've got the Ansible models building up the systems, then you do your basic training, and you've got that model connected to your data. I didn't mention public data sets, but public data sets are a huge help in terms of training, and they quite commonly come across as an S3 bucket. I would use those in the context of OpenShift: I'll grab the assignment for the public data set and put it in an Open Data Hub workspace. Building out the YAML structure for that and making it work reduces the amount of effort necessary for retraining. That's what I was talking about: the pilot phase is messy, and there will be much gnashing of teeth while you try to push it back together. Even though it's incredibly reproducible, because you just pick one and use it over and over again, if there's a security vulnerability or a modification you really want to take advantage of, you have to switch your entire environment.
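As a concrete illustration of pulling a public data set out of S3 so a workspace can train against it (the kind of thing that would then be wired into an Open Data Hub workbench via YAML), here is a hedged boto3 sketch using anonymous access. The bucket name and prefix are placeholders, not a specific public data set.

```python
"""Sketch: download files from a public S3 data set anonymously so a training
workspace can consume them. Bucket and prefix are placeholders."""
import pathlib
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Public data sets allow unauthenticated reads, so send unsigned requests.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

BUCKET = "example-public-dataset"   # placeholder bucket name
PREFIX = "sensor-readings/2024/"    # placeholder prefix
DEST = pathlib.Path("data")
DEST.mkdir(exist_ok=True)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        target = DEST / pathlib.Path(obj["Key"]).name
        s3.download_file(BUCKET, obj["Key"], str(target))
        print("fetched", target)
```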
Any other questions?

Model serving is obviously a big part of operating models on the inference side. A couple of years back there was a myriad of young projects trying to solve that, starting from just wrapping a model in a Flask application and deploying it as a container, then TF Serving and some other projects. I wonder if we are now converging to a more cohesive, consistent deployment model when it comes to the inference part, and what is Amazon doing on their back ends or their AI services? Is it always just homegrown depending on the use case, or is there some convergence?

I think there is always an opinionated convergence. If you look at how Red Hat does this in the Open Data Hub, you have all the tools that are associated there, and the standard structure is to support the pilot phase in Jupyter notebooks, then create that configuration and maintain those models in specific containers so that you have that container space. On the Amazon side, the way they do that is SageMaker: they use some very opinionated models and say, you, Mr. Customer, can use all of these exactly the way they exist today, or you can make a decision later on how you're going to use them, use a different one, interchange them. But in my opinion, part of what we did in building a RHEL workstation on top of AWS was to create spaces where, if you knew what you wanted to do in that messy phase, you could have immediate access to the right kind of hardware from a Red Hat instance you already know, and create an open source model where you might want to use it.

I think we're out of time. One more minute? Oh, okay, one more minute. Another question?

You mentioned the workstation, and I've been using one for a while, but I wonder: it's primarily intended for training models, but do you have any sort of...

Well, okay, talking about it from the perspective of the Amazon products: SageMaker, I think, is really meant to leverage models that are already there, and then you have the ability to do some additional training on some of the existing models, and you can add your own if you want to in that process. There is what they call machine learning ops, basically a machine learning ops service that is available there. For me, obviously, that's not where my work extends; my work extends into how I can make this work on the Red Hat OpenShift Data Science model, which gives you a similar kind of environment, but done in a very open source way, in a way that is achievable on other platforms. So if you have a hybrid scenario and you want the same experience from your on-premises environment to your public cloud experience, I would say you can do that in the context of OpenShift Data Science or Open Data Hub, and SageMaker gives you sort of similar tools in that sense. If I wanted to go build a cluster and then use Dask to handle a complex problem across a suite of systems, I can do that as easily in the context of an OpenShift cluster on AWS as I can building it out in terms of machine learning with SageMaker and SageMaker notebooks.

Well, it was a very nice talk, and if you have any further questions, I think he would be more than glad to take them. I'd be excited to talk about it, yeah; I could talk about it all day.