Good morning, folks. Welcome to the session on open data and AI/ML for storage system reliability. Our presenters today are Yaarit Hatuka, who is a software engineer in the emerging technology storage team at Red Hat, and Karan Raj Chauhan, who is a data scientist at AWS. He works on improving and optimizing cloud operations by applying data science to them, as well as building an open source community around this domain. Yeah, we're going to talk about how we can improve storage system reliability using open data and machine learning. We'll cover the motivation for this project and how we're building an open drive health dataset. We'll see a dashboard demo with all the data we have so far, and then dive a little into some initial exploratory analysis and some of the unsupervised approaches we can take on this data. Then we'll transition to some of the challenges we've faced so far and some next steps for this project. So let's start with the big why: reliability. Data centers experience frequent IT equipment failures, and the sad truth is that all disks fail; it's just a matter of when. Disks are one of the most failure-prone components in data centers. It's true that storage systems often implement redundancy, whether hardware or software RAID, mirroring, erasure coding, and so on. But whenever a drive fails, the redundancy factor is decreased, and that increases the chances of data loss until the failure is addressed and the data from the failing device is restored. And of course, performance is also impacted: since re-replicating the data is usually a high-priority operation, it can happen during peak hours, and it involves a lot of drive I/O. Now, if you could predict when a drive is going to fail, you could preemptively migrate the data directly off that drive while it is still online, so there's no need to involve replicas or rebuild from RAID.
And then we can schedule the data repair for off-peak hours and avoid undesired load on the system, all without lowering the replication factor. So why do storage devices fail to begin with? This has been the subject of many studies in recent years. There are different types of storage drives with different underlying technologies, and that means different reasons for failure. For example, an SSD can fail simply because it reached its maximum write-endurance limit, while a hard drive can fail just because of vibration in the server rack. Other potential causes include temperature, humidity, firmware errors, noise, sudden power outages, or media defects. And this just in, from yesterday: apparently hard drives can allegedly crash if you play them some Janet Jackson music. It involves a certain resonant frequency, and it is filed as an official CVE. What if the drive could communicate useful information about its health? Well, it can, and there's a standard for health metrics reporting. It is called SMART, which stands for Self-Monitoring, Analysis and Reporting Technology, and it reports attributes and thresholds. For example, the drive can communicate its temperature, its reallocated sector count for hard disks, its wear leveling count for SSDs, and so on. It was developed to evaluate the device's health based on the data its internal sensors can communicate, and to alert in case of an imminent failure. It does include a built-in health assessment for the drive, but the thing is that this assessment is not accurate enough and has a lot of false negatives. And it only provides a binary result: failing or not failing. And if it is failing, then when? Do we still have time to schedule a rebuild for off-peak hours? So in recent years there have been studies that use the full range of SMART attribute data to try to build a better model. But for that, we need data. We need a lot of data. What do we have available today?
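To make the SMART idea concrete, here is a minimal sketch of pulling a few health indicators out of smartctl's JSON output (`smartctl --json -a /dev/sda`). The key layout follows the smartmontools JSON schema as I understand it, but treat the field names and the sample blob as assumptions to be checked against your smartctl version, not as the exact format.

```python
import json

# Illustrative sample blob, shaped like smartctl's JSON output
# (assumed schema; verify against your smartmontools version).
SAMPLE = json.dumps({
    "model_name": "ExampleDisk 4TB",
    "temperature": {"current": 34},
    "ata_smart_attributes": {
        "table": [
            {"id": 5, "name": "Reallocated_Sector_Ct", "raw": {"value": 0}},
            {"id": 187, "name": "Reported_Uncorrect", "raw": {"value": 2}},
        ]
    },
})

def extract_health(report_json: str) -> dict:
    """Flatten a smartctl-style JSON report into a {metric: value} dict."""
    report = json.loads(report_json)
    out = {"temperature_c": report.get("temperature", {}).get("current")}
    for attr in report.get("ata_smart_attributes", {}).get("table", []):
        # Keep the raw counter value, keyed by the SMART attribute ID.
        out[f"smart_{attr['id']}_raw"] = attr["raw"]["value"]
    return out

print(extract_health(SAMPLE))
```

A flat dict like this is exactly the kind of per-drive feature vector the prediction models discussed later would consume.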
So much like the proprietary versus open source software paradigm, we see the same thing with data. Academic papers rely on proprietary datasets which are under NDA; they mostly come from Amazon, Google, or other big cloud providers. And of course, storage vendors also have their own private datasets, but they cannot share them. Back in 2013, Backblaze, which is a cloud storage provider, pioneered an open drive health dataset, which is the go-to dataset for data scientists who want to come up with a better failure prediction model. But the thing is that this dataset is lacking diversity in hard disk and SSD vendors and models. So for creating effective prediction models, we need a large, diverse, free and open drive health dataset. And we're building it with Ceph. Ceph is a distributed storage system. It provides object, block and file storage, all in a single cluster. It is software-defined, it runs on commodity hardware, it is self-managing wherever possible, and it is free and open source software. So how do we build this open dataset? Back in 2018, we introduced the device health Ceph manager module, which allows for user-friendly drive management. It includes scraping and exposing drive health metrics like SMART. Around the same time, we introduced the telemetry Ceph manager module, which allows users to report anonymized, non-identifying data, both about their cluster deployment and configuration, and, optionally, anonymized health metrics of their storage drives. Now, the cluster data helps Ceph developers understand how Ceph is being used in the wild, identify issues, and prioritize bugs. The device data is aimed at building this open dataset for data scientists, in order to create accurate failure prediction models. Let's take a look at the architecture. So we have a Ceph cluster. We have a Ceph manager module.
Sorry, a Ceph manager daemon that collects all the data from the other daemons in the cluster and then compiles the telemetry reports. These are JSON reports, and it sends them to the telemetry backend, where we have a PostgreSQL instance and a Grafana instance that queries this database and presents all sorts of aggregated statistics on a public dashboard. We also upload the device health metrics dataset to the Massachusetts Open Cloud. A word about privacy: the data collected is anonymized on the client side. We're not collecting any identifying or sensitive information, so there are no pool names, object names, object contents, none of that. We replace the host name and the drive serial number with random UUIDs, so we cannot identify them. We never store IPs. And there are separate telemetry reports that are generated, one with the anonymized cluster data and the other with the anonymized drive health metrics, and they are sent to different endpoints. This is in order to enhance users' privacy. Let's take a look at the dashboard. Real quickly, this is the cluster data dashboard. It is public, and you can see a lot of aggregated statistics here about the clusters that are reporting. And this is the devices dashboard. We can see that there are about 50,000 devices reporting, and I'll talk briefly later about what we mean by valid telemetry. This is for the last 30 days. We can see that Seagate is the most popular vendor out there, and right after there's HGST, which was acquired by Western Digital, and then right after we have Western Digital itself. So these are the most popular devices that are reporting. We can see the trend of how many active drives were reported in the last 30 days, and of course you can change the time range here. It's interesting to see the devices by type: we like to say that hard disks are old technology and SSDs are the future, but HDDs are still super popular; they are the majority of reporting devices.
And we also like to see a breakdown of SSD devices by interface. The most popular is still SATA; very close behind, NVMe drives are also very popular; and SAS is just not that popular among our reporting devices. You can also see tabular panels, and you can play with looking at specific vendors or models. Now, a word about the data format. First of all, you can find the dataset at this link. The data is split into CSV files, one file per month, and each file has four columns. There's the timestamp, which represents when the health metrics were scraped. Then we have the device UUID, which is the anonymized unique identifier of the reporting device. Then we have a flag called invalid, which notes whether the content of the report is valid or not. What we mean by that is that sometimes the smartmontools version is too old and does not have the JSON output option, so the smartmontools output will just be an error. It could also be due to sudo permission issues, or sometimes the drive is behind a VM or hardware RAID, so it will not be able to report valid health metrics. This is not a failure label. And it's important to mention that the reason we include these reports, which are basically errors with invalid telemetry, is to serve as a heartbeat, so that we know the drive is still online. And finally we have the report column, which is the JSON blob that holds all the output of the device report: SMART metrics, and NVMe metrics where available. Cool. Thanks, Yaarit. So in the next half of the presentation, I'm going to go over what the data actually looks like, how you can work with this data, and some of the things you can do with it. The first thing to note, as Yaarit mentioned before, is that it contains four different columns, and this data is essentially a collection of different time series.
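The four-column layout just described can be read with nothing but the standard library. The sketch below uses made-up column names and sample rows that mirror the description (timestamp, device UUID, invalid flag, JSON report); they are illustrative, not the exact published schema, so check them against the real files.

```python
import csv
import io
import json

# Tiny stand-in for one monthly CSV file (column names are assumptions).
SAMPLE_CSV = """ts,device_uuid,invalid,report
2022-07-01T00:00:00,aaaa-1111,false,"{""smart_5_raw"": 0}"
2022-07-01T00:00:00,bbbb-2222,true,"{}"
"""

def load_valid_rows(fileobj):
    """Yield (timestamp, uuid, parsed-report) for valid reports only.

    Invalid rows are heartbeats: they tell us the drive is still online,
    but carry no usable health metrics, so we skip their payload.
    """
    for row in csv.DictReader(fileobj):
        if row["invalid"].lower() == "true":
            continue  # heartbeat only
        yield row["ts"], row["device_uuid"], json.loads(row["report"])

rows = list(load_valid_rows(io.StringIO(SAMPLE_CSV)))
```

For the real 38 GB monthly files you would stream this row by row (or use Dask, as shown later) rather than materialize everything in memory.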
So for each device UUID, at each point in time, you have a JSON blob as the output, which is the meat of the dataset, and you have that for a bunch of different devices. Well, first things first: a JSON blob doesn't make for very informative or pretty graphs, and it's also not something that most models can directly take as input. So the first thing we need to do is convert it into a dataset that looks more like a rows-by-features matrix, which a model can usually take as input. There's a bit of restructuring gymnastics involved with this, which we'll go over later, but at the end of it, the data should look something like this. I mention this because it is not a trivial task. Just to give some context, one month of data, say July 2022, is about 38 GB of uncompressed CSV, so you'd need a decent amount of resources to do it directly in memory, which we don't; we have some tools for that. But yeah, the first thing is to get it into a format you can actually explore and build models with. And once we have it in this format, it's quite straightforward to do some basic analysis. First of all, things like what the population distribution of models looks like and how that has varied over time. I think this is one of those things that makes it a unique dataset: there's a very, very wide variety of models and vendors in the dataset. These plots are similar to what you've seen on the Grafana dashboards before, so I'm not going to go into them too much; they should correspond to what you've already seen. Cool. Other than that, since we're concerned with device failure, one thing to look at is how old the disks reporting data to us actually are. This data is from January to April of last year, 2021, and it looks like most of the disks that do report data are about four years old or less.
So they're quite young. That sets some expectations about what kinds of failures you can see: there are sudden-outage kinds of failures versus slow, gradual failures, and just going off the age distribution, it seems like the former is more likely to appear in this dataset. Okay. So that's basically some of the exploratory data analysis, and I can go more into that in the demo part. Once we've done this, we want to look into some ML approaches we can actually apply. Since we don't have labels, it's not possible to do a binary classification of pass versus fail, or failing versus working. But one thing we can do is a time series analysis: we can look at the metrics' behavior over some past interval, predict what they are going to look like in the future, and then let the operator or the subject matter expert decide whether to take that disk out, keep it in, and so on. So let's look at the time series analysis. Right off the bat, one thing I noted here was that the data is either pretty much flatlining (most metrics will be just flat, with no major changes until the very end), or jittery, a time series that almost looks like a bunch of step functions put together. So basically, it's going to be a non-stationary time series. To deal with that, we tried a couple of pre-processing steps, like differencing and log transformation. But still, the takeaway is that because of the nature of the time series, it's going to be much more difficult to make a long-term forecast than a short-term forecast. Okay, so in the basic notebooks that we have here, the setup we used was a seven-day lookback window, because that's how much data I think we're guaranteed to have in telemetry, and then we predict the next seven days.
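The two pre-processing steps just mentioned, differencing and a log transform, can be sketched in a few lines. This is a bare-bones illustration with plain lists; a real pipeline would use numpy or pandas, and the sample series is made up to look like the step-function behavior described above.

```python
import math

def log_transform(series):
    """Compress the scale of a counter-like metric."""
    # log1p handles the common case of zero-valued counters.
    return [math.log1p(x) for x in series]

def difference(series, lag=1):
    """First differencing: turn level shifts into isolated spikes."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

raw = [0, 0, 0, 5, 5, 5, 9]   # a jittery "step" series, like the metrics above
diffed = difference(raw)       # non-zero only where the level jumps
```

After differencing, a step-function series is mostly zeros with spikes at the jumps, which is much closer to stationary and easier for ARIMA-style models to handle.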
And again, we don't want to keep it too short, because knowing just the stats for tomorrow is not going to be super helpful; but also, because we have a short lookback period, it's not possible to make long-horizon predictions. Okay, so with that setup, we have two baselines. The first is predicting that the next seven days are going to be the exact same value as the previous day. The other baseline is that the next seven days are going to be the same as the average of the previous seven days. So these are just some simple, very, very basic baselines, and if your model can't beat them, then it's probably not a worthy model. With that said, we explored mostly the ARIMA family of models, Prophet, and the baselines. Based on the initial experiments, it turned out that Prophet with logistic growth was the best-performing model, followed by ARIMA, the baselines, and then a vanilla Prophet model. Also, it's kind of tricky to have one single evaluation metric to call something better or worse, because of the way this time series analysis has been done: we're building models and predicting over different time windows for each device and each metric, so it's hard to aggregate all of that information into one singular metric. That said, the metric I had been using in this notebook was mean absolute percentage error (MAPE). Cool. So let's quickly go into the demo as well. Part of this data has been made available as a Kaggle competition, so you can look at it, try to beat the competition, or submit your models and so on. We invite you to do that. And in addition to the data, we also have two notebooks to get started with the data. It's probably better to show it here.
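The two baselines and the evaluation metric just described are simple enough to write out directly. This is a minimal sketch of the same idea, not the notebook's actual code; function names are mine.

```python
def baseline_last_value(history, horizon=7):
    """Baseline 1: repeat the last observed value for `horizon` days."""
    return [history[-1]] * horizon

def baseline_window_mean(history, horizon=7, window=7):
    """Baseline 2: repeat the mean of the last `window` days."""
    tail = history[-window:]
    return [sum(tail) / len(tail)] * horizon

def mape(actual, predicted):
    """Mean absolute percentage error (assumes no zero actuals)."""
    errs = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errs) / len(errs)

history = [1, 2, 3, 4, 5, 6, 7]        # seven-day lookback window
flat = baseline_last_value(history)     # [7, 7, 7, 7, 7, 7, 7]
mean = baseline_window_mean(history)    # [4.0] * 7
```

Any ARIMA or Prophet model worth keeping should score a lower MAPE than both of these on the same seven-day horizon.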
But yeah, so first is the exploratory data analysis notebook, which basically makes it easy to load data for any time frame and then use Dask to do out-of-memory processing on it, so you can easily work with data that is much larger than what fits on your machine. I think we're running out of time, so I'll skip the majority of the demo. But we have some functions in there as well to make it easier to get your data into a format you can work with, and then do some exploratory analysis: the distributions, the valid and invalid reports over time, whether you can build any heuristics that can indicate failure, and so on and so forth. Cool. Then we also have the time series analysis notebook, where we basically look at a couple of device UUIDs, and for a given set of critical metrics we try to build some forecasting models. So this is what I was talking about, the general time series behavior; as you can see, it's kind of step-function-like, or jittery. So we're going to build a model for one of these metrics, starting with the baselines and then going into the ARIMA models. For ARIMA as well, we're only going to do some basic tests to determine the parameters of the models, but of course you can fine-tune the hyperparameters to get a better model. Whatever kind of model you want to work with, the data is here for you to get started with. Okay, cool. Going back to the presentation. All right, yeah. So one of the things we have for the audience as a call to action: by default, telemetry is turned off. So if you want to contribute to this dataset, we recommend turning it on. To do that, you can either use the command line, by running `ceph telemetry on`, or you can use the dashboard, your Ceph dashboard, to turn on telemetry.
And if you'd like to preview what data gets sent, before or even after turning on telemetry, you can use these commands: `ceph telemetry show-all` shows you what data you are sending right now, and if you want to preview what data you would be sending if you turned on telemetry, you can run `ceph telemetry preview-all`. In terms of the dataset and the analysis, like I said, the full dataset is available, and a chunk of the dataset is also available on Kaggle, along with some notebooks to get you started. Again, these notebooks are very basic getting-started material, so they're not going to show state-of-the-art performance. We invite you to take a look and see how you can contribute or improve these models. And yeah, the next steps for this project: well, first, the metrics we collect now are SMART metrics, which is a pretty general, industry-wide standard. But some vendors could be working on specific metrics for their own specific devices, and we want to start collecting those as well, to allow for more accurate information. The second thing is that in addition to these metrics, we also want to collect some host-level metrics: things like latency, performance, and so on. Some recent research seems to suggest that these things are important as well. And finally, we want to make sure this is not something specific to Ceph. We want to have a standalone metrics collector slash device failure predictor that anyone can use on their individual disks or their storage clusters and so on, and not be limited to just Ceph. And I think that's pretty much it for the next steps. Right, I didn't see the time. Okay, so the challenges we did run into while working on this: well, first, there are no ground-truth labels, so no classification problem is possible.
There are some possible heuristics, like if the host is active but the device is not reporting, then maybe that could mean something is wonky. So you could use some heuristics, but we haven't explored that a lot. And finally, the time series analysis we have here is univariate forecasting, but obviously these metrics are tightly coupled together, so if you were to take multiple metrics together in the forecast, that would improve its accuracy. The other major challenge we had is that there is no universal definition of failure. Different operators and different data centers have different tolerances for when they would want to take a drive out or keep it in. Because the definition itself is subjective, it's hard to build a model to track it. And in addition to operator preferences, there are also differences among device types: failure for HDDs is going to look much different than failure for SSDs, and even among interfaces, failure for SATA versus NVMe is going to look different as well. So in general, there's a lot of variation in what it means for something to fail, and that has also been a challenge. And I don't think I'm missing any other slides. So yeah, that was all for this talk. Thanks for staying over time, and if you have any questions, we're here to answer them. Thank you. Good morning, folks. Welcome to the second talk of the session. This talk is on modernizing HPC in the hybrid cloud via open source. Our speakers today are Marius Bogoevici and Eric Rosenbaum. Marius is the chief solutions architect for the financial services team at Red Hat, working on digital transformation, big data, and AI. Eric serves as the chief technologist in Red Hat's global FSI team, where he helps clients meet their strategic priorities with the use of open source technologies. Thank you very much.
So today we're going to talk to you about modernizing high-performance computing in the hybrid cloud via open source. As was mentioned, I'm Eric, and I'm joined here by Marius. So Marius, will you take us through this? Of course. Thanks, Eric. So first, show of hands: who here is a Red Hatter? Awesome. So we're among friends here; feel free to interject, come up, ask questions. What we're going to talk about today is basically what additional challenges high-performance computing has besides, well, performance, and how specifically open source software can help overcome those challenges. You see a number of projects in here which are part of the demo that we'll see later today; there are many, many others that can contribute to the success of high-performance computing projects. What I wanted to say is that open source, as you know, has moved from being the provider of equivalent but more accessible services to becoming an innovator. That enables people to think in different terms; it enables them to do things better, but also to start thinking of new and better business processes. And that's something that Eric and I, working in the financial services industry, bring our experience to bear on. Before we start, I'm going to put up a quick demo here to whet your appetite for what we're going to talk about today. Very, very quickly, what you see here is essentially a risk calculation process. It's a tiny, tiny little version of a risk calculation process, basically consisting of a number of phases: data needs to be brought in, portfolio information needs to be brought in, those need to be put together; then you have some calculation processes that take portfolio data and evaluate the risk associated with those portfolios, and I'll talk a little bit later about why that matters. Now, is anyone here a quant, or something like that?
Anyone? No. So what you need to know about this is that these are very calculation-intensive processes. What you have in this example is basically Airflow; there are many other tools you can use to run a process workflow. But what I wanted to show you is an end-to-end process that reads data from an S3 bucket, launches a number of parallel processes (you see the pods associated with the calculations being started, completing, doing their job), and then the aggregation, the results being aggregated and written back out. Nothing surprising here; it's kind of a garden-variety data pipeline wrapped together in a workflow. We're going to talk later about how we got to it, but let me hand it over to Eric to walk you through the concepts of what we're talking about. Thanks, Marius. So what I want to talk about in the next few minutes is a modern interpretation of HPC, high-performance computing. For those that don't know, let's ground ourselves: what's our definition of HPC? Traditionally, HPC was supercomputers, Crays, stuff sitting in the Department of Defense, the Department of Energy, and so on and so forth. We're taking a broader definition of HPC: HPC is a cluster of computers that do large jobs in a distributed way. The question is how we distribute that workload, the job schedulers and such, but that's how we're interpreting HPC. And as I mentioned, HPC is really more than just a supercomputer. In our modern interpretation, where it moves from academia into commercialization, we need other aspects. Scalability: beyond just "I need a huge computer to run these huge problems," it's also about scaling down, because businesses have to pay for these things. How do we scale down to zero? How do we scale down low?
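The fan-out/fan-in shape of the demo pipeline (read inputs, run calculations in parallel, aggregate results) can be sketched in a few lines. The S3 read and the actual risk math are stubbed out here, and the function names are illustrative, not taken from the demo code; a real deployment would run each calculation as its own pod or Airflow task rather than a local thread.

```python
from concurrent.futures import ThreadPoolExecutor

def load_portfolios():
    # Stand-in for reading portfolio data from an S3 bucket.
    return [[100, 200], [50, 75], [300]]

def evaluate_risk(portfolio):
    # Stand-in for a calculation-intensive per-portfolio job.
    return sum(portfolio)

def run_pipeline():
    portfolios = load_portfolios()
    # Fan out: one worker per portfolio, like the parallel pods in the demo.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(evaluate_risk, portfolios))
    # Fan in: the aggregation step at the end of the DAG.
    return sum(results)

if __name__ == "__main__":
    print(run_pipeline())
```

The point is the structure, not the arithmetic: the ingest, parallel-compute, and aggregate phases map one-to-one onto the nodes of the workflow DAG shown in the demo.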
So we're not paying for compute when a cluster is idle. How do we have orchestration as code, so that we know what's happening, and version control these things instead of just a bunch of batch commands? We can visualize, as Marius showed, in a DAG, for example, all the pieces and which tasks have failed, because HPC jobs will run an enormous number of tasks. How do we scale down? How do we restart? Where are we in the process? Manageability and automation as code: towards the end of this, we'll show how we can do this with tools such as Ansible, standing up that cluster to make it as simple and as reproducible as possible. Data: data is a critical aspect of HPC. We have the compute; we also have the data. We'll talk about patterns later in the presentation, different ways we can make data accessible to the compute engines. And observability, tied together with orchestration: what happened, when it happened, audit logs, what failed, how do we restart it? So all five of those, to us, are important in a modern HPC system. One example of HPC, and there are many: computational fluid dynamics; there is seismic work for energy; and there's risk management in financial services. In risk management, what are we doing? We're trying to understand the risk to a bank. That could be risk because you borrowed some money, or because you've made a loan: what happens? So banks are always interested in understanding their risk and doing it in a quantitative way, which means they have to run a lot of different simulations to understand what happens if prices move this way, if interest rates move that way, if markets perform in a different way. To do all this, and I'm not going to dive deep into how you compute market risk, what's important to understand is that there are a lot of different parallel tasks that run. So people may be familiar with, for example, Monte Carlo simulations.
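As a toy illustration of the Monte Carlo risk idea: simulate many one-day portfolio returns and read off a loss quantile (a simple value-at-risk number). The normal-return model, the parameters, and the portfolio value below are all made up for illustration; real risk engines run far richer scenario models, spread across thousands of parallel tasks.

```python
import random

def simulate_var(n_sims=100_000, mu=0.0005, sigma=0.02,
                 portfolio=1_000_000, quantile=0.99, seed=42):
    """Toy Monte Carlo value-at-risk: loss exceeded in (1 - quantile)
    of simulated one-day scenarios. All parameters are illustrative."""
    rng = random.Random(seed)
    # Each simulation draws one daily return and scales it to P&L.
    pnl = sorted(portfolio * rng.gauss(mu, sigma) for _ in range(n_sims))
    # The 99% VaR sits at the 1st percentile of the sorted P&L.
    return -pnl[int((1 - quantile) * n_sims)]

var99 = simulate_var()
```

Note that each of the 100,000 draws is independent, which is exactly why this workload parallelizes so naturally onto an HPC cluster.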
You may run 100,000 different simulations with different inputs and then figure out the likely scenarios. Again, highly parallelizable, something very well suited to a high-performance computer. Agent-based modeling is another approach, where we can figure out: if this actor does this, then this actor will do that, and that actor may do this. So you're following a path through, again, a lot of scenarios, with high levels of parallelization. So risk is a good example of one type of HPC workload. Okay? Marius will take us through some of the compute patterns. Yeah, let's talk a little bit about the different ways in which these systems can be built. We're going to cover two: the patterns that exist for organizing the compute, and Eric will talk later about the patterns for organizing the data while we process it. These are the two most critical pieces; that's not to leave aside observability and the others, and we're going to cover those too. But we're focusing on these patterns right now because traditionally, and this is what we see on the screen right now, this is pretty much how everyone does it. Almost everyone, or I would say a large percentage of companies across different industries, does it this way today: traditional high-performance computing. Focus on getting the best hardware you can: the best CPUs, the fastest networks, the most performant storage; putting it together; creating these supercomputers, mainframes, high-performance clusters to run large-scale simulations. And this goes across different types of problems: scientific computations, financial industry risk modeling; it's also making its way into artificial intelligence and training large models. So what are the typical technologies for that?
The typical model is, as I said, a cluster of very performant machines connected with high-performance networks, sitting on premises, using scheduling frameworks like Slurm or LSF or Spectrum Symphony to schedule jobs on these clusters, and technologies like MPI, if you've heard of it, to run processes on each of these nodes. The paradigm is: I want multiple processes, I have them talk to each other, I have them churn through different pieces of data until they're done, and then the computation finishes. These environments make extensive use of low-level abstractions: if I have a very performant machine, I want to squeeze as much power as I can from that machine. They're also very dedicated: those types of environments essentially have exclusive control of the computing cluster. Very extensive, very performant, but also very, very dedicated. Why is it favored? Well, it's a very well understood and extremely popular pattern. It's something the different industries have adopted from scientific simulations, and for a lot of these problems, like the risk calculation Eric has described, it's easy to remodel them as these kinds of high-volume mathematical problems. Also, having an environment ready at your fingertips means I can launch a calculation immediately: I have my infrastructure there, just waiting for me to start things up. So the reduced lead time is actually a big advantage. What are the challenges? What's becoming a problem is, well, they're very performant computers; these are expensive resources.
They cost a lot of money, right? These environments are great when it comes to performance, but have challenges in a lot of other respects. If I optimize for a specific type of hardware, my applications have limited portability. If I want to increase the volume of computation, to increase the capacity, I actually have to buy a physical machine, add it into the environment, configure it, and do all that work. Agility: very often, things like Slurm, for example, rely on having copies of the same program running on identically configured environments. So if I want to upgrade something, whether it's the operating system or the application, I have to make sure that everything is aligned, or else I end up with an inconsistent cluster. So those are the typical challenges with these types of environments. Again, that's not to say they aren't very suitable for a specific type of problem; this is how things are done. But increasingly there is an interest in applying different types of technology to this problem, and in looking at other models for gaining more flexibility: virtualization, containerization, serverless. There are new emerging patterns in this space, and HPC practitioners are starting to look at other types of configurations. Cloud-native, right? The cloud-native model, with containerization and serverless, essentially takes a different approach. Instead of buying the most expensive machine I can get and putting it there, I would rather buy a larger set of cheaper commodity machines.
And when I talk about commodity machines, they can still be very large in terms of cores and memory and everything else — just not as big as your typical HPC setup. The focus here is on horizontal scaling and on having flexibility in the way applications are distributed across a set of compute nodes. Typical technologies are Docker, for example, for containerization — immutable copies of the application with all its dependencies that can run anywhere. Kubernetes, which is a container orchestrator that enables you to run these workloads in a flexible fashion. But also technologies like Knative that add serverless behavior to these deployments: scale to zero, the ability to scale up on demand and then come back down very quickly — you'll see an example. And other technologies let you control the way you scale. When you go into the cloud-native model, you also work with higher levels of abstraction. So what do you get — why would you take the hit? Because there is a hit: these environments are not as performant as an HPC environment, that's the reality. They're also sometimes not very suitable for certain types of computations. Kubernetes, for example, is notorious for requiring customized schedulers for running these types of batch jobs, and there's a lot of research into how to make that better. But there are benefits too. You get better utilization.
Instead of having a dedicated cluster and figuring out how to schedule different types of jobs onto it, with the flexibility of container orchestration you can decide, for example, to allocate capacity better or run things in parallel with better control. You also have better resilience: the container orchestrator offers you something like self-healing. Something like Slurm, for example, doesn't necessarily offer you self-healing — a job starts, it will tell you if it failed, but it won't restart anything for you. So you get a somewhat different operational model. In summary, while these types of environments are not quite as performant, they offer a number of benefits when it comes to utilization, resilience, and elasticity — it's much easier to scale horizontally — plus efficiency, maintenance, and cost optimization. The challenges, obviously, are the reduced performance and the higher cost per unit of compute if you're buying compute for a commodity environment. Of course, this doesn't exclude hybrid approaches. Things like MPI processes running as pods in Kubernetes are a common strategy. Slurm can run containerized jobs. Public cloud providers can offer preconfigured HPC environments with all those features so you can apply the traditional pattern there. But in summary, containerization and serverless offer an alternative that provides a number of qualities — utilization, elasticity, and so on — that for business environments could be more critical than performance alone.
So when you're thinking of going one way or the other, you need to understand which of those qualities matters the most to you — that's the part that drives the decision. And what you saw in the demo at the beginning was exactly an example of a containerized job: a container starting up and running the job. We're going to see the serverless version of it in a moment. [Eric] Thanks, Marius. I'm going to spend a few minutes just talking about data patterns. As I touched upon before, data is one of the five pillars of a modern HPC system. The first pattern is the traditional approach: keep data in place. We have the compute right next to the data; we load the data up and the HPC is right there. We have locality of data, which is great for performance. And since the data is kept in place, there are no copies. When you have copies, you sometimes have differences, failures, the overhead of moving data around. Here, it's all in place. You also have a reduced risk of regulatory issues. What I mean by that — at least in financial services, and presumably the same for healthcare — is that when you move data between geographies, there will be restrictions around privacy data: name, social security number, things like that. In Switzerland, for example, you're not allowed to move a customer's data outside of Switzerland unless it's encrypted. So if you're keeping it in place, you reduce the risk of regulatory issues. The challenge, though, is that you need the data next to the compute. That's not always possible.
If you want to scale out — say you have your transactional systems in-house, an order management system or an MRI system, but you want to do the compute someplace else — this is a challenging model. You have to have the HPC compute next to where your data is produced. And if you're using the same data set, you run the risk that analytics load could affect the performance of your operational systems. Sure, they're reads, but some systems don't have enough spare capacity to support those reads while they're processing other transactions. The second pattern is replication: move the data. In this scenario, we're presuming the compute is separate from the producer — our HPC is someplace else, so we need to move the data. Data is replicated, and there are different approaches: change data capture, storage-vendor replication such as Dell EMC's, and obviously many of the SQL vendors have replication as well. A lot of different ways to move data — sometimes files, sometimes SQL, sometimes images, what have you. What's nice about this approach is that you have the ability to do transformations, obfuscation, tokenization. So if you're dealing with privacy data, you can take out the PII if you don't need it, or if you do still need it, you can tokenize it, and when you bring the results back, you can de-tokenize them. It also offers an opportunity to optimize the data for analytics: the shape of your data in your golden-source system may not match what you need for analytics, so this gives you the opportunity to do those transformations. And you then have the data close to the compute, so it's faster, and so on. There is regulatory risk, though. You're moving data, so you've got to make sure you have policies in place that allow you to move that data set to where you're taking it.
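The tokenize-on-the-way-out, de-tokenize-on-the-way-back idea can be sketched like this. A hypothetical in-memory vault stands in for a real tokenization service, and the field names and values are invented for illustration:

```python
import uuid

class Tokenizer:
    """Toy tokenization vault: swaps PII values for opaque tokens
    before a record leaves the regulated environment, and maps them
    back when results return."""

    def __init__(self):
        self._vault = {}

    def tokenize(self, record, pii_fields):
        out = dict(record)
        for field in pii_fields:
            if field in out:
                token = "tok-" + uuid.uuid4().hex
                self._vault[token] = out[field]
                out[field] = token
        return out

    def detokenize(self, record):
        # Non-token values pass through unchanged.
        return {k: self._vault.get(v, v) for k, v in record.items()}

tk = Tokenizer()
src = {"name": "Anna Muster", "ssn": "756.1234.5678.97", "exposure": 1.5e6}
replica = tk.tokenize(src, pii_fields=["name", "ssn"])  # safe to replicate out
restored = tk.detokenize(replica)                        # back on-prem
```

The point of the pattern is that the vault (and therefore the mapping back to real identities) never leaves the regulated geography — only the tokenized replica does.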
Obviously, duplicating data means additional storage needs and cost. If you're dealing with Amazon or Azure or Google, they charge you ingress and egress costs for data — you've got to pay for that. And there's the resilience and orchestration of the replication itself: one more thing to run, one more thing that can fail. Then there's a third pattern, kind of somewhere in the middle. I'll be honest, the first two are much more common; the third one has its place. In this option, the data stays in place and you access it remotely. If your data set is small, it makes complete sense. This is a great pattern for an MVP, for example: you're getting started, you don't want to deal with moving data and copying data, so you just open up a VPN and grab your data from its home source. It gives you benefits of both patterns one and two, but it doesn't scale well, to be honest. With large data there's cost, there's latency, and so on. So, enough slides — you really suffered through the slides; now we're cooking. [Marius] Just to add to what Eric was saying, one thing that we find in these industries is that people rarely design for cost optimization. The last thing people think about is the cost of the network — what the data transfer actually costs. It's great to have this online access to everything; it's just that sometimes a lot of cost comes out of that. Okay, so going back to my previous example, let's dive a little deeper into it, and I'll talk about what's happening here. What you see here is a codified process.
I have a codified process: some logic, some steps that are taken to do something that looks like a risk run. It's a very small, garden-variety example — a number of steps capturing some of the things that Eric and I were talking about earlier. For example: moving data from an S3 bucket; getting it ready and making it local to the process by populating the cache; extracting it, slicing and dicing it for the scaled-out processes that will work on it; kicking off a number of pods that do that processing; and aggregating the results. That's the codified logic. For the purpose of this demo we're using Argo, because I think it's a great project, it's a great illustration of how things work, it's popular in the industry, and it captures a lot of the qualities we're talking about. There are other similar frameworks that do similar things. What we wanted in the first place is essentially this aggregating framework for creating graphs of tasks that perform various steps, coordinating them, managing them as one aggregated process, and adding things like, as we'll see later, logs and observability around them. But what you should really take away is the concept and the principles — don't get too bogged down in the details of this particular implementation. What we wanted to show is that I can take data from the system, I can start a number of pods, I can have them do the work, I can aggregate the results, and be done with it.
These pods essentially contain the computation logic — something that Eric has written, just a simple application — and they bundle up everything needed to run the computation, so it can run flexibly on my OpenShift cluster. This is exactly what happens when I launch it. If I come back to the console, I can see the pods being started, the containers being created, the jobs running and doing the work as we saw earlier, completing, and then shutting down and going away. This captures a few of the things we talked about on the containerization and serverless path. You see the encapsulation of all the code into one deployable unit — the container — which has everything it needs. You see the flexible scheduling: the scheduler creating the pods, placing them on the appropriate nodes, and using all the capacity. You can see some information about how they ran. I can go into the UI itself and get information about the logs, for example — I can see the log for a specific task, a lot of information that helps me debug if things have failed, as happened here at the beginning. And the code for this — what you saw is the code, but it's actually coming from Git. All these pipelines are checked in and deployed automatically, and you can see that, as with every good demo, the latest change was two hours ago.
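The shape of that pipeline — fetch from the bucket, populate the cache, split the data, process the pieces, aggregate — can be sketched as a plain sequence of steps. The function names here are invented, and the demo itself uses Argo tasks and pods rather than Python functions; this is just the skeleton of the process:

```python
def fetch_from_bucket():
    # Stand-in for pulling the input data set from object storage.
    return list(range(100))

def populate_cache(data):
    # Make the data local to the workers before the fan-out step.
    return {"cached": data}

def split(cache, n_parts):
    # Slice and dice the data for the scaled-out processing tasks.
    data = cache["cached"]
    size = (len(data) + n_parts - 1) // n_parts
    return [data[i:i + size] for i in range(0, len(data), size)]

def process(part):
    # Each "pod" computes its partial result.
    return sum(part)

def aggregate(partials):
    # Final step: combine the pods' partial results.
    return sum(partials)

def run_pipeline(n_parts=5):
    cache = populate_cache(fetch_from_bucket())
    parts = split(cache, n_parts)
    return aggregate(process(p) for p in parts)
```

In the real pipeline, a workflow engine like Argo owns the edges of this graph — it decides when each step runs and records logs per task — while the steps themselves live in containers.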
So we're really running from the pipeline. But it also gives you an idea of how agile this process can be. One of the challenges in the traditional HPC space, for example, is that you have to build the application, compile it, and run it — it's not that easy to simulate what's going to happen in a real production run. In this case, with these containerized jobs, I can pick them up and run them against, say, a very small data set, iterate very quickly until I'm happy, and then promote the job to staging and the real run. That's another quality: everything iterates much, much faster. Okay, so we've been talking about this model, and it's actually still very close to the traditional model, in the sense that, in order to get here, I have to think up front: how am I going to partition my jobs? How much speedup do I want — how much do I want to accelerate my process? If I have a thousand records and it takes me one hour, then dividing into five tasks is maybe going to get me done in 12 minutes. I can parallelize things like that. But how about if I get flexible and let the platform decide how we scale? I want the platform to control that scaling knob: go as wide as you can, and also scale down to zero if needed. And maybe reuse that same code — this application, for example — for serving individual requests. That's where the other flavor, which I'm going to show you right now, comes into play. So we have the serverless model. And the serverless model is different.
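Before looking at the serverless flavor, the static-partitioning arithmetic mentioned a moment ago can be written down directly. This is the idealized estimate only — it deliberately ignores scheduling and data-movement overhead, which a real estimate must not:

```python
def ideal_runtime_minutes(total_minutes, n_tasks):
    # Perfectly parallel, evenly divided work: total time divided by
    # the number of tasks. 60 minutes across 5 tasks -> 12 minutes.
    return total_minutes / n_tasks
```

The catch the talk keeps pointing at: choosing `n_tasks` is a design decision the programmer makes up front in this model, whereas serverless delegates that choice to the platform.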
It's almost the same — it follows exactly the same steps: it reads the data, populates the cache, does everything it did before. But instead of launching the pods itself and controlling how the tasks are launched, it splits the data set into a bunch of small individual requests and delegates the processing of each request to a serverless service. And the serverless service lives right here. As you see, right now there's nothing running — it's scaled to zero. When I run this, it's going to do exactly the same steps. But now, as you see, the service is starting up, and due to the settings I have, it's going to start a number of instances to actually handle all those requests, and it will handle them in parallel. And when the process is done, it's going to start scaling down again, and you see the deployment going back down to zero. So this is another way of doing it — another way of harnessing the power of the platform. Now, one thing you're going to ask — and if you don't, I'm going to ask the question myself — is: Marius, you showed us one way of starting a number of pods and going back to zero, and now you're showing another way of starting a number of pods and going back to zero. What's the difference? The difference is that in the first case I had to make a conscious judgment up front about how many of those pods I want and how I'm distributing the data — I have to design it that way. The serverless model allows me to delegate that to the platform, and it allows the platform to be more aggressive in scaling up: if I know I'm going to start five pods, I have to think up front and start five pods.
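The split-into-requests idea can be sketched with a thread pool standing in for the autoscaled service: instead of deciding which worker processes which chunk, the driver fires one small request per item and lets the "platform" (here, the executor) decide the concurrency. The names and the toy per-item calculation are illustrative only:

```python
from concurrent.futures import ThreadPoolExecutor

def handle_request(item):
    # One small, independent request -- what a Knative-style service
    # instance would receive. Here: "price" a single portfolio item.
    return item * item

def run_serverless_style(dataset, max_concurrency=8):
    # The driver does not partition the data set itself; it submits
    # one request per item and lets the executor scale the work up
    # to its configured concurrency.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        results = list(pool.map(handle_request, dataset))
    # The pool shuts down here -- the analogue of scaling back to zero.
    return sum(results)
```

Notice the contrast with the static model: there is no chunk-size arithmetic at all, and the same `handle_request` function could equally serve a single ad-hoc request, which is exactly the code-reuse point made in the Q&A later.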
Maybe the serverless platform starts ten and finishes quicker, because it has the resources to do so. So serverless is actually a very good technique when I want to burst. Yes, I'm going to momentarily grab more resources, and everybody is going to hate me for it because I'm taking resources from the entire cluster, but that only happens for a very short amount of time, and when I'm done, I've completed my computation faster. Again, coming back to the HPC problem: this is particularly suitable for computations where I have individual items and can break the work down like this. It's maybe not as suitable for something like a large matrix computation. But business computations are very often in this category — the risk calculation in this example is a high-volume computation of exactly that kind. Now, that covers the demo part of our talk, and before we go to questions, I want to point out a couple more things. You've seen the process, you've seen the scaling, you've seen the model for processing data. We touched on infrastructure as code and how it goes hand in hand with the flexibility of running this way. If I'm running things on a set of dedicated machines that sit there all the time, all I have to do is keep them maintained and make sure they're up and working. If I want more elasticity — if I want to be able to spin up entire clusters on demand — I have to craft these environments, and I have to produce a lot of automation around the cluster. This is the domain of things like Ansible. How many of you are familiar with Ansible?
Now, the important part is not only to use automation; it's to complement it with DevOps practice — having that code checked in, available, repeatable. That's key to the process. It's not unique to HPC, but if you're planning to set up a large environment, you want to have it. It's important. Also — and there's a lot happening on this slide, some of which has nothing to do with our conversation today — I wanted to show that this concept of infrastructure as code, putting everything in code and automating it, translates very well into the way you work with data. For example: where does our data come from in our demo? In this case, from a NooBaa bucket. Everything is set up using resources that are checked in and repeatable. So if I want to recreate this environment somewhere else, I can. The second thing, which I think is also pretty important: the pipeline itself — the process that we used — is also kept in source control. All my data pipelines exist as resources that let me figure out exactly where data came from and what it means. That's another important point. Applying this automation, this notion that everything is code, combined with versioning and with abstracting away the implementation — that's key. So, what's the conclusion? Well, everything we said so far: HPC is about more than performance. You saw how containers and serverless can help you set up scalable jobs and control the elasticity of the process. You've seen orchestration as code and automation as code, and how you can set up the cluster — though standing one up from scratch would probably take more time than we had for this talk.
So we cheated a little on that — if you liked the presentation, you'll give us a pass. And we talked a little bit about data and how data moves. Again, we can't go into all the details, but you've seen a process that extracts the data and distributes it. And we talked a little bit about the flexibility of the architecture. If you take one thing from this whole presentation, it's this notion that you have to think a little bit outside the box — beyond the raw-performance element — about the other qualities too. So with that being said, I'm going to turn it over to questions. Eric, anything to add? No. [Audience] Hi, this was a great talk — I learned quite a few things. I myself am working on cloud and multi-cloud, so this is very much of interest to what I'm doing. If I understand the message of the talk, there are two ways of doing HPC. One is the Slurm way of doing it — high-performance networking and parallel jobs — and that can be done in the cloud too. I wanted to understand how this is different from running a Spark cluster. [Marius] Actually, if you squint a little bit, Spark is one of the ways — you can have Spark running on the cluster with a job in it, and Spark is usually one of the alternatives to Slurm. Now, when it comes to the graph, usually the way it's done is that launching the Spark job is one of the components, one of the steps in the graph. So there's a larger process: you still have to bring the data to Spark, you still have to coordinate the other things that need to be set up. [Audience] So you don't have to do that in this case? You don't have to bring the data to a bucket? You still have to do that here as well, right? [Marius] We didn't cover the part of how the data gets into that bucket.
Again, there's only so much time we have, but you're absolutely right — there would be a separate process that does that. [Audience] I have more questions, but I'll let others ask and come back. [Marius] Feel free to reach out to us after the talk; we'd be more than happy to have a longer conversation. [Audience] I was on the very fringes of HPC at my last gig before I came to Red Hat, but now that I'm at Red Hat we're working on a system called Validated Patterns, which is oriented around building repeatable things using GitOps and a lot of the technologies that you showed here. Generally speaking, for Validated Patterns we're primarily Argo CD focused as opposed to Ansible, but our most recent pattern also includes some very important Ansible elements, and I wondered if you'd be open to further conversations about how to expand this. [Marius] Absolutely. [Audience] I'll put on my sales hat and say: how can we help each other move faster? That's what we're all about. I'm from ecosystem engineering — we can talk offline. [Marius] Fantastic. Just to comment on what you said: if you look at something like this again, Argo CD would be somewhere in between here, and Ansible is one of the things that you can run. Definitely Argo plays a role in this, and absolutely, we'll have that conversation. [Audience] Sounds great. Thank you. [Audience] At some point you made the statement that when people design these systems, they don't think about the cost of the network, and I feel like you didn't finish that point. Where were you going with it?
[Marius] I was waiting for that question — it never came. Let me come back to what I was saying, picking up on Eric's point about remote access and data. Let me give you an example. There is a tendency to think of things like, say, disaster recovery — people think of using replication, some sort of streaming of data, a continuous replication process between different clouds and different environments. What Eric was describing is this: if I keep the data local, to give a concrete example, I copy it once — I'm copying, say, 10 megabytes of data and I'm done. If I keep it remote and connect over remote access, I might access that data ten times, so I end up moving ten times as much data as I needed to. So not only am I incurring the risk of pushing a lot of data through the pipeline, I'm also ending up with a large networking bill. [Eric] Not to mention everything that can go wrong along the way. [Marius] Exactly — and we're just focusing on the economics of it, not even on what can go wrong. [Audience] So when you say cost, you're talking about real money, not performance? [Marius] I'm talking about real money. I'm talking about multi-cloud environments and failing to design for cost. Cloud providers love to keep your data in — it's very easy to look at the cost of compute and the cost of storing the data, "I've got to run this long, I've got a box this big," and forget that moving that data out is where a company really gets surprised by the bill. [Audience] So isn't that defeating the idea of containerization in the cloud? Aren't you in fact encouraging more copying? Or did I miss something?
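The 10-megabyte example above can be turned into a back-of-the-envelope cost model. The per-gigabyte price here is a placeholder, not any provider's actual rate, and the model ignores everything except egress volume:

```python
def egress_cost(dataset_gb, accesses, price_per_gb=0.09):
    # Remote access moves the data once per access; a one-time
    # replica moves it exactly once. price_per_gb is a made-up rate.
    remote = dataset_gb * accesses * price_per_gb
    copy_once = dataset_gb * price_per_gb
    return remote, copy_once

# Reading the same 10 MB (~0.01 GB) ten times costs ten times the
# one-shot copy -- the point being made about remote access at scale.
remote, copy_once = egress_cost(dataset_gb=0.01, accesses=10)
```

The same function also shows when pattern three (data stays in place, accessed remotely) is fine: for a small data set with few accesses, both numbers are negligible, which is why it works for an MVP.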
[Eric] It depends — it's very nuanced. For example, people often ask for real-time replication when a traditional batch approach would do. Say you've got your order management on-prem and your analytics in the cloud. You might update a record a hundred times during the day, but if you only run your analytics every 15 minutes, you don't need to stream every single change over — you can batch the changes and bring them over, except for certain use cases. [Audience] So the point is that this cost is a real cost, and when people configure their job they have to keep it in mind as a parameter while designing it? [Marius] Yes, exactly. [Audience] And one last question, about the functional service. You're saying that I do not have to worry about how to partition my job — the platform takes care of it, spins things up, and takes whatever resources are there. That's why the Spark thing came to my mind: Spark does that too — if you don't give your application resource limits, it will go and take up everything that's available. So the question is: is there some assumption that these functional-service jobs are not very long-lived? And if they are long-lived, what happens when the next job comes along and doesn't get a chance to run on the platform? How does it manage resources — my job has taken over everything, a second job comes in, what happens? Is it fair-share, or do I get evicted? Does the platform say: a new job has come in, I'll give half to him and half to you, or I'll make you wait and give everything to him? [Marius] The example that you saw with serverless is not preemptive. The service just handles individual requests. In my case, my calculation consists of sending those requests to the service, but at the same time Eric, who wants to test a specific portfolio, can make an individual request to the same
serverless service and find out how one individual portfolio fares. [Audience] And how does it fare if a new job comes in? [Marius] I guess the question you're asking is: if I want to run multiple jobs, how do I organize myself? You can configure that. [Audience] How does the platform manage it? Two different questions, actually. For the first submission, you mentioned that if I don't tell it how to partition, the platform takes care of it, and if it finds enough resources, it's faster than what I would have configured myself. [Marius] Let me clarify this, because I want it to be clear. If you start a new job while you're using the same service for the calculation, those will just be new requests coming in, so they will be handled as they arrive, and the service may scale up even more if it has the resources. So it accommodates the second submission. [Audience] I think my question is: what defines the capacity of resources? [Marius] It's configured in the platform. The platform is pre-configured to a certain capacity — say, a hundred machines — so the functional service always has access to those hundred machines. [Audience] And at any given point in time, does it run more than one job? [Marius] The functions are independent requests running concurrently. But you can set up parameters for concurrency — how much runs in parallel — and you use the platform to configure those parameters, and tune them further if you need to. And to finish up on what you were saying: if you're interested in that type of behavior — if you want to take a more batch-focused approach, which is perfectly fine — what we showed you is not the only option. We were trying to do something a bit different and interesting, but if you are looking at
preemptive jobs, there are schedulers in Kubernetes, for example, that allow preemption, so you can say: my job takes priority — you are important, Marius is important, but this one is even more important. [Audience] My question wasn't really about preemption. I'm not worried about whether my job gets preempted; I'm only concerned about whether my job will take longer. My example was more like this: I submit right now, the platform has a hundred nodes, and it runs on all hundred. The next job comes in, and now mine is reduced to fifty nodes, which means it takes double the time. Do I care about that or not? That's where I was coming from. [Marius] I understand — and I know we're going back to Slurm now, with batch queues and all the important things we'd have to cover there. The idea, basically, was that these are two options for how people can think beyond Slurm: you can use the functional service, or you can run pods on your own, and you can think about how to package your job to use the cloud. At some point you do need some sort of capacity planning — we can take that as a follow-up conversation. You will not be able to run two crazily scaled-out jobs side by side without thinking about it. These types of highly scaling workflows are designed to run at particular times — for example, at night: during the day my cluster is handling all the transactions and running my user applications, but at night I scale those down and allocate that capacity for the burst of work. In any company, it's a longer conversation. There's no bad question — we didn't cover it, but you're right, there's an entire economic aspect, chargeback and cross-charging. Let's follow up on that. Thank you, thank you very much. [Eric] Another fallback to the default state would have been both embarrassing and very, very hard.