Ceph principal architect at Red Hat, and Yaarit Hatuka, a software developer working with Ceph. They will be talking about adding SMART disk failure prediction to Ceph. Take it away.

All right. Hello everyone. My name is Sage Weil. Hi, I'm Yaarit. Okay, so today we're going to talk about adding SMART failure prediction to Ceph. We'll start with a little bit of background: what Ceph is, and why we're adding SMART failure prediction. We'll talk about the journey we took to arrive at the architecture that we did, the current status of the project, and some next steps, and then we'll talk a little bit about Outreachy, the internship program that led to this whole project.

Maybe one of the most exciting things about this project was how it brought together the open source communities along with the Outreachy internship: the Ceph open source community, of course, and we gained a new industry participant, ProphetStor, which is really excited about this project. And this is how open source does its magic, which is not trivial.

So probably all of you here know what Ceph is, but just a quick recap: Ceph is an object, block, and file storage system, all in a single cluster.
It is designed so all components scale horizontally, with no single point of failure. It is software-defined, so it can run on commodity hardware, and it is self-managing wherever possible, which is really relevant for this project, because we wanted to make it even more self-healing. And maybe, again, the most important part is that it is completely free and open source. You can use it, you can see all of the code, which is awesome.

So the concept was to teach Ceph how to collect health metrics from its devices and then pass them to a pre-trained model that can predict whether a device is about to fail, or when it's about to fail. It was important for us to keep the design modular, so we can either use an in-the-box model, or send all the collected data to a service in the cloud, let it make the prediction, and then get the prediction back to the cluster. And then, with this estimate of whether, or when, the device is about to fail, we want to teach Ceph how to respond to an imminent failure before it actually happens, which makes the data even safer.

That brings us to reliability. It's a hard fact, but devices eventually will fail. Has anybody here ever experienced data loss? Terrible losses of photos, documents. This still hurts. On a personal level, it's not very fun to lose data, and for businesses
it can be devastating. So we all use redundancy in order to avoid data loss: we replicate the data, or we use erasure coding. It's playing the numbers game, knowing how much we can invest; it can be really expensive to keep lots of replicas of the data. Whenever a device fails, the statistics of the redundancy change: the redundancy gets worse until we restore the replica count, and a window of elevated risk of data loss opens. Larger systems mean that we have more devices, and that means we eventually have more failures. There's a saying that, at scale, failure becomes the norm rather than the exception. Again, a hard truth, but it's a fact.

So if we can predict when a failure is about to happen, we can act ahead of time and preemptively recover that device, which makes the cluster much more reliable.

And that brings us to the other part, which is performance. It's natural that cluster usage has a certain pattern of peak and off-peak periods during the day. In case of a failure, we have to respond with recovery, and its priority is very high, so it might happen during a peak hour, which is not ideal whatsoever. If we preemptively recover, we can schedule it for an off-peak time, which is fantastic. And with Ceph in particular, recovery can have a significant impact on performance. Sage, do you want to say a couple of words about that?
Yeah. I think in any system you're going to have an extra cost that you pay in order to do the recovery. It's a problem that we've struggled with in Ceph, making that as small as possible, but it's still something that you ideally want to do during off hours if you can.

So when we started this project, we were sort of a blank slate. Our goal was just to make everything work in Ceph out of the box: when you install Ceph as a standard user, it would gather the metrics, do the prediction, do everything, without having to install extra tools or external dependencies. So we would build in the data collection, and we would build a simple prediction model. The expectation was that we would start with something really simple, it would be open source, and we would hope that the community would develop something better over time. But it was sort of a monolithic approach to the problem.

About halfway through the project, a company came along called ProphetStor that specializes in AI-enabled data center operations, and in particular in disk failure prediction. They had a SaaS-based product that collects device metrics from customers and runs a very sophisticated prediction model, with something like 97% claimed accuracy, plus a very sophisticated dashboard. They were very interested in integrating with Ceph, so that Ceph users could take advantage of their service and reap the rewards, and they wanted to work with the community to figure out how to integrate this with us. Their goal, basically, was to provide a free service to any Ceph user, so that users could share their data with ProphetStor and get predictions back; or, if they become a paying customer, for example because they have a large cluster and want more accurate predictions, they could use that fee-for-service as well. They wanted to have both an on-premise option that would run in your data center, and the ability to use their
SaaS service. And so, as this conversation evolved with ProphetStor, we realized that two completely different models and ways of approaching it were colliding, and we needed to figure out how to approach the problem. What we eventually ended up with was a modular approach that allows you to swap out different components of the overall pipeline, to accommodate both models: self-contained, and using a commercial service.

It basically breaks down into three pieces. You have to collect the metrics from the devices; you have to run some prediction on them; and then, once you know what the life expectancy of the devices is, you want to make Ceph automatically respond and take some mitigation action. So we built, or want to build, all of these components into Ceph so that the whole thing can work out of the box: it can do the collection, it can do a prediction, and it can do the mitigation. Or, if you want to use external tools, because you're already scraping device metrics using some other infrastructure, or if you want to use an external service like ProphetStor's, you can do that as well. By breaking it down into these three phases, we allow you to swap in different components and use it in whatever way makes sense.

So the first part of this is gathering device metrics. What is SMART?
SMART stands for Self-Monitoring, Analysis and Reporting Technology. It started as an attempt to give access to the device's health parameters, and it supplies a very simple prediction model: whether the device's health is okay or not. It defines several attributes, for example the device's temperature, the number of hours the device has been powered on, the number of bad sectors, and so on, and each manufacturer defines their own thresholds for these attributes. If an attribute crosses a certain threshold, the device is considered to be failing. It's a nice intention, but in practice it doesn't work very well: many devices will fail without crossing any threshold, or a device can be expected to fail but SMART's simple prediction won't show it. So we decided to collect the health metrics ourselves and analyze them.

But that came with a few challenges. For example, the SATA, SAS, and NVMe interfaces present different health metrics, and then there was the issue of the vendors again: they don't have to include all of the SMART attributes. So if you buy a Samsung device, it will not necessarily have the exact same attributes as an IBM device, and that puts extra overhead on normalizing the data. They also use different scales: one vendor will decide on a scale from zero to a hundred, and another one will be set on a scale from zero to two hundred fifty-five. So it's not really standardized.

We used smartmontools, specifically smartctl, to collect the data, but if you've had the chance to work with it, you know that the output is not ideal for machines. You can see how it looks. First of all, it's really important to say that the smartmontools community is awesome. They are super robust, and they support all the devices out there.
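The scale mismatch just described, with one vendor normalizing to 0..100 and another to 0..255, has to be smoothed over before the values reach a model. Here is a minimal rescaling sketch; the per-vendor scale bounds are illustrative assumptions, not real vendor data:

```python
# Sketch: normalizing vendor-specific SMART value scales to [0.0, 1.0].
# The scale bounds per vendor are illustrative assumptions, not real vendor data.

VENDOR_SCALES = {
    "vendor_a": (0, 100),   # reports normalized values on a 0..100 scale
    "vendor_b": (0, 255),   # reports normalized values on a 0..255 scale
}

def normalize_attr(vendor: str, raw_value: float) -> float:
    """Rescale a vendor-specific normalized SMART value to [0, 1]."""
    lo, hi = VENDOR_SCALES[vendor]
    return (raw_value - lo) / (hi - lo)

# The same relative health level looks very different before normalization:
print(normalize_attr("vendor_a", 50))    # 0.5
print(normalize_attr("vendor_b", 255))   # 1.0
```

In a real pipeline the bounds would come from a per-vendor, per-attribute table, since even within one vendor different attributes can use different ranges.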
They react quickly; they're doing a really great job. However, the output is aimed at humans. You can see there are all sorts of pretty tables here, but if you want to take this data and feed it to some sort of model, it's pretty challenging. There are all sorts of wrappers out there that will take this output, process it, and emit some sort of JSON, but the thing is that they don't always handle all the cases. What happens if there's a new device out there? It's not ideal. So we decided to do the right thing (this was, of course, Sage's idea) and contact smartmontools about a built-in JSON output format.

So we prepared a patch. That was part of Outreachy's application process, which is amazing in that it just throws you in the water and says: okay, get your hands dirty and make a patch for that specific project. We contacted the maintainer of smartmontools, Christian Franke, who did a great job responding and helping us, and we submitted a patch. As you can guess, this was a long-standing feature request for smartmontools, for smartctl specifically, and the good news is that it's expected to be released by the end of the year, which is just in time for the upcoming Ceph Nautilus release in February 2019.

The second piece of this was that, the way Ceph had been designed and built up until now, it can run on any hardware, but it really didn't deal with the details of the underlying devices. It would pay attention to where all the OSDs are and what hosts there are, but it didn't have the internal tracking to map those to physical devices. So we had to add a bunch of infrastructure into Ceph to maintain that metadata and allow us to store it. The first challenge was figuring out how to actually identify a physical device, and it turns out that vendor/model/serial is a somewhat standard way to do that; it's what the udev and blkid libraries use, so we adopted that. And so we added this
additional tracking into the Ceph cluster manager, so that all the daemons report which devices they're mapped to. We have a many-to-many mapping between devices and daemons, so we know which devices depend on which, and so on. And we added the ability to store a life expectancy property with those devices, so that once we had a prediction, we could tell the cluster what the life expectancy was, and it could respond as a result.

Then we adapted the initial prototypes from the beginning of the project to add a new module to Ceph. It would, first of all, implement a command on the object storage daemons, which are actually storing the data, to scrape the SMART metrics with smartctl and pass them back to the central cluster. There was a background operation that would scrape those on a daily basis and store them in a RADOS pool, so that we had a recent history of all these metrics for all the devices, and we had a self-contained metrics scraping and collection framework.

The question that we frequently got about this project from other people is: why don't we rely on other tools?
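The scraping flow described above is much easier once smartctl can emit JSON. The following sketch parses an abbreviated, illustrative sample in the spirit of smartctl's JSON output and builds a model/serial based device id along the lines of the vendor/model/serial scheme mentioned above; the exact JSON layout and id format here are simplifications, so check the smartctl and Ceph documentation for the real shapes:

```python
import json

# Abbreviated, illustrative sample in the spirit of `smartctl --json` output;
# real output has many more fields and varies by device type.
SAMPLE = """
{
  "model_name": "ACME DiskMaster 4000",
  "serial_number": "Z1X2C3V4",
  "power_on_time": {"hours": 18341},
  "temperature": {"current": 34},
  "ata_smart_attributes": {
    "table": [
      {"id": 5,   "name": "Reallocated_Sector_Ct", "value": 100, "thresh": 10, "raw": {"value": 0}},
      {"id": 197, "name": "Current_Pending_Sector", "value": 100, "thresh": 0,  "raw": {"value": 2}}
    ]
  }
}
"""

def device_id(info: dict) -> str:
    # Combine model and serial into one identifier; spaces become underscores.
    return f"{info['model_name']}_{info['serial_number']}".replace(" ", "_")

def attr_map(info: dict) -> dict:
    # Flatten the attribute table into {name: raw_value} for a prediction model.
    table = info.get("ata_smart_attributes", {}).get("table", [])
    return {row["name"]: row["raw"]["value"] for row in table}

info = json.loads(SAMPLE)
print(device_id(info))                            # ACME_DiskMaster_4000_Z1X2C3V4
print(attr_map(info)["Current_Pending_Sector"])   # 2
```

Compared with scraping the human-readable tables, this kind of structured access is what made the built-in JSON output worth pushing upstream.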
There are, you know, Prometheus exporters, for example, that scrape metrics, and there are all sorts of external tools to do this. The balance we're trying to strike is to make sure that every Ceph user can have something that works out of the box, without having to install extra, separate stuff. But because we adopted this modular approach, we still leave the door open: if you already have an external scraping framework, you can use that, have some other infrastructure doing the scraping and prediction, and it can still feed back into Ceph to tell it what the life expectancy of those devices is. Then the same automated management logic in Ceph can say, "I know this device is going to fail," and do the right thing as a result.

Which brings us to the second phase of the architecture, which is the failure prediction. Today we have two approaches, two ways to address this problem. ProphetStor contributed a pre-trained prediction model to the Ceph upstream open source project. It's a bunch of scikit-learn model files or something; I actually don't understand the data science at all. It's a comparatively simple model, but it works, and it runs inside the Ceph manager daemon, so you can have an out-of-the-box cluster analyzing the metrics and making predictions based on them. There's also the ability to enable a feature where Ceph will call out to an external SaaS API, some service hosted either in your data center or in the cloud: it will feed the metrics to the external service, get a prediction result back, and then store that in the cluster. So you have both the local model and the external cloud model. Again, ProphetStor's goal is to have a free service that you can use, or you can pay them to get their more accurate predictions. And because we built this around a SaaS API, there's an opportunity for other people to implement that same API, or
one similar to it, so that ProphetStor wouldn't be your only choice; you could implement your own external prediction service as well.

The natural question to ask is: how can we build a better model? We have the initial model that ProphetStor donated, which is relatively simple, as I said, but the goal in all of this is to build the most accurate prediction model we can, so that we can have the best data reliability we can. There are two key pieces to this. The first is that you really need disk failure data. A lot of academic papers have been published about disk failures and predicting them, and they tend to rely on private data sets that the researchers got from Yahoo or Google or whoever, from their big data centers. They do their analysis and they publish a paper, but the data isn't public. Backblaze is a cloud backup company that is very generous in that they publish all of their failure data, and they have a huge fleet of hard drives, so that's really the only public data set out there. The challenge with both of these is that the breadth of device models is limited to what those particular cloud vendors, or Backblaze, happen to buy. If you look at the enterprise world, where you have companies like EMC and NetApp deploying devices, they are of course gathering all the metrics for the devices they deploy for their customers, but again that data set is private. So although those particular vendors might have failure prediction built in, there's nothing for the rest of us. The bottom line is that we need more failure data in order to build a better model.

The other interesting thing is that there's an opportunity to use metrics that aren't necessarily from the device itself to enhance the quality of the prediction. ProphetStor's model, for example, looks not just at the device
metrics that they get from SMART, but also at things like the server load, the network traffic, how many processes are on the system, all this other stuff they scrape about the cluster and the systems that are actually consuming the devices, and they use that to generate a more accurate prediction of when things are going to fail. So there's a question of which metrics are the important ones, and whether there are other opportunities we haven't thought about.

This led us to the concept that what we really want is an open, public data set of disk failure data, so that researchers and the open source community can build a more accurate prediction model. The question is what we can do to help make this happen. The concept we came up with is a SaaS-like service, not unlike what ProphetStor is doing, but one that's run for the community in an open and transparent way. You'd have systems that share their device health metrics with this public data set service: they publish their SMART metrics, and in response they get a disk failure prediction. So it's using an accurate (hopefully) failure prediction as a carrot to motivate people to share their data.

People are obviously and naturally skeptical of any situation where you're sharing data about your internal systems, so it's very important to make the system transparent and to protect privacy: for example, uniquely identifying devices and hosts with randomly generated IDs, without having any identifying information like IP addresses logged, and so forth. There's a question around whether you would want to share your serial number; that's really the only identifying information within the device metadata. And it's a trade-off, because if you do have it, you can identify things like bad batches of devices coming off the manufacturing line, which
might correlate with failures; but people might be more paranoid knowing that they could be identified as having that particular batch. Hopefully the goal is to motivate people to share as much information as possible, because they get a more accurate prediction in return. The result would be that you accumulate this big database, generate this data set, and then share it with academic researchers, the open source community, and anyone who's trying to build a better failure model. You'd bypass the problem we currently have, where there just is no good data to train these things against.

One of the key challenges with this that we've identified is identifying the failure events, because when you're training a model, you need to know what the signal is that you're actually trying to predict: when did the device fail, after all these health metrics were reported? And part of that is a definitional question, arriving at what the definition of an actual device failure is, because that might vary between different users. Is it when the device is completely offline and won't even spin up? Is it when you have too many read errors and you finally decide that you're not going to use it anymore? Is it when you get a single read error and you decide not to use it?
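A training pipeline can sidestep this disagreement by treating "what counts as a failure" as a pluggable predicate that labels the examples. A minimal sketch, where the field names and thresholds are illustrative assumptions:

```python
# Sketch: treating "what counts as a failure" as a pluggable predicate,
# since operators disagree on the definition. Field names are illustrative.

def failed_hard(dev: dict) -> bool:
    # Strictest definition: the device is completely offline.
    return not dev["spins_up"]

def failed_read_errors(dev: dict, max_errors: int = 100) -> bool:
    # Looser definition: dead, or too many read errors even if still online.
    return failed_hard(dev) or dev["read_errors"] > max_errors

def failed_any_error(dev: dict) -> bool:
    # Most conservative operator: a single read error is already a failure.
    return failed_read_errors(dev, max_errors=0)

dev = {"spins_up": True, "read_errors": 3}
print(failed_hard(dev))         # False
print(failed_read_errors(dev))  # False
print(failed_any_error(dev))    # True
```

Whichever predicate an operator chooses becomes the label used to train or evaluate the model for their environment.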
Different environments have different thresholds for when they say, "I'm going to stop using that device; it's no longer acceptable." The other thing is what happens when a device fails in the real world. Imagine, in the wild, somebody running any system, whether it's Ceph or something else, and a device fails. There's some action they're going to take: their RAID array might use a spare, they might replace the device, they might just leave it failed in place. Lots of different software stacks and human interventions might be involved, and I think it's unrealistic to require that those users take an additional step of notifying the cloud service they're sharing their data with: "Oh, by the way, I decided that this device failed." If it's not automatic, they're not going to share that information. So it's hard for this public service to get that signal and know when the device actually failed.

The other thing you have to be careful with is that if your failure prediction is working really well, then the devices won't actually fail, right? You'll take them out of service before they actually crash and burn, and then you have to be careful that that's not polluting the model as it continues to be refined and trained.

So one idea of how to deal with this, how to infer failures, is to associate the devices with the hosts that contain them.
So typically you have a server that has multiple hard disks in it, and as long as you have a unique identifier for the server, the service can see that there are some number of devices associated with that particular host. Over time, you're going to be receiving a stream of metric updates for all those devices, but after a failure, presumably, you're going to stop getting reports for just that one device. So the idea is basically to infer that a device failed if you continue receiving reports for all the other devices in the system, but that one particular device went quiet. And perhaps you only do that if you see signs that the device is likely to fail, and then it goes away; then you can assume that it went away because it actually failed, or because it was taken out of service.

I think the real question here is a data science question: is there a sufficient signal, using that kind of inference, to train an accurate failure prediction model? Probably the more specific task would be to validate that type of approach by inferring those data points from an existing data set, like the big Backblaze data set, which actually does have failure events, because they took the time to record that a device failed. You'd ignore those labels, try to infer failures using this method, and then see if you can still train an accurate model, in order to validate whether this would in fact work. That's a question for a data scientist that hopefully someone will pick up.

So that brings us to the third phase. We had the metrics collection, we had the prediction model, and now we have the response phase. So the question is: what do we do when a disk is about to fail?
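Stepping back for a moment, the reports-go-quiet inference heuristic just described can be sketched in a few lines; the data shapes here are illustrative, not any real service's schema:

```python
# Sketch of the inference heuristic: if a host keeps reporting for its other
# devices but one device's reports stop, presume that device failed (or was
# pulled from service). Data shapes are illustrative only.

def infer_failures(reports: dict) -> list:
    """reports maps host -> {device_id: last_report_day}.
    A device is presumed failed if its last report is older than the
    newest report from any sibling device on the same host."""
    presumed = []
    for host, devs in reports.items():
        if len(devs) < 2:
            continue  # no sibling devices to compare against
        newest = max(devs.values())
        for dev, last_seen in devs.items():
            if last_seen < newest:
                presumed.append(dev)
    return presumed

reports = {
    "host-a": {"disk-1": 100, "disk-2": 100, "disk-3": 92},  # disk-3 went quiet
    "host-b": {"disk-4": 100},                               # nothing to compare
}
print(infer_failures(reports))  # ['disk-3']
```

A real version would also require the device to have shown warning signs first, as suggested above, to avoid mislabeling devices that were simply decommissioned.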
Well, the question is how much time we have left. If we have enough time, maybe we can just let the Ceph operator know that one of the devices is about to fail, and they can do whatever they decide. But if we don't have enough time, we would like the cluster to automatically try to self-heal; like we mentioned, Ceph is self-managing and can be self-healing as well. So what we do is mark these OSDs out of the cluster and migrate their data to other devices. We also divert the workload away from these devices, so we're not causing more harm. The thresholds for these actions are all configurable. I think the default now is that if a device is predicted to fail less than two weeks from now, we automatically take action; but if the life expectancy is greater than two weeks, the Ceph operator can decide what to do. And then there's the question: after the device is successfully offloaded, what do we do with it? Should we drive it to failure, to prove the model right or wrong?

Some open questions remain. Currently, we have merged the portions for device management, metrics collection, and the automated response. We still have the huge pull request from ProphetStor under review; hopefully it will be merged soon. We're targeting this feature for Nautilus, the next release of Ceph, in February 2019. And in the future, we wish to see an open SaaS service and open data, just like Sage mentioned, and to have improved free and open source prediction models.
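Going back to the automated-response phase for a moment, the two-week threshold policy can be sketched like this; it mirrors the behavior described in the talk rather than Ceph's actual implementation, and the names are illustrative:

```python
from datetime import datetime, timedelta

# Sketch of the response policy described in the talk: if a device's
# predicted life expectancy is under a threshold (two weeks by default),
# act automatically; otherwise just warn the operator. This mirrors the
# behavior as described, not Ceph's actual code or option names.

MARK_OUT_THRESHOLD = timedelta(weeks=2)

def plan_response(now: datetime, predicted_failure: datetime) -> str:
    remaining = predicted_failure - now
    if remaining <= MARK_OUT_THRESHOLD:
        # Not enough lead time: mark the OSD out so data migrates off the
        # device, and divert the workload away from it.
        return "mark_out_and_migrate"
    # Enough lead time: raise a warning and let the operator decide.
    return "warn_operator"

now = datetime(2018, 11, 1)
print(plan_response(now, datetime(2018, 11, 10)))  # mark_out_and_migrate
print(plan_response(now, datetime(2019, 2, 1)))    # warn_operator
```

Making the threshold configurable, as the talk notes, lets each operator trade off how aggressively the cluster self-heals against how much recovery traffic it generates.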
So this is a call to academia, to professionals, to everyone who is interested in this project: everybody can help.

This project happened in collaboration with the Outreachy organization, like we mentioned earlier, which offers paid internships that promote diversity in free and open source software. There are many communities participating; this is just a partial list, as you can see. The way it works is that applicants look at the projects that are participating, pick a project they're passionate about, contact the mentors, and see which contributions can be made to that project. And again, this is hands-on. It's not just studying theoretical stuff; it's real. Many times there is a certain barrier (maybe that's the right word here) that prevents people from contributing to open source, because it can be intimidating. With this organization, you get the support of mentors, which is fantastic, and the contributions don't have to be only from developers: you can contribute documentation, bug fixes, marketing, so there are a lot of ways to contribute. Then you fill out an application form. I think it has made many projects happen, many, many contributions to open source, and that's a win-win for everybody, because the interns get experience, and again the support, which is not trivial. And for potential mentors: if you have any project that you think would be good for newcomers to open source, I encourage you to propose it, because eventually it makes the project happen. The internships run twice a year; the next one starts in December, and applications open September 10th. You can see the website.
I really encourage you to go and check it out, and tell everyone about it. On a personal note, I'll share my experience with the project. So, I picked the project I was passionate about and made the contribution with Sage's guidance, and then, once it starts, a very common syndrome for interns is impostor syndrome. You start to realize: oh my god, maybe I'm not the right person to do this. The code base is huge; where do I write the next line of code? You have some fear and doubt. But then, and we all know this but tend to forget it, nobody knows everything; we're all learning all the time, and it's just good to remember that. And the Ceph community is great. They're sincerely happy to help, they don't criticize you, and they know that it's okay that not everybody knows the whole project. At the very first stages, Sage told me that there is no such thing as a silly question. Just think about it; sometimes we are afraid to ask.

So, first of all, I would like to thank the Outreachy organizers, Sage Sharp, Marina, and Karen, and all the team, who were very attentive, responded to everything super quickly, and really wanted to help; all of the mentors who took part in this project; and of course Sage, a huge thank you for all your patience and all your will to help. Thank you. So, the challenges, just a quick recap for the project:
we still have the smartctl changes in the smartmontools upstream, and we hope to see that release around the end of the year. We still have some changes to the architecture, and we still don't have the data to build our own model; hopefully we'll have some open data soon. The outcomes we had from the project: we have a modular approach for the metrics collection, the prediction model, and the response, and we have a new participant in the Ceph community, ProphetStor, who are very thrilled about this project. And that's it. Thanks to the smartmontools upstream, Christian Franke, who has been great; to ProphetStor for contributing; and to Outreachy, of course, for setting up the internship. We're about out of time, but if you have any questions, please find us and ask.

Question: The last time I did this in production was a couple of years ago, but one of the things I found was that the data supplied by solid-state storage was both completely inconsistent between different vendors and often extremely sparse. Has that improved at all in the last couple of years?

It's quite complex. I mean, we're focusing on collecting the data from potentially all the devices, but at the beginning we were focusing on hard drives, so I'm not sure if it has improved. Yeah, I think mostly we're relying on smartctl to scrape everything, because they've been pretty thorough. And as far as what data we're actually getting out, and the ability to actually predict on that, that's sort of the self-contained prediction problem, at least the way I've been viewing it. So we're starting with something, and the hope is that the situation will improve over time.

Question: I was wondering how you evaluate the accuracy of the ProphetStor predictions.

The truth is that we haven't yet. They've provided a model, and we haven't had the time to evaluate it yet.
So we're focusing just on the system integration problem first.

Question: I guess I had a similar question, about the model. It says it's 95 or 97 percent accurate, but that doesn't tell a lot; a better metric would be the false positive rate and the false negative rate. Because if it's a false positive, there's a bunch of things that you do to make sure there are no problems, right? So how good is the model in those terms, and how much do you care about false positives?

Yeah, that's exactly right. There was a talk I saw at Vault last year that defined this; I'm forgetting the technical term, but it's a two-dimensional matrix, essentially, where you have the probabilities of false positives and false negatives. Accuracy is the number they've given us, and again, we haven't actually analyzed it from a data science perspective, so it's unclear. Okay, thank you.

Question: And we have one more question. Basically, if I were to try something like this out, would I need to set up an agent to collect these logs and push them somewhere?

You're talking about the open data set? Yeah. I think there are sort of two goals. One is to make it work for Ceph out of the box, and if we have a public data set target, you could just turn it on. But the expectation is that not all storage is Ceph; you obviously have lots of other storage in your data center. So we'd want to build an agent that you just install on any host, turn it on, point it at the upstream service, and it would just share that data. Thank you.

All right, thank you, Sage. We have lunch served at the Ziskin lounge. We will be resuming the session at 1:30. Thank you.