Welcome everybody, thank you for joining. Unfortunately, as you probably already know from the description of the talk, I'm cheating: it's not about data science but mostly about data. It's not independent from data science, though, because all data science requires some kind of data; without data you can't do anything interesting with your GPU, except, I don't know, mining cryptocurrencies or playing some games. So I think the data part is very important, and for me it's not just important, it's a very exciting part of the technology, and in this talk I would like to share this excitement, this interest, with you.

Okay, there are two parts. The first is the data, and the data part is mostly about Apache Hadoop, because I'm Márton Elek and I work as an Apache Hadoop developer. I'm a Hadoop committer, and I'm also a committer and PMC member in the Apache Ratis project, which is an embeddable implementation of the Raft consensus protocol, and which is used in a new sub-project of Hadoop, Apache Hadoop Ozone. I also have some pet projects on GitHub; my current favorite is this one, so if you hate Helm charts in Kubernetes, you may give it a try.

Okay, so let's first talk about the data, which is Hadoop. Do you know Hadoop, or do you use Hadoop? Okay, a few of you. It's not required knowledge for this talk, but maybe it's better to share my view, my context. I have a two-minute full Hadoop course, just for you; this is how I usually explain what Hadoop is to my family, because they can't imagine what I'm working on. Hadoop is a big data system which can run on commodity hardware, so cheap hardware, but a lot of it, and usually it does big data calculations. And there are two main problems with big data (big data being the data which doesn't fit on my laptop): one problem is that I need to somehow split the data between multiple computers; the other one is the calculation problem, to run the calculation on the different nodes and somehow summarize the results.

Now I would like to focus on the storage side. In the Hadoop world there are multiple storage options. On one side you can use any cloud provider; it's usually not very well known that this is possible, but there are very well developed connectors for S3, Google Cloud Storage and others. And obviously we have our own storage cluster, the Hadoop HDFS project. Both of them have their own problems. The cloud connectors can sometimes be easier to use and sometimes cheaper, but at petabyte scale they can be more expensive, and sometimes they are only eventually consistent, which can make it harder to calculate anything on top of them. But the HDFS storage cluster also has problems. One of the famous ones is that it's not designed to handle many small files. Another is that it can be used easily only from the Hadoop world: if you have a TensorFlow application, or any other application, it can't consume it in a very easy way.

The typical design, and this is not just Hadoop but almost all storage systems, is very easy.
If I have a file, I usually just split it into blocks, and the blocks are replicated, usually in multiple instances, across multiple nodes. That's what we should do. In Hadoop we have multiple problems with small files in particular, because it was decided that all of the mappings are stored in the memory of the master node, and the other problem is that the slave nodes report all of their blocks, and if a node has, I don't know, 500,000 blocks, that can be a problem.

So in the new sub-project, which is Apache Hadoop Ozone, we try to split the responsibility of the master node. There are two main parts: one is just the mapping from a file to its blocks, and after that the blocks should be replicated between the nodes. And they are separated: we have two master nodes. The good thing about this abstraction is that on top of the lower-level abstraction, which just replicates binary data, we can provide additional services, such as an object store (Ozone itself is just an object store), or HDFS, or a POSIX file system, so it can be used from any other system, even any other Apache big data system which just needs something that is replicated.
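To make this two-layer split concrete, here is a minimal sketch in Python. It is purely illustrative, not the actual Ozone code, and the names (`put_file`, `file_to_blocks`, `block_to_nodes`) are hypothetical; it just shows the namespace mapping (in Ozone terms, roughly the Ozone Manager's job) kept separate from the replication mapping (roughly the Storage Container Manager's job), assuming random three-way placement.

```python
import random

# A minimal sketch of the two-layer split: one map for the namespace
# (file -> blocks) and one for the replication (block -> data nodes).
# All names are illustrative, not the actual Ozone code.

BLOCK_SIZE = 4        # tiny, just for the example; HDFS defaults to 128 MB
REPLICATION = 3       # three replicas per block

nodes = [f"dn{i}" for i in range(1, 7)]   # six data nodes

file_to_blocks = {}   # namespace layer (roughly the Ozone Manager side)
block_to_nodes = {}   # replication layer (roughly the SCM side)

def put_file(name: str, data: bytes) -> None:
    """Split the data into blocks and place each block on random nodes."""
    blocks = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block_id = f"{name}#{offset // BLOCK_SIZE}"
        blocks.append(block_id)
        block_to_nodes[block_id] = random.sample(nodes, REPLICATION)
    file_to_blocks[name] = blocks

put_file("file1", b"0123456789ab")
print(file_to_blocks)   # {'file1': ['file1#0', 'file1#1', 'file1#2']}
print(block_to_nodes)   # each block mapped to 3 of the 6 data nodes
```

The point of the separation is that any service, an object store or a file system, can sit on top of the first mapping, while the second mapping stays a generic replicated-binary-data layer.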
Okay, so this is the world where I'm working and living. And this is not just a separation of different parts of the existing Hadoop: we also need to optimize the current code and use more and more tricks to make it more scalable, more consistent and faster. But that's the object store part.

Let's talk about the science. This is actually just one very particular part of the implementation, but I think it's very interesting, and it's a good illustration of what can go wrong and what should be implemented very carefully in a storage system. The "science" is in the title because this part is based on two papers; actually there is a whole family of papers, but these are the most famous ones: the copyset replication paper and the tiered replication paper. And the question is: what can go wrong in a storage system? Well, when everything is good, almost all storage systems are the same, so the real question is how the problems are handled. And what are the problems? That's the next question. There are multiple problems, actually.

There are independent failures, something like a hard disk or SSD failure. These are not so dangerous according to the papers and the research, because there is a very low chance of having a failed hard disk, and an even lower chance of having two failed hard disks at the same time, so if you just replicate multiple times, usually even two replicas are enough to survive independent disk or node failures.

The more tricky ones are the correlated failures, when you have multiple node failures at the same time, and there are two types. The topology-related failures, usually when one rack goes down, can typically be handled just by copying one of the replicas of the data to another rack or another region, or with some hierarchy. But there is a third kind, the topology-independent failures: according to data from the big cloud providers, about once per year there is a power outage, and when the nodes are restarting, about 1% of them can't be started. I don't know if you know the feeling, when you have a server in the data center with an uptime of more than one year and you need to reboot it; it's a very interesting moment, because it may or may not come back. That's exactly the same problem: after a while there is a chance that a node can't be started, and it's independent of the racks. And that's the beginning of the problems.

So let's recall that we have files, the files are split into blocks, and the blocks should be replicated. For the sake of simplicity I will upload just small files here, because in that case one block per file is enough; it works in exactly the same way with huge files, it's just easier to draw on the slide. I have file one, I have the data nodes (these are the slave nodes), and I'm just storing the blocks on different data nodes: three replicas here, three replicas there. And now my favorite part: I can kill one data node, totally at random (I chose the number totally at random at home): data node two. No problem here, right? Because I still have multiple replicas of both blocks. So I can kill a second data node, let's say three this time. I still survive; I have replicas of both files. So let's choose another data node randomly, let's say five. And that's a bad situation, because now all three replicas of one of the blocks are gone; that block is no longer available. And the big question here is: how can we survive this?
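Before answering, here is a quick sketch that reproduces this experiment; it's hypothetical code assuming pure random three-way placement over six nodes, not the real HDFS placement policy.

```python
import itertools
import random

# A sketch, not the real HDFS placement policy: place 2000 blocks on six
# data nodes with random 3-way replication, then check every possible
# combination of three simultaneously failed nodes.

random.seed(42)
nodes = list(range(1, 7))
placements = [frozenset(random.sample(nodes, 3)) for _ in range(2000)]

all_failures = list(itertools.combinations(nodes, 3))
fatal = [f for f in all_failures
         if any(replicas == set(f) for replicas in placements)]

print(f"{len(fatal)} of {len(all_failures)} possible 3-node failures lose data")
# With 2000 blocks spread over only C(6,3) = 20 possible replica sets, every
# set gets some blocks, so the answer is 20 of 20: any three concurrent node
# failures destroy all replicas of some block.
```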
Does it help if I buy a lot more nodes? It turns out that it doesn't help at all, and to understand why, I would like to look at the situation from another view. Here I still have the data nodes, the blocks and the files, but if you look at the second file and the fifth file, you can see that both of them are replicated to data nodes one, three and four. So let's use this as a base element; in the paper this is called a copy set. We have a set of data nodes which contains multiple blocks: the block of the second file and the block of the fifth file are both in this copy set, the one-three-four data node copy set.

So the next math question: how many copy sets are possible if I have six data nodes? This is a simple math example, a binomial coefficient: in how many ways can I choose three data nodes out of six? There are 20 possibilities. For example, one copy set is the one-three-four data nodes, another is the one-two-three data nodes, another is the two-five-six data nodes. And let's say I have some blocks, not a lot, say two thousand; because I choose the data nodes randomly, at the end of the day I will have roughly 100 blocks in each copy set. It doesn't mean that node one will have just 100 blocks, because node one can be part of multiple copy sets. Is it clear? Hopefully, more or less. So these are the sets of nodes, and these are the blocks stored on the nodes. And the problem here is that I have all the different kinds of copy sets, all of the combinations, right? So if you choose three data nodes totally at random, you will choose a copy set which contains a hundred blocks. You can't choose three data nodes without data loss. That's the problem.

And that's why it doesn't help to scale up to, let's say, 600 data nodes, because with 600 data nodes and random replication I will do the same thing: I will randomly generate the copy sets, the different sets of data nodes. Now there are about 35.8 million possible copy sets, but if I also have more blocks, the situation is almost the same: I will have a lot of different sets, all of them containing some blocks, and if I kill three data nodes at random, data loss is practically guaranteed.

So how can we do better? Let's try a different way; it's still easier to draw with six data nodes. Let's try to have just two copy sets. These are two fixed groups of the data nodes, and when I have a file, I choose one of the groups randomly and save the file to that group. So block one is saved to the first group, to its three data nodes; the next block is saved to the second group. Is it better or not? That's the question. Here we have just two copy sets, but each copy set holds roughly 1000 blocks if I choose randomly. Okay: what happens if I kill three data nodes at random?
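We can answer that empirically. The sketch below (hypothetical code, same random-placement assumptions as before) first checks the binomial numbers mentioned above, then simulates many three-node failures under both placement schemes.

```python
import math
import random

# Sketch comparing random placement with two fixed groups when three of
# six data nodes die at once. Hypothetical code; 2000 blocks, uniform
# random choices, 2000 simulated failure events per scheme.

print(math.comb(6, 3))     # 20 possible copy sets with 6 nodes
print(math.comb(600, 3))   # 35,820,100 with 600 nodes

GROUPS = [frozenset({1, 2, 3}), frozenset({4, 5, 6})]
N_BLOCKS, N_TRIALS = 2000, 2000

def lost_blocks(two_groups: bool) -> int:
    """Place the blocks, kill 3 random nodes, return how many blocks died."""
    if two_groups:
        placements = [random.choice(GROUPS) for _ in range(N_BLOCKS)]
    else:
        placements = [frozenset(random.sample(range(1, 7), 3))
                      for _ in range(N_BLOCKS)]
    failed = frozenset(random.sample(range(1, 7), 3))
    return sum(1 for replicas in placements if replicas == failed)

for scheme in (False, True):
    losses = [lost_blocks(scheme) for _ in range(N_TRIALS)]
    incidents = sum(1 for n in losses if n > 0)
    print(f"two_groups={scheme}: loss in {100 * incidents / N_TRIALS:.0f}% "
          f"of trials, worst case {max(losses)} blocks")
# Typical output: random placement loses ~100 blocks in ~100% of trials;
# two groups lose ~1000 blocks, but only in ~10% of trials.
```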
There is actually a very high chance that it won't cause any data loss, because it causes data loss only if I happen to pick exactly the one-two-three or the four-five-six data nodes, and if I'm lucky enough, neither is chosen. So this actually seems to be a safer option, just making two distinct groups.

There is one interesting property of this replication: the number of blocks which are lost. With random replication there were 100 blocks in each copy set (not on the nodes, in the copy sets), so there is a 100% chance of losing about 100 blocks in case of three data node failures. With the two groups the chance is much lower, because I have only two copy sets out of the 20 possible, so the chance is 2 out of 20; but if I'm not lucky, I lose 1000 blocks. So it's something like this: a 10% chance to lose 50% of the blocks, or a 100% chance to lose about 5% of the blocks. What do you prefer? Two groups, yeah; usually the cloud providers also prefer the two groups, because there is a fixed cost to recovering data: you need to find the right tape and load the backup anyway. Because of that fixed cost it doesn't matter much whether it's 5% or 50%, you need to run a recovery anyway, so it's better to recover just once per year, even if it's a bit more data each time.

Okay, so did we solve the problem? Unfortunately this is still not the best one, because there is a problem with recovery. Let's say just one data node is down and I need to recover it: I'm replacing the hard disk and copying all the data back from the replicas. With random replication I can copy these 100 blocks from nodes two and three, those 100 blocks from nodes two and four, and so on: I have a lot of sources and I can copy from all of them in parallel. But in the case of the two groups, if data node one is down, I can copy only from nodes two and three; I need to copy 1000 blocks, but only from two disks, two nodes, so it's very limited on the I/O side.

So how can we improve it? Well, there are multiple options, actually, but the target is to find some sort of balance between the number of copy sets and the number of source data nodes. Here are just nine data nodes, and I create a copy set for each row and each column, so there will be six copy sets, right? Now if data node one is down, I can copy half of its data from data nodes four and seven, and the other half from data nodes two and three, so I can use four source data nodes to recover. And the chance of losing data is still very low, about 7% (6 out of the 84 possible three-node combinations), which is still way better than 100%. There are multiple options for how you choose, but basically this is the structure: we need to choose a limited number of node combinations, maybe two sets of combinations per node, and it can save a lot of data.
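Here is a small sketch of that row/column construction; it's illustrative only (real placement policies would also factor in rack awareness), computing both the data loss probability and the recovery sources for a failed node.

```python
import itertools

# Sketch of the row/column construction: 9 data nodes in a 3x3 grid,
# copy sets are the 3 rows plus the 3 columns. Hypothetical code.

grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]

rows = [frozenset(r) for r in grid]
cols = [frozenset(c) for c in zip(*grid)]
copysets = rows + cols                     # 6 copy sets in total

# Chance that a random 3-node failure wipes out a whole copy set:
all_failures = list(itertools.combinations(range(1, 10), 3))
fatal = sum(1 for f in all_failures if frozenset(f) in copysets)
print(f"{fatal}/{len(all_failures)} = {100 * fatal / len(all_failures):.1f}%")
# -> 6/84 = 7.1%

# Recovery sources if data node 1 dies: its peers in its row and column.
sources = set().union(*(cs for cs in copysets if 1 in cs)) - {1}
print(sorted(sources))                     # [2, 3, 4, 7] -> 4 source nodes
```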
Okay, so this is the summary of the approaches. With totally random replication, the chance of data loss in case of three data node failures is 100%, because some copy set will always go down, but I can re-replicate from all of the other nodes. With the two groups the chance is very low, but I don't have enough sources to re-replicate in case of a single failure. And the row/column copy set selection is, I think, a good balance between the two.

Okay, so that was the replication algorithm. It can actually be improved a little bit, for example by also calculating with rack awareness, but the main idea behind the replication is the same. So that was copyset and tiered replication; both of them are about finding the right copy sets. There are two main forces: the number of copy sets (the sets of data nodes), and the number of source data nodes available to recover the data, and we need to find the right balance between them. And yeah, I'm working on Apache Hadoop Ozone, and I think this is not the only interesting problem in a storage system, but maybe it's a good illustration of why I think storage is an exciting part of all these systems. Okay, any questions?

[Audience question: on the slide with six data nodes, why does one of the example copy sets contain a data node seven?] That's a very good question; which slide do you mean? The one where, from six nodes, the possible ways to choose three is 20? Oh, that's bad, I agree, it should be data node six. Sorry, and thanks.

[Audience question about how the copy sets are created and extended.] It's an interesting question. The big difference between the two papers is that the first paper couldn't do it in a very dynamic way: there is an algorithm to choose the right copy sets, but it's very hard to add more nodes later, because the copy sets were created originally and it's hard to extend them. But to be honest, if you have 600 nodes, it's not a big problem if there are two nodes left over at the end, because we have multiple copy sets and practically all of the nodes will be used, since one node is part of multiple copy sets. It's possible that one node ends up in just one copy set while other nodes are part of two, which can cause a small balance problem, but not a real one.

[Audience question: is this used in HDFS?] I think it's not in HDFS; a similar but not as powerful replication scheme is used by Facebook on top of HDFS. This one is in Ozone, which is some kind of offspring of HDFS, and there it will be the default, maybe in a more advanced form, calculated together with, I don't know, rack awareness, regional settings and all of that.

[Audience question: how close is it to production?] Ozone? That's a very good question. We have two alpha releases, so it's not production ready yet, but hopefully very soon it will be. Okay, then thank you very much.