So the talk I'm giving now is about the computer science behind a modern distributed data store. It is a hard problem, and what I would like to tell you about today are the four or five major challenges we face when creating a distributed database — the one we at ArangoDB, that's the company I work for, are building right now and which is out there. Most of this is already done, and I would like to talk about how computer science helped us implement all this stuff. The toughest challenges I've selected for this talk are: first of all, resilience and consensus; sorting; log-structured merge trees, which is a super important topic for us; then logical clocks; and finally, how we can do the full ACID thing that you know from the single-server relational world in a distributed environment — distributed ACID transactions. The bottom line of this talk is: you need computer science to implement a modern data store. You can try with trial and error, but most likely you won't succeed. — Yes, please? — Okay, I can try to speak a bit louder. I thought it would be on the microphone, but I guess the microphone is only for recording, right? Okay, then I will just speak louder. Okay, thanks. The format will always be: I describe a problem, and then I show you how we can use computer science to solve it. The first problem: resilience and consensus. A modern data store is distributed — the data is too large for a single machine, we want to add more machines, we want read scaling, write scaling, and all this stuff.
So the data store is distributed because it needs to scale out, the thing I was just talking about, and it has to be resilient: no matter what happens, the system should be there. If a machine is on fire, the system should still be there. And this brings the challenge that different parts of the system need to agree on certain things. If you say "I'm storing a new document", the database says "yes, I stored it", and you ask for the same document again, you don't want the database to say "oh, which document? I don't know about it." It should be there, so the servers need to agree that it is there. Consensus is the art of achieving this as well as possible, in software only. And consensus is relatively easy as long as everything is good: you have your networks, your machines are running, they can communicate with each other — then it's actually super simple. But it's super hard if — wrong button — the network has outages, or the network drops, delays, or even delivers a message twice or more times; if disks fail and come back with old or corrupted data; if machines fail and come back with old or corrupted data; or even if your entire data center, a complete rack, goes down and comes back with corrupted data. You might argue: okay, on my laptop this never happens in my lifetime. But if you are running at a large scale, one of these events most likely will happen every night, and I'm pretty sure none of you wants to wake up in the middle of the night to fix it. You want the servers to just work.
And we haven't even talked about malicious attacks from enemies yet. The solution to this problem: consensus protocols — several of them exist. The first protocol is called Paxos, published in 1998. Paxos is super good, but it's a challenge to understand and even more of a challenge to implement correctly. I personally have read the paper like five times and I'm still not sure I understood everything — the thing is super complicated. Various variants exist to make it a bit simpler to implement and to follow, and the most well-known one is called Raft, from 2013. Raft has been designed to be understandable and implementable. It is still complicated, but it's doable. So my advice is: if you're interested in this topic, first try to understand Paxos. Read it a couple of times, spend some time trying to understand it, but do not try to implement it — that is most of the time a waste of time. After you have understood Paxos, or kind of understood it, enjoy the beauty of Raft and see how simple it gets. If you start with Raft directly, you'll still think "oh, it's too complicated", but this way around actually works. My advice: do not implement Raft either, because it's still too complicated. You should, if possible, by any means use one of the well-tested implementations that you trust. There are a couple out there: ZooKeeper, then we have etcd, and in ArangoDB we also have a Raft implementation which is called the ArangoDB Agency, and I think there are a couple more. I think Neo4j uses ZooKeeper, right? Oh, Raft, implemented as well.
Okay, nice. But most importantly: do not try to invent your own algorithm, unless you have two years where you can freeze time, implement the algorithm, and then continue where you were. Otherwise you will probably lose a lot of time just reinventing the whole thing. And now I'll try to do a really hard thing: explain Raft in a single slide. First of all, we need an odd number of servers — why odd, we'll come to in a minute. Each of them keeps a persisted log of events: they just write down "this event happened, that event happened" in one order. Everything is replicated to everybody, so each of the servers has — and has to have — the same log. Then they democratically elect a leader with an absolute majority. That's why we need an odd number of servers: in a network split, one of the two halves will contain a majority of the servers and can elect a leader, and the other half will not. Only the leader may append to this log, so you can only write on the one leader, and this makes sure the data is persisted everywhere: an append only counts as successful if a majority of all servers have persisted and confirmed it. These two rules actually allow you to shut down any of these machines at any time and still have a correct and consistent state across all the others. Then you need some very smart logic to ensure there is a unique leader and automatic recovery from failure. I think I could give a full hour's talk just about that smart logic, so let's just say: we need smart logic to guarantee these things. It's a lot of fun to get this right, but it is proven to work.
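The two majority rules above — an entry counts as committed only once a majority has persisted it, and a candidate becomes leader only with an absolute majority of votes — can be sketched in a few lines. This is a toy illustration, not a real Raft implementation; the function names are made up for this example:

```python
# Toy sketch of Raft's two core majority rules (not a full implementation).

def majority(cluster_size: int) -> int:
    """Smallest number of servers that forms an absolute majority."""
    return cluster_size // 2 + 1

def entry_committed(cluster_size: int, acks: int) -> bool:
    """The leader may report success once a majority of servers
    (including itself) has persisted and confirmed the log entry."""
    return acks >= majority(cluster_size)

def wins_election(cluster_size: int, votes: int) -> bool:
    """A candidate becomes leader only with an absolute majority of votes."""
    return votes >= majority(cluster_size)

# With 5 servers a majority is 3: if the network splits 3 + 2, exactly one
# side can still elect a leader and commit entries; the other side stalls.
assert majority(5) == 3
assert entry_committed(5, 3) and not entry_committed(5, 2)
assert wins_election(5, 3) and not wins_election(5, 2)
```

This is also why the server count is odd: an even split like 2 + 2 would leave neither side with a majority.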
We had a lot of battle stories with tiny little bugs that didn't work out at the beginning and were hard to find, but eventually we succeeded. The next thing is: one puts a key-value store on top of this Raft protocol. The key-value store then basically fills the entries of the log with its events, and thereby you have a consistent key-value store which is resilient against failures. Let's try a short demonstration. I have uploaded these slides, so they should be clickable — oh, I need an internet connection, haven't set this up, so never mind. The idea of Raft is that we have, say, five servers. One server first runs into a timeout and sends out "I want to be leader" to all the other ones, and as soon as the majority answers "I accept you as the leader", it will be the new leader. Otherwise the next server tries to become leader, until we have one. During this process no events can be written. Eventually we end up with a leader, and if we shut down the leader, one of the other ones will figure out "oh, the leader is gone, I have to take over", and so on. So we always have one leader, unless all of the servers are down. — Something happened with my slides, sorry. Yeah, I think it downloaded stuff from JavaScript libraries. Just give me a minute... there we go, thanks. Hopefully that works in a minute. Doesn't seem to load this one. Somehow it doesn't work, so never mind, let's continue without the demonstration. Next problem: sorting. So first of all, we have now solved that all servers can agree on things. Next problem: sorting. The problem: data stores need indexes, and indexes typically are sorted. — Could you please close the door? — So: the problem is that data stores need indexes, and indexes are sorted.
So in practice we need to sort things. Not an unknown problem, but almost everything you learned at university or at school about sorting algorithms is rubbish on modern hardware. Most of the published algorithms are optimized for the number of comparisons you have to do, but none of them takes modern hardware into account. The problem is no longer the comparisons and computations but the data movement, because it is more expensive to load data from the lower levels of the memory hierarchy than it is to do a couple of computations on data sitting in the high-level caches. Since the time when the Apple IIe was blazing-fast hardware, which is a couple of years ago, the compute power in one core has increased by roughly 20,000 percent, while a single memory access has only become faster by a factor of 40. In addition we have up to 32 or even more cores in a single CPU. That means computational power has outpaced memory access by a factor of roughly 1,280 or even more — we can do on the order of a thousand, twelve hundred operations in our CPU in the time of one data access, compared to the time when most of these algorithms were invented. So we need an algorithm that can sort in parallel, makes use of the number of CPUs we have, and makes use of all the cache layers we have above our main memory. And the only algorithm that really works there is merge sort. The idea of merge sort is: I have blocks which are sorted from the beginning — at the start each block is exactly one element large, so it is sorted by definition — and then I merge two of them, and the result is again sorted. In addition, I take the sorted blocks and put their current head elements into one min-heap. A min-heap is a balanced tree — balanced means all paths from the root have around the same length, some may be a bit shorter — and the only rule is that the root element is smaller than all elements attached to it, so all elements below are larger. It is pretty easy to remove this top element from the structure and put one of the other elements on top, because then the condition still holds. With this structure I can put the heads of several sorted blocks together and just look at the top element to decide which one to pick next. If I did, say, a quicksort instead, I would pick one of the elements and then search through the entire block, which could be large, to put elements on the left or the right-hand side. With this structure I always look at the top element, and the upper part of the tree actually fits into the level-1 cache of a CPU, and a bit more fits into the level-2 cache. That means I have super fast data access at the top of the tree, and with it I can do good sorting on a large data set. With this algorithm you can actually get all your 32 CPUs working at 100% utilization on sorting a huge data set, which is the desired solution for fast sorting, because nearly all comparisons hit the level-1 or level-2 caches.
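The merge step described above can be sketched with a min-heap holding the current head of each sorted run. This is a simplified single-threaded illustration; a production sort would merge runs in parallel and operate on raw memory blocks:

```python
import heapq

def kway_merge(sorted_runs):
    """Merge already-sorted runs using a min-heap of run heads.

    All comparisons touch only the small heap, which stays in the
    L1/L2 caches, while the runs themselves are streamed sequentially.
    """
    heap = []
    for run_id, run in enumerate(sorted_runs):
        it = iter(run)
        first = next(it, None)
        if first is not None:
            # (value, run_id, iterator): run_id breaks ties so the
            # iterators themselves are never compared.
            heapq.heappush(heap, (first, run_id, it))
    out = []
    while heap:
        value, run_id, it = heapq.heappop(heap)   # smallest head wins
        out.append(value)
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, run_id, it))
    return out

assert kway_merge([[1, 4, 9], [2, 3, 10], [5, 6]]) == [1, 2, 3, 4, 5, 6, 9, 10]
```

Python's standard library offers the same idea as `heapq.merge`; the explicit version above just makes the heap-of-heads structure visible.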
Traditional B-tree based structures, like we have in relational databases, often fail to deliver on these requirements. For bulk inserts the B-tree structure has some drawbacks when we try to insert a lot of data across its different layers, and B-trees don't provide super fast access if the hot set does not fit into main memory. How could we do that? The solution is a so-called log-structured merge tree. It's a several-layered, or leveled, approach — you can define how many levels you want. In the first level we have short blocks of data, all of them sorted, but short enough to fit into the L2 cache, or the L3 cache, but not larger. So we can write to them easily and fast, and of course they have to be sorted — with the algorithm I presented before. As soon as we run out of space there, we need a compaction task: a compaction takes a couple of these blocks in level 0, merges them together into larger blocks which are still sorted, and pushes them down to level 1. Level 1 is larger, but we know each large block is sorted. And if level 1 fills up, we push down to level 2 — again much larger, still sorted — and so on and so on. This push-down can happen in a background thread, so it doesn't block any ongoing work. Writes can be done in level 0, so they are pretty fast, because that is mostly main memory, backed up on disk of course. Thereby we have fast writes, and we have compaction going down the levels; if you want to search for something in there, it's logarithmic. However, in addition to these sorted blocks, we attach something called a Bloom filter — or its more modern version, a cuckoo filter. A cuckoo filter is a persisted data structure which you can ask: I have a key in my hand,
do you know if this key is stored in your block? And the thing should answer yes or no — but it is allowed to lie, in exactly one case. If the Bloom filter says "no, I don't know this key", then it is guaranteed that the key is not inside this block. If it says "yes", the key may still not be in there. But this is not as bad as it sounds: if I need to find a certain data set, I start at level 0 — either search there directly, because we are in main memory, or ask the Bloom filter — and most likely only one of these filters will say "yes, you probably need to scan my block, because I may have the data". If you don't find it there, you just go down, and most of the time only one of the filters will say "probably I have it"; and the further you go down, the lower the chance of false positives, because the Bloom filters get bigger. And if a data set is not stored at all, it's actually quite fast to find that out, because all the Bloom filters, which are fast, will say "no, I don't know this key". As far as I know that is constant time plus a memory access, because it's hash-based; it doesn't have the logarithmic search which you would have if you searched inside the data. And the next thing is: the Bloom filter is small enough to keep in main memory, although the sorted list of documents may be like two terabytes large. Summary: the first writes go into the memtables, which are level 0 — in-memory, most likely memory-mapped files. All files are sorted.
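A minimal Bloom filter showing the "no is definitive, yes may be a false positive" behavior might look like this. It's a toy sketch — real engines use carefully tuned bit arrays and hash functions, and often cuckoo filters instead; all names here are made up for the example:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: 'no' is guaranteed, 'yes' may be a false positive."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int serves as the bit array

    def _positions(self, key: str):
        # Derive k bit positions from k independent-ish hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False  -> key is definitely NOT in the block (skip scanning it).
        # True   -> key MAY be in the block (scan to be sure).
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
assert bf.might_contain("user:42")   # stored keys always answer "maybe"
```

The payoff in an LSM tree is the negative case: a block whose filter answers "no" is never read from disk at all.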
Oh, and I forgot: all of them are immutable. As soon as I have pushed something down into level 1, it stays immutable, and thereby I can very easily cache the Bloom filters, because their answers will never change. Merge sort can be used when pushing blocks down, so we have a pretty fast algorithm to create the sorted sets of the lower layers. All writes do only sequential I/O: on level 0 the memtables are just appended to, maybe I have to sort when pushing down to level 1, and for the other levels I always know the first element, the second element, the third element — I even know in advance how large the output file will be, because I know how large both input blocks are. Then we have Bloom filters or cuckoo filters for fast reads. So we get good write throughput, because we write to main memory, and reasonable read performance. Of course these things will be slower than putting everything into main memory with a hash index on top, but they are made for the situation where your data set is larger than your main memory. And because they are so good, they are used in a large number of databases: Bigtable, Cassandra, HBase, InfluxDB, LevelDB, ArangoDB, RocksDB, SQLite, and MongoDB uses them via WiredTiger. So a lot of databases use this technique. Next problem. The last two topics — sorting and log-structured merge trees — were important for a single node.
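The read path just described — memtable first, then the levels top-down, skipping every block whose filter says "no" — can be sketched like this. Plain sets and lists stand in for the real filters and sorted on-disk blocks, and all names are made up for the example:

```python
# Toy sketch of an LSM-tree point lookup (not a real storage engine).

def lsm_get(key, memtable, levels):
    """memtable: dict (level 0, pure main memory).
    levels: list of (filter_keys, sorted_block) pairs, top level first.

    A real engine would use Bloom/cuckoo filters and binary search on
    immutable sorted files; a set and a linear scan stand in for them here.
    """
    if key in memtable:                  # level 0: fastest path
        return memtable[key]
    for filter_keys, block in levels:    # deeper levels are larger
        if key not in filter_keys:       # "no" is definitive: skip the block
            continue
        for k, v in block:               # "maybe": scan (binary search in reality)
            if k == key:
                return v
    return None                          # all filters said "no": key absent

memtable = {"c": 3}
levels = [({"a", "b"}, [("a", 1), ("b", 2)]),
          ({"x"}, [("x", 9)])]
assert lsm_get("c", memtable, levels) == 3
assert lsm_get("x", memtable, levels) == 9
assert lsm_get("zzz", memtable, levels) is None   # no block was scanned
```

The last case shows why a missing key is cheap: every filter answers "no", so no block is ever touched.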
They were not about distribution, so now let's move back to distribution. In a distributed system we have several machines, each machine has its own clock, and with a probability close to 100 percent those clocks are out of sync. That means whenever we put a timestamp into a document and move it over to a different server, this other server might say "oh, this thing is kind of old" or "it's from the future", which is bad. So we need a solution for this. Even relativity tells us that we cannot always know the exact order of incoming events. In practice, clock skew happens — we are off by a couple of milliseconds; typically around 20 to 30 milliseconds of clock skew, even if we use all the network time synchronization protocols. If you are Google, or you have the same amount of money, you can buy atomic clocks and put one into every one of your machines, so you get a little bit closer — this is what they did with Spanner — and then you can rely on the clock of the machine more or less. The Network Time Protocol helps on ordinary hardware to get the skew down to around 20 milliseconds, but still, there is clock skew. — Yes, please? — Yeah, there is still a network delay between all your local servers, so it may be a bit smaller, but you will still have clock skew. You're welcome. And of course this is also designed for data-center-to-data-center replication, and the cross-data-center functionality works there as well. Because of all the above-mentioned problems, we cannot compare timestamps from different nodes. You can compare timestamps of the same node, but as soon as you try to compare timestamps from different nodes, you will end up with a broken ordering of events. Why would fixing this help?
Because in most cases you actually want the real order of events — what happened one after the other. Assume two users updating the same document: it's probably good to know who was first and who should get a conflict. If you have master-master replication, a write here and a write there, it's unclear who was actually first. So: for conflict resolution, for log sorting, or even for the detection of network delays — whenever there was a short network outage of a second, you could see that, because the message arrived a bit delayed. The next thing you often use in databases is so-called time-to-live. The easiest example is a session: a user logs in, he should stay logged in for the next two hours, and if he doesn't do anything within those two hours, the session data should go away. Hard to implement in a distributed system. So what is the idea for a solution? Every computer has a local clock. It may not be accurate, there may be clock skew, but we have one, and we use NTP to synchronize, so we have an upper bound on the skew. If two events on different machines are linked by causality — that means an event happens here, and because of this event something happens on the other machine — then the cause should have a smaller timestamp than the effect. Basically: a user request comes in, first timestamp; I write the document and send it out, second timestamp — then the first timestamp should be smaller than the second. And the causality is: a message is sent. So we send a timestamp with every message, and then we have a hybrid logical clock, because it is a hybrid of a physical clock — the one attached to the computer — and a logical clock, which is just a number that adds logic on top of the physical clock. The idea is that the hybrid logical clock always returns a value which is larger than both the local clock and the largest timestamp it has ever seen. That means if the other machine says "I'm already five minutes in the future", then the timestamps this logical clock sends out will also be five minutes in the future compared to your local clock: the clock takes the largest timestamp from the messages it has seen and adds a small increment, so its timestamp is larger than the logical timestamp from the other machine. Eventually the physical clocks synchronize again, and the local clock catches up with the largest timestamp seen. So there will be a small period where it is off, but the hybrid logical clock fixes this, and then we have the guarantee: whenever two events have a causal relation, they are ordered by the hybrid logical clock. So causality is preserved, and the logical time eventually catches up with the real time. If you want to read more details, a blog post about this is linked below — I have shared the slides, all the links will be available. Next topic. I think I'm good on time, right? Five minutes? Yeah, that will work.
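A hybrid logical clock along the lines just described can be sketched as follows. This is a simplified single-process illustration of the standard HLC update rules — the precise algorithm and its proof are in the blog post referenced in the talk:

```python
import time

class HybridLogicalClock:
    """Sketch of a hybrid logical clock: physical time plus a logical counter.

    Timestamps never go backwards and are always at least as large as the
    largest timestamp seen, so causally related events stay ordered even
    when the machines' wall clocks are skewed.
    """

    def __init__(self, physical=time.time):
        self.physical = physical
        self.l = 0.0   # largest physical component seen so far
        self.c = 0     # logical counter breaking ties within one physical tick

    def now(self):
        """Timestamp for a local event or an outgoing message."""
        pt = self.physical()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1                  # wall clock lags: bump logical part
        return (self.l, self.c)

    def update(self, remote):
        """Merge a timestamp received in a message, then tick once."""
        rl, rc = remote
        pt = self.physical()
        if pt > self.l and pt > rl:
            self.l, self.c = pt, 0       # wall clock is ahead of everything
        elif rl > self.l:
            self.l, self.c = rl, rc + 1  # sender is "in the future": follow it
        elif rl == self.l:
            self.c = max(self.c, rc) + 1
        else:
            self.c += 1
        return (self.l, self.c)

# A node whose wall clock is stuck at 100 receives a message "from the future":
clock = HybridLogicalClock(physical=lambda: 100.0)
sent = (105.0, 0)
received = clock.update(sent)
assert received > sent            # the effect is ordered after its cause
assert clock.now() > received     # local events keep moving forward
```

Note how the logical counter carries the ordering while the physical component waits for NTP to pull the wall clock forward — exactly the "small period where it is off, then catches up" behavior from the talk.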
So: distributed ACID transactions. In the database world, ACID means, first of all, Atomic: the entire transaction either works entirely or not at all. Consistent: I see a consistent state when I start my transaction, and if something happens in between, caused by other users, I don't see that state. Isolated: concurrent transactions don't see each other. And Durable: whenever I crash, the data is still there. All of these are doable if transactions happen one after the other, because then I have an ordering and I can see what is going on. But we have more machines, so transactions are not one after the other. In a distributed system we have to make sure that all nodes agree on whether a transaction has happened — for atomicity — because then I can ask: is this transaction that I see here done or not? How can I create a consistent snapshot across nodes, for consistency? How do I hide ongoing activity until commit, for isolation? And how do I handle losing one of the nodes, for durability? We have to take replication, resilience, and failover into account, especially for the last point: what happens if a machine crashes right after saying it has committed something? Then the failover machine should have committed as well. This is a topic for a whole week of talks. But with all the pieces we built up before, we can actually get pretty close to these guarantees. We need something that agrees on the status of transactions — we use Raft. We use the hybrid logical clock to get a time ordering on all the transactions. Then we need failover — of course I won't go into details there. And the last thing, isolation: this again is done via the agreement on the transactions. We could use Raft for everything that happens — that works, but the Raft protocol is super slow.
It is super consistent, but super slow, and that means your database won't be scalable anymore. So we need some kind of workaround: we only need to agree on certain snapshots, or points in time, which should be consistent, and then find something that gets by without all that machinery in between. Because this is so hard, I just made a list of which distributed databases do it and which don't — and most don't. ArangoDB doesn't have distributed ACID transactions yet; Bigtable, CouchDB, Couchbase, DataStax, and so on — actually most of the well-known databases don't have it. CockroachDB claims to have it, and I think they pretty much are there, though I haven't used it in production yet. Google Spanner claims to have it, because they use Google's atomic clocks and don't have the timing issue. And ArangoDB has a plan for how to do it, which is ongoing work right now, so sooner or later we will move ArangoDB from the one column to the other. Very few of the distributed engines promise ACID, because it's so hard and so many things can go wrong — you have to design for failure. The basic idea: use multi-version concurrency control. We can keep multiple revisions, and we just have to manage who sees which revision. Writes and replication are decentralized and distributed, without the new revisions being visible, because we haven't agreed to make them visible yet. Then we need some place where we can do the switching, and this place needs to be persistent, scaled out, replicated, and resilient — and here we are, actually, starting over at the beginning again, right? We need a system that solves exactly these issues in order to solve these issues. Never mind.
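The MVCC visibility idea above — a reader with snapshot timestamp T sees only the newest version committed at or before T — can be sketched like this. It's a toy illustration, not ArangoDB's actual implementation; all names are made up for the example:

```python
# Toy sketch of MVCC snapshot reads: every write creates a new version
# tagged with its commit timestamp (e.g. from a hybrid logical clock),
# and old versions stay around for readers with older snapshots.

def mvcc_read(versions, snapshot_ts):
    """versions: list of (commit_ts, value) pairs, in any order.
    Returns the newest value committed at or before snapshot_ts."""
    visible = [(ts, v) for ts, v in versions if ts <= snapshot_ts]
    if not visible:
        return None                  # nothing existed at that point in time
    return max(visible)[1]           # newest version the snapshot may see

doc_versions = [(10, "v1"), (20, "v2"), (30, "v3")]
assert mvcc_read(doc_versions, 25) == "v2"   # a reader at ts=25 never sees v3
assert mvcc_read(doc_versions, 30) == "v3"
assert mvcc_read(doc_versions, 5) is None
```

This is why uncommitted or concurrently committed writes stay invisible: a transaction only ever reads against its own snapshot timestamp, and versions committed later simply don't qualify.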
This is a bit easier a problem, though, because we have more control over it and can reduce the amount of data that actually flows through there. So: transaction visibility needs to be implemented with multi-version concurrency control, timestamps play a crucial role there, and therefore: hybrid logical clocks. That's it — all the links are on the slides if you want to get into deeper details, and I will be around for the rest of the day. We are an open source project, so if you liked the talk and would like to support us, it is super important for us if you go to GitHub and click "add a star", because that really helps. Otherwise, I'm open for questions. Thank you very much.

— Yeah, a question about the logical clock: is there no risk of a cascading push of the clock into the future? — Yes, there is a small risk of it escalating into the future. However, in practice the clocks will catch up because of the NTP protocol, and the push into the future is bounded — a couple of digits, I think eight or so. At some point the real timestamp takes over: you have the real timestamp and this offset, the offset can never overflow into the real timestamp, and at some point you actually get the real timestamp again.

— The question is whether there is a proof that the hybrid logical clock actually converges back to the real clock. Yes, it is proven — it's written up in that blog post. I can't sketch the proof right now, but yes, it's formally proven.

— The question is whether the consensus algorithm is resilient against malicious attacks. I think the algorithm itself: not really. If you get into one of these machines... I don't know. But the thing is, this algorithm is not open to the public; it runs inside your own network behind the firewall, and if an attacker is already in there, they probably have easier targets. Yes, please?

— Okay, yeah, I forgot to talk about this: what's the driver for actually getting a distributed system to ACID? The thing is development experience. If you have ACID guarantees in the database, it's comparably easy to write the application. If you don't have the ACID guarantees, you can end up with lost data; you have to make sure you wrote something, and maybe when you read it, it's not there; all the failover stuff you have to handle in your application; and you may see non-isolated things. It gets kind of hard in the application.

— The question is whether it isn't easier to live with all that in the application than to implement ACID in the database. Probably yes, but it means shifting the work from your shoulders onto our shoulders, which should be pretty desirable for you. So it's all good. Yeah — oh, sorry, yeah.

— The question is whether those are the only two databases out there in the world. The answer is: I must admit I don't know, because I haven't created this list myself — I would have to ask Max whether he only included open source databases or also commercial ones. So I don't know. Right, I think Cockroach is