I can't say I love it or hate it yet. It's different. It's so different. Is it recording, Eddie? Yes. Okay.

So today we'll be doing this talk, Under the Hood: Storage and Retrieval in a Database from Scratch. And when I say from scratch, I mean really, really from scratch. The examples today do nothing other than storage and retrieval: you input a key, you input a value, and it stores it into a file. I've written four code examples, of which the last one, I was too tired and didn't actually write it to file, but the concept is there. And if you want the full experience, this is from Martin Kleppmann's Designing Data-Intensive Applications.

So, to introduce myself again: I'm Su-Yang, and I'm a software developer at Shopify, which is how I got roped into this. Usually I write nonsense in HTML, CSS and JS, so my CodePen has a lot of nonsense: the hook there is done completely in CSS, this is a musical instrument, and this is some audio manipulation stuff. The slides are here. These slides were used at GitCam HD, which happened two weeks ago, I think; I can't remember. All the examples on those slides are in JS, but since this is a Ruby crowd, I decided, why not write them in the programming language I actually use in my job? Which is how I realised I don't actually know Ruby very well. So, that's that.

Fundamentally, a database just has to store data and retrieve data. And there are two big terms you can use to describe database workloads: online transaction processing (OLTP) and online analytical processing (OLAP). We will focus entirely on transaction processing; I don't think we have enough time to go through analytics, and I also don't know much about it, to be honest.

Transactions are usually your business logic: your CRUD, create, retrieve, update, delete. I'm not sure if that's the term being used nowadays; that's what I learned in Poly. Everything needs to be written fast, and reads have to be relatively fast, though not as fast as when you do analytics. Reads also have to be accurate: for analytics it's okay for data to arrive in batches or as a stream, but for transactions you usually want to know the write has hit the database right now, or by whenever your next step happens. So that's the OLTP we'll be going through today.

So, the first example: the world's simplest database. That's the JS code; let's jump to maybe a bigger font size. Okay, the simplest DB. I took some liberties with the term "simplest": the example given in the book was actually written in Bash, or whatever Unix shell the author was using, and it was literally two lines of code. So I took some liberties to demonstrate the same ideas. The simplest DB just writes to a file, and notice that we are appending instead of rewriting. Appending is usually very, very fast; you just put something at the end. Retrieval is a bit more involved. The code here is more complicated than it needs to be, maybe due to my lack of Ruby skills, but the idea is very simple: you iterate backwards from the last record, and once you hit the key you're looking for, you retrieve it. That's all this method does. So let's run it, or, well, I haven't really run this much since I wrote it, so let's hope it works.
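For reference, a minimal sketch of the idea in Ruby. The file name, the comma between key and value, and the semicolon between records are my assumptions based on what the demo shows; the real example in the repo differs in its details.

```ruby
# A minimal append-only store: set appends "key,value;" to the file,
# and get scans records from the end so the newest write for a key wins.
class SimplestDB
  def initialize(path = "simplestdb.db") # hypothetical file name
    @path = path
  end

  # Appending is fast: we only ever write at the end of the file.
  def set(key, value)
    File.open(@path, "a") { |f| f.write("#{key},#{value};") }
  end

  # Linear scan from the newest record backwards: O(n) in file size.
  def get(key)
    return nil unless File.exist?(@path)
    File.read(@path).split(";").reverse_each do |record|
      k, v = record.split(",", 2)
      return v if k == key
    end
    nil
  end
end
```

Usage would be something like `db = SimplestDB.new; db.set("test", "3"); db.get("test") # => "3"`.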
So, interactive Ruby, and then we load it. Now we have, what's it called, SimplestDB. simplestdb.set, and we add some key (it has to be a string), we add some value, and it stores it in simplestdb here. If we look through, "some value" is written here, along with some previous keys that I added. And then when we retrieve, like simplest_db.get, let's retrieve "test". Oops, it should be a string. Yeah, and it retrieves test 3: the first time it hits a match, it just returns.

So why do we do it this way? We don't want to waste time replacing records; we just want to write very fast. This is actually more effective than you'd think, because it's used as the logging mechanism in most existing databases: not your usual transaction storage, but the logs. You get a historical series of data you can refer back to, what a value used to be and what you changed it to, but at the expense of wasting a lot of storage. The way some databases deal with this is that over time they compact the data by merging segments together. Usually this is done by switching to a new file, writing to that, and merging the old files; once that's done, you move on to another file. So you keep historical data but recover storage space.

So, some learnings from writing this. I can't remember all the points I wrote down, but: writing has very good performance when you just append; you only need to know the end of the file, and you append there. Again, as I mentioned, this applies to the logging subsystems in databases, not the actual table writes, just the logs. Retrieval requires you to find the start of the next record; that's why we have the semicolon as a delimiter between records, and a different delimiter between key and value, just for convenience. Linear search for records is O(n), because you scan the whole file. Escape characters will mess up the storage if not handled properly: if, say, a value contains something like \n, and you use character counts to decide how far back to search, a newline will throw you off. And writing to multiple files requires some way to remember where you are: as I mentioned, if you want to expand this into something better that can handle compaction, you need to track which file you're writing to, and you shouldn't be writing to a file you're currently reading, because your data will change under you.

Okay, so some improvements we can make. I mentioned segmentation: you might want to split into new files once they hit a certain size. Compaction: removing duplicated keys within a segment (see the sketch below). Merging: you can have multiple files and merge two segments together, and you can compact and merge in the same pass. And crash recovery: I don't remember exactly what I wrote, but you can run checks on your segment files to recover from crashes.
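As a rough illustration of that compact-and-merge idea, here's a sketch under the same assumed delimiters as before: it rewrites a segment keeping only the newest value per key. A real database would do this in the background on an immutable old segment while new writes go to a fresh one.

```ruby
# Compaction sketch: scan an old segment, keep only the latest value for
# each key (later records win, because the file is append-only), and
# write the survivors to a fresh segment file.
def compact(old_path, new_path)
  latest = {}
  File.read(old_path).split(";").each do |record|
    key, value = record.split(",", 2)
    latest[key] = value # later occurrences overwrite earlier ones
  end
  File.open(new_path, "w") do |f|
    latest.each { |key, value| f.write("#{key},#{value};") }
  end
end
```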
Okay, so the next one: the hash index. That's the example; there's also a demo on the web if you want to try it out on CodePen, or you can just check out this repo. So now we load 2.2.rb, and, okay, I think there's some testing code, some nonsense testing code, and then, what's it called? Persistence.

If you look through the code here, you'll realise the code quality is a lot better, just because I asked a colleague to help me go through it. This is more Ruby-ish, the classes are better defined, they're not singletons like I did it. But if there's a bug, it's also his fault, because he was the one who wrote it and I just merged it.

No, it's not PersistedHashIndex. So at the end of this initialise, did we set something? No, we didn't set anything. Never mind. Yeah, error. Oh, we stopped calling it. Did we get it? No? Okay, there's some bug here. Oh yeah, because it's not a singleton anymore. Okay, the better software developer wrote it properly.

So what is happening here: whenever we write a set, let's look at the index JSON file first. It's just a hash of keys and values, and this is the JSON file. Everyone familiar with JSON? Okay, good. I'm just using it as a way to serialise a hash map. Of course, the performance of JSON is quite different from a real hash map; a hash map has a more complicated way of handling hash collisions and everything, and it's quite a bit faster. But the general idea is that we have some kind of index, and we also store this index to disk as a backup, so if we crash or anything, we can still recover the index. The index has to fit in memory, so it cannot contain the complete content of your data; that's why I'm just storing the position and length of the string I'm retrieving. In a proper database this would be the address of where you want to retrieve your data from. There's a sketch of this idea at the end of this section.

So when we write db.set with some other value, oops, it did not work. Maybe I need to refresh, or maybe not. Oh, okay, okay. Then we see that it's added the offset into the hash, and when we jump into the actual data file, the test value has been appended here. When we retrieve right now, because this has been overwritten by some testing, we realise we can't actually retrieve test one, but we can retrieve test three. Actually, let's retrieve test four: we get the key and value based on the address, which is the 114th character in the file. And if we overwrite it with some other value, all that happens is that the index entry is overwritten, and when we go into the data file here, it still just appends to the end. So this keeps our writes, appending to the large file, fast. And it doesn't matter if the index file on disk is a few seconds stale, because it's just a backup representation of what our hash actually is; the values are in memory. So when I go and retrieve, it's very, very fast, because the index is in RAM, and disk reads and writes are the slow part, which is why we only ever append to the big data file.

So, strengths: sequential writing is very fast for disk drives, which is why we still keep appending. There's also concurrency and crash recovery: our data in the database is always correct because we're append-only; we never manipulate existing data. Your segments are immutable, basically because it's append-only. And there's little disk fragmentation, because you're writing sequentially.

Some limitations: you can only hold so many key-and-address pairs in memory; after a while you run out of space, so there's a limit depending on your physical machine. And if you want to query a range of keys, or do some kind of computation over the indexes, it's not very efficient, because a hash table is still a hash table: it only gives you O(1) point lookups, and you can't do range operations with it out of the box; you'd have to handle that yourself.
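Here's the promised sketch of the hash-index idea. The file names and the [offset, length] index entries are my assumptions; the repo's version (the one my colleague cleaned up) is structured differently, but the mechanism is the same: append the value, remember where it landed.

```ruby
require "json"

# Hash index sketch: the data file stays append-only, but an in-memory
# hash maps each key to the byte offset and length of its latest value,
# so a get is one seek instead of a full scan. The index is mirrored to
# a JSON file purely as a crash backup; it may lag a little behind.
class HashIndexDB
  def initialize(data_path = "hash_index.db", index_path = "index.json")
    @data_path = data_path
    @index_path = index_path
    @index = File.exist?(index_path) ? JSON.parse(File.read(index_path)) : {}
  end

  def set(key, value)
    File.open(@data_path, "a") do |f|
      @index[key] = [f.size, value.bytesize] # record starts at end of file
      f.write(value)
    end
    File.write(@index_path, JSON.generate(@index)) # disk backup of the hash
  end

  def get(key)
    offset, length = @index[key]
    return nil if offset.nil?
    File.open(@data_path, "r") do |f|
      f.seek(offset)
      f.read(length)
    end
  end
end
```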
Okay, so the next example. How am I on time? Still okay. We're going to do a sorted string table (SSTable) and an LSM tree. Now, this one is my interpretation of what the book is saying; it could be entirely wrong. Basically, what I have, and this is quite a bit longer now, is a self-balancing binary search tree. When I insert, I have a binary tree, and I try to keep it as balanced as possible, so retrieval of a key-value pair is always O(log n). However, I can't store the entire database in memory, so the binary search tree cannot be the database itself. (I can in the next example, but not in this one, not when I want it to live in memory.) There are a few strategies for this, but the most naive possible way, which is how I'm doing it, is: after the tree hits a certain size, do I remember where I put the code, okay, here, so here we have the file name and the delimiters, after my binary tree has a certain number of nodes, I flush the entire tree to disk. And because I'm flushing from a binary search tree, all I need to do is walk it in order, and the output will always be ascending: it will always be a sorted string. Then I merge it into the existing sorted string file. Since I know both of them are sorted, all I have to do is iteratively compare which head value is smaller, and my new file will also be sorted; it's an O(n) operation. There's a sketch of this flush-and-merge step just below.

So we bear a maintenance cost on write, where we rebalance the tree; that's where it starts to slow down. But the trade-off is that we get much faster reads, and we can now search over a range of keys.

Okay, let me actually show the example. In the file here, you can see all the keys and values; as you can see, I got lazier with typing my key-values over time, but the keys are in sorted order, not the order in which I entered them. In this example I was using 1 2 3, ACBD, and 1 2 3 4, and it self-maintains the sorted order thanks to the data structure I was using. So let's load the tree. I don't need that, or do I? Maybe. And then we can do, what did I call it, the SSTable DB. Here we set some keys; the keys are letters of the alphabet. And when I get, so getting E, it's still not written to disk yet. Technically you could persist it somewhere; you could write a raw representation of the binary search tree, but I didn't have the capacity to do that. I had the time, I just didn't want to.

So what happens is that you first search through the tree, which is in memory (you can keep a backup of the in-memory tree if you're afraid of crashes). And once you exhaust the tree, because in a binary search tree, once you reach the end and still haven't found your data, you know it doesn't exist in the tree thanks to the sorted order, then you can fall back to a sorted search on the sorted string file.
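Before the search part, here's the flush-and-merge sketch I mentioned: an in-order walk of the tree comes out in ascending key order, and merging two already-sorted lists is a single O(n) pass, like the merge step of mergesort. The Node struct is an assumption to keep the sketch self-contained.

```ruby
# Each tree node holds a key, a value, and left/right children.
Node = Struct.new(:key, :value, :left, :right)

# In-order traversal of a BST yields pairs in ascending key order,
# which is exactly what we flush to the sorted string file.
def in_order(node, out = [])
  return out if node.nil?
  in_order(node.left, out)
  out << [node.key, node.value]
  in_order(node.right, out)
end

# Merging two sorted lists of [key, value] pairs is one O(n) pass:
# repeatedly take whichever head key is smaller. (Deduplicating equal
# keys is left out of the sketch.)
def merge_sorted(a, b)
  merged = []
  i = j = 0
  while i < a.length && j < b.length
    if a[i][0] <= b[j][0]
      merged << a[i]
      i += 1
    else
      merged << b[j]
      j += 1
    end
  end
  merged + a[i..] + b[j..] # append whichever side has leftovers
end
```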
Since the file is sorted, you can do an O(log n) binary search on it: find the midpoint, check whether your key is smaller or larger, and narrow down from there. So it's still relatively fast; the tree in memory, of course, is a lot faster than searching the string file.

Some strategies here: instead of flushing the entire tree to disk, you can selectively choose branches that haven't been recently accessed and flush those, while keeping the recently accessed branches in your tree. So yeah, the LSM stands for log-structured merge, and this merging happens periodically in a rolling fashion. In my case I wrote it to flush after hitting a certain size, but it doesn't have to be that. It's a bit difficult to understand, but basically it's choosing some branches to merge together and flush. I think this example uses two different trees, but you can do a sorted string file plus a binary tree.

Okay, so some improvements you can make. One problem we have is that if a key doesn't exist in the database at all, we still do an O(log n) search over multiple files anyway. What you can do to mitigate that is a special data structure called a Bloom filter, which uses a probabilistic model to tell you whether a key exists or not. It's basically like a set, except much smaller than a set. It might still give you false positives, but the guarantee is: once it says a key is not in the database, it's definitely not in the database; if it says it is in the database, sometimes it's not. And that's okay, because we're just trying to avoid searching through the sorted strings every time. (There's a toy sketch below.)

And I mentioned the strategies for merging and compaction: you can merge based on size, where you merge large tables together to remove the duplicated key-values, or based on age, the leveled approach, where you merge the oldest tables down, so that the most recent, most commonly accessed data is always on top and easier to search.
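Here's the toy Bloom filter sketch I promised. The bit-array size, the number of hashes, and the digest trick are arbitrary choices for illustration; real implementations size these from the expected key count and the target false-positive rate.

```ruby
require "digest"

# Toy Bloom filter: k hash functions set k bits per key. If any of a
# key's bits is unset, the key is definitely absent; if all are set, it
# is only *probably* present (false positives possible, never false
# negatives).
class BloomFilter
  def initialize(bits = 1024, hashes = 3)
    @bits = Array.new(bits, false)
    @hashes = hashes
  end

  def add(key)
    positions(key).each { |pos| @bits[pos] = true }
  end

  def maybe_include?(key)
    positions(key).all? { |pos| @bits[pos] }
  end

  private

  # Derive k bit positions by hashing the key with k different salts.
  def positions(key)
    (1..@hashes).map do |salt|
      Digest::SHA256.hexdigest("#{salt}:#{key}").to_i(16) % @bits.length
    end
  end
end
```

So before doing the O(log n) search over the segment files, you ask the filter; a "no" lets you skip the disk entirely.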
And next we come to what I hinted at, which is B+ trees. Okay, I hope my example explains it well. How many people are familiar with the B+ tree data structure? No one? Okay, very good, so I can say anything I like. Basically, I think I was taught this in university and completely forgot, so I had to Google what B+ trees are. It's kind of like a binary search tree of more than one order: it can be a ternary tree, it can be whatever. You have multiple keys per node, and you always have one more child pointer than you have keys, so the keys tell you which child, which file, to jump to next. So for this example, let's say I put in a 7: 7 is less than 13, so I go into the leftmost child. Okay, 7 is a bad example, but it's okay. 7 is less than 9, I go to the leftmost child; 7 is above 4, so I go into this one, and that's where I'd find my actual value, down at the leaf. No, I shouldn't expect 7; 7 is not there, it's not there. But if, say, I'm looking for 10: 10 is less than 13, and 10 falls between 9 and the next separator, so you go into the second child, and that points to this.

And what happens when you insert new data? You go through the same thing: you find a node. This is quite complicated, and I think this website explains it better than me, but I'll try my best. We have two different kinds of inserts. One is insert-at-leaf, which means you're actually inserting the key-value itself. Only the leaf nodes, the bottom tier of the pyramid you just saw, actually hold your key-value pairs; every other node just points to other nodes. You can think of it as a series of multi-layer pointers pointing to each other, each representing its own file. The files can reside on the same machine or on different machines; of course, it gets complicated on different machines. So that's insert-at-leaf, where you insert the actual value.

When we do the insert, first we need to search for where to put the value. Once we find the key and know which leaf we want to insert into, that leaf will definitely have space, because of some maintenance we do after each insert. So we insert the key-value into that leaf. And this is where the maintenance comes in: once we hit the condition where the number of keys becomes greater than, no, equal to the order (which means equal or greater, but since I'm inserting one at a time it's the same thing), we create a new node. Essentially, this whole chunk, including the insert-parent part, splits the node into two, so now I have two nodes with half the key-value pairs each, and then I construct a parent above. And if that parent is now too big, I split that as well and construct its parent. So from the bottom up, we check whether anything overshoots the order. In this example I wrote the order as three. Actually, this is a lot easier if I run it by itself first. Yes, we do have to do some maintenance here too, but it's quite different from the self-balancing tree earlier: that one rebalances at each node and swings the tree over to the other side, whereas this one creates new parent nodes for itself every time it splits.

So, load 4.rb in irb. Okay. With every insert here, I believe I'm pretty-printing the tree for some of the inserts: the second, third and last. At the very top level it's fairly simple: from here, we only have one node, right, and by this point we have two keys and two values already. After that, once I insert the third one, I realise I've hit the order of three. So what happens is I split that node into two new nodes: this is our original node, and this is the new sibling node, and then I create a parent on top which points to my file zero and my file one. So at the end we have a structure; it's quite messy, but if we actually want to trace it, let's say we want to search, all these keys are in test, right? So let's say we do a db, what's this db, is it just called b_plus_tree? Yeah, b_plus, oh no, it's b3, b3. Okay, let's do a get. What was the get again? I think it's find. So the difference between find and search is pretty much just that find returns me the actual value and search returns me the leaf node.
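That descent, and the find/search split, look roughly like this as code. The node fields (keys, children, values) are assumptions to keep the sketch self-contained, not the shape of my actual b3 code.

```ruby
# An internal node keeps separator keys and child pointers; a leaf keeps
# keys and their values. A node with no children is a leaf.
BNode = Struct.new(:keys, :children, :values) do
  def leaf?
    children.nil?
  end
end

# search: walk down from the root, picking the child whose key range
# covers the search key, until we reach a leaf. Returns the leaf node.
def search(node, key)
  return node if node.leaf?
  # the index of the first separator greater than the key is the child
  # slot to descend into; past all separators, take the last child
  idx = node.keys.index { |k| key < k } || node.keys.length
  search(node.children[idx], key)
end

# find: same descent, but returns the value stored at the leaf (or nil).
def find(root, key)
  leaf = search(root, key)
  pos = leaf.keys.index(key)
  pos && leaf.values[pos]
end
```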
So I'll do that. Okay, doesn't work. Why? I guess it was not initialised; we can always do it again. Okay, so we initialise a new one, we insert, and then we insert again; actually I should be able to run all of this. Okay, so now when we do b3.find, yeah, we get the key and value. This structure I wrote allows multiple different values for the same key, but it doesn't necessarily need to be that way. So if we want to search for test 4, and we look into the structure here, we should have printed it better, we start from the root file, which is probably not the newest one. We find that this string is larger than the one here, so we go into file 5, we jump to 5, and there it's larger than 3 but smaller than 5, so we go into the centre one, which is file 3. Jumping to file 3, okay, file 3 is here, and we get a key that is smaller, is it not printing? Okay, the key is here, so the value just comes from here, one-to-one mapped to the key. So that's what all this nonsense code is doing: maintaining the B+ tree data structure. Two minutes? It's okay, we've finished all the examples. So if you're watching the recording, all the best; the slides are there, slides.com.stats.jty.inc.

Okay, so yeah, the B+ tree is actually an older technology than the LSM tree. As we mentioned, the nodes, you can call them pages or nodes; in a database it makes more sense, it's the correct technical term, to call them pages. Basically, as I mentioned, other than the leaf nodes, everything is just pointers to other nodes, and whenever you overshoot the size, you create a new node, and nodes are modified in place. This is one of the things we must be aware of: unlike the structures we've covered so far, this one does not maintain your historical data, unless you keep storing values in the same array like I was doing. And it's built from the bottom up: every time you overshoot the order, you split the node into two and create a new parent.

Okay, next, some improvements that were mentioned in the book; I'm not sure if I remember everything. You can do write-ahead logs: because you're now modifying this data structure in place, you don't know what order the keys came in, so you might want some way to undo or replay whatever you inserted (there's a sketch of this at the end). Removing from the tree is still O(log n); it's not log base 2 of n anymore, since it depends on the order you chose, but it's still logarithmic, and then you can remove or change whatever you did. You can also do a copy-on-write scheme: whenever you write, you create a copy, so you can go back to previous versions, copies of the same segments. And only the branch you're accessing needs to be locked: because you have pointers to each file, you can manipulate different sub-branches of the same tree at the same time. The keys don't have to be stored in full, either: just now I was using test 1, test 2, test 3; you can abbreviate them, find a way to shorten them, maybe you only need the last character because everything else in the word is the same, so with some clever logic you can remove the repeated parts within the same table. And then, yeah, so when you create the pages, when you create...
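And here's the write-ahead-log sketch mentioned above: since B+ tree pages are modified in place, you append a description of each change to a log before applying it, and after a crash you rebuild by replaying the log. The tab-separated format is my own assumption.

```ruby
# Write-ahead log sketch: the log itself is append-only (cheap, like the
# very first example), even though the tree pages it protects are not.
class WriteAheadLog
  def initialize(path = "wal.log") # hypothetical file name
    @path = path
  end

  # Call this *before* mutating the tree, so the log is never behind it.
  def record(op, key, value)
    File.open(@path, "a") { |f| f.puts("#{op}\t#{key}\t#{value}") }
  end

  # After a crash, re-apply every logged operation to a fresh tree.
  def replay
    return unless File.exist?(@path)
    File.foreach(@path) do |line|
      op, key, value = line.chomp.split("\t", 3)
      yield op, key, value
    end
  end
end
```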