Okay, so we're going to be starting with the next talk right now, so please sit down, take your seats, and a round of applause for Kuba. He's going to be talking about the gitbase database, indexes, and interesting stuff. So, a round of applause for him.

Hello, I'm Kuba. I work at source{d}, mainly on projects related to storage and data processing, like go-mysql-server and the gitbase database and its engine. In my spare time I also maintain a couple of open-source projects, like an extended attributes package for Go and a zip compression library, and recently I helped my friend with his new programming language.

Today I want to talk about how we integrated a bitmap index into our gitbase database, so first I have to give you some context. gitbase is an application, but it's just a front end; the real engine is go-mysql-server, which is, by the way, powered by Vitess, another open-source project. gitbase is so far just a read-only database, so no inserts, no updates, and all the queries to git repositories are done by the go-git package. Pilosa is another open-source project which we use; it's a distributed index implementation. It implements the Roaring storage format, and apart from indexes,
it lets you keep attributes in BoltDB. So gitbase is just a standalone application which you can play with, just a front end, and all the well-known terms from git, like repositories, commits, and refs, are mapped to tables, and references follow git's naming convention. So basically you can play with all the usual features, like natural joins. For instance, this query uses a natural join: you want to get all the repositories where Alan Turing contributed on a HEAD reference, so you just query from refs and do a natural join with commits.

A more sophisticated query is this one, which uses one of our predefined functions that lets you extract an abstract syntax tree; it can recognize the language based on the content, and so on. So this query tries to extract all the identifier names from your Go files.

But of course we have indexes, and indexes are the most interesting part for this talk. So we have indexes, and you can create an index on one or multiple columns; this is simple stuff. The nice thing is that you can specify which index driver you want to use: the keyword in red, pilosa, means that you want to use Pilosa as the index driver. Also, you can create the indexes asynchronously or synchronously. And instead of creating the index on one or multiple columns, you can also create the index on an expression.

A little bit of theory about indexes before we dive in. We have hash indexes, which work like hash maps; they are very good for equality. The most popular indexes in databases are B-trees, and in most cases this is the best approach: they are self-balanced trees, a pretty good choice if you are wondering which one to use. R-trees are not super popular;
they index multi-dimensional objects, so they are good when you try to group nearby objects, and they are also self-balanced trees. And bitmaps: bitmaps are very good for optimizing logical operations, and this is what we chose for gitbase. Bitmaps can be very expensive when you sometimes have to rebuild the whole bitmap, so they are a good choice for, for example, read-only systems. Also, bitmaps are at their best when you have a lot of data but not many possible values. Another good thing about bitmaps is that, because they are optimized for logical operations, you can create one bitmap index per column to satisfy all queries. If you don't have bitmap indexes, the formula on the slide gives the number of indexes which you have to create to satisfy all the possible queries, where n is the number of columns. So, for example, with B-trees you have to deal with this.

I mentioned a little bit about the storage format, how the data is stored. The format is called Roaring, and Pilosa, which we use, implements it. Apart from all the offsets and headers which you have to set in a file, the most important part is how you store the data on disk. First, when you have the row ID and the column ID of a new bit, you have to store it to local storage. You calculate the position and call the add function. What the add function does is basically: the first byte is the operation, so in this case it's "add"; the next eight bytes are the value, in little-endian format; and the last four bytes are just a checksum of the operation and your value. And you write it to disk. This is basically how you save data, and Pilosa implements it.
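The record layout just described (one opcode byte, an eight-byte little-endian value, a four-byte checksum over both) can be sketched in Go like this. The opcode value and the choice of CRC-32 as the checksum are my assumptions for illustration, not Pilosa's actual on-disk format:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// opAdd is a hypothetical opcode for the "add" operation.
const opAdd = byte(1)

// encodeAdd builds a 13-byte record: opcode, little-endian value, checksum.
func encodeAdd(value uint64) []byte {
	buf := make([]byte, 13)
	buf[0] = opAdd
	binary.LittleEndian.PutUint64(buf[1:9], value)
	// Checksum covers the opcode and the value.
	sum := crc32.ChecksumIEEE(buf[:9])
	binary.LittleEndian.PutUint32(buf[9:13], sum)
	return buf
}

func main() {
	rec := encodeAdd(42)
	fmt.Printf("% x\n", rec)
}
```

Appending such fixed-size, checksummed records is cheap, which fits the write pattern of a bitmap store that is rebuilt rather than updated in place.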
By default, it's a typical client-server setup, but we hacked it a little bit, which I will talk about later. By default, the communication goes over RPC or a REST API. The data model in Pilosa looks like this: you have a huge boolean matrix, where an index contains fields. The rows in this matrix are local per field, but the columns are global; columns are shared across all the fields. So you can say "give me column number five in the index" and you'll get all the ones and zeros from all the fields, but when you say "row number five" you have to specify for which field. So rows are local per field and columns are global. And the nice thing about the bitmaps is that you can match them. Well, you can match rows, but what's most important is that you cannot match rows across indexes; you have to match rows inside the same index.

So how did we integrate Pilosa into gitbase? The first approach was the simplest one: take Pilosa and run it in a container as a server. One Pilosa index mapped to one database index, and one field was equal to one expression, i.e., a column. It was super simple. We had to add the mapping, because a bitmap is just a matrix, so it has rows and columns, but you have to map values to rows, and columns to locations, the number of the record in your table. So it was the simplest approach. But after some time we decided we didn't want to maintain yet another server and troubleshoot all the synchronization problems and so on. So we decided maybe it would be good to hack Pilosa a little bit and embed it into gitbase, so it lets us communicate with Pilosa like we communicate with libraries.
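The matrix model described above, with rows local per field and columns shared across all fields of an index, can be sketched with a toy in-memory structure. All names here are illustrative; this is not Pilosa's actual API:

```go
package main

import "fmt"

// Field holds rows that are local to it: rowID -> set of global column IDs.
type Field struct {
	rows map[uint64]map[uint64]bool
}

// Index groups fields; its column numbering is shared by every field.
type Index struct {
	fields map[string]*Field
}

func NewIndex() *Index { return &Index{fields: map[string]*Field{}} }

// Field returns the named field, creating it on first use.
func (ix *Index) Field(name string) *Field {
	f, ok := ix.fields[name]
	if !ok {
		f = &Field{rows: map[uint64]map[uint64]bool{}}
		ix.fields[name] = f
	}
	return f
}

// SetBit sets a one at (row, col); the row is local to this field.
func (f *Field) SetBit(row, col uint64) {
	if f.rows[row] == nil {
		f.rows[row] = map[uint64]bool{}
	}
	f.rows[row][col] = true
}

// Column is global: it reports, for every field, which rows have a bit
// set in that column.
func (ix *Index) Column(col uint64) map[string][]uint64 {
	out := map[string][]uint64{}
	for name, f := range ix.fields {
		for row, cols := range f.rows {
			if cols[col] {
				out[name] = append(out[name], row)
			}
		}
	}
	return out
}

func main() {
	ix := NewIndex()
	ix.Field("a").SetBit(2, 5)
	ix.Field("b").SetBit(4, 5)
	// Column 5 is visible from both fields; row IDs only make sense per field.
	fmt.Println(ix.Column(5))
}
```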
So this is what we did. We implemented yet another index driver; at the time we called it pilosalib. We got rid of the server part, extracted the API, and basically we tried to prototype some stuff. First of all, we had to deal with all the storage objects we had to open and close: an index has fields, and fields contain views, fragments, a cache, and so on. First we found the holder, which seemed like a nice thing, because the holder lets you open the whole hierarchy just by calling Open and Close, and it does everything for you. It sounded promising at the time, but it turned out that we had so many problems and random crashes in this concurrent world with the holder. So we took a closer look at how the holder was implemented. Apart from all the handlers and fields, it contains two channels, opened and closing. These two channels are only used in the two functions Open and Close: the Open function closes the opened channel, and the Close function closes the closing channel. That means that in a concurrent world, it is possible that you try to open an already open holder or close an already closed one, and instead of doing nothing or returning an error, it panics. And please, I really like Pilosa, don't get me wrong, but if you are doing something like this, please don't. It's bad. Nobody should do it.

So how did we solve this problem? We decided to get rid of the holder; it's the easiest way. So now we really have to manually open the index and fields, but the Open and Close functions for fields and indexes don't have this problem: when you open one twice, nothing really bad happens. So we got rid of the holder, and we made some improvements. One of them is that we now have one Pilosa index per table, so multiple database indexes are contained in one Pilosa index, and one field corresponds to one index expression (which can be a column name) and one partition.
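The crash pattern described above is general Go behavior: closing an already-closed channel panics. A minimal sketch of the safe alternative, guarding the close with sync.Once so Close is idempotent (this is my illustration of the fix, not Pilosa's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// Holder signals shutdown by closing a channel, like the pattern in the talk.
type Holder struct {
	closing   chan struct{}
	closeOnce sync.Once
}

func NewHolder() *Holder { return &Holder{closing: make(chan struct{})} }

// Close may be called any number of times from any goroutine: sync.Once
// guarantees the channel is closed exactly once, so no double-close panic.
func (h *Holder) Close() {
	h.closeOnce.Do(func() { close(h.closing) })
}

func main() {
	h := NewHolder()
	h.Close()
	h.Close() // without sync.Once, this second close would panic
	fmt.Println("closed twice without panic")
}
```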
Partitions are something we introduced recently. We also improved the mapping a little bit: we still keep the mapping in BoltDB, but we use the bucket sequence to get the next available ID, which improved performance a little. We also have to encode and decode data, so we use the gob package from the standard library. And the good thing about having one Pilosa index per table is that now all your database indexes are inside one Pilosa index, so you can match them.

So how does this merging work? Let's start from scratch. You have to create the index. First you have this huge blank matrix, and let's assume that we want to create the index on a column A and on B. You create the index, then you iterate through the expressions A and B, and for each of them you create a field. Let's assume for simplicity that there is just one partition, so it's still the same index ID. So you create two fields, and you have this huge matrix full of zeros. Then you have to populate it with data. You iterate through the columns, and let's say every column is a record in your table, so it looks more or less like a column-oriented database. You iterate through the records, and for each record you set the bit to one in its column. But you have to figure out in which field and in which row. So you go to the field which is mapped to your column and try to get the row ID for your value. If the value already exists, the getRowID function will return the row ID which already exists; if not, it will generate the next ID for you. After you get that, you set the bit in the field, and you put the location in your mapping. This is how you save data, and then you have some ones in your matrix.
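The populate step just described can be sketched like this: for each record, look up (or create) the row ID for its value, then set the bit at that row and the record's column. The structures are toy in-memory maps; gitbase keeps the real mapping in BoltDB, and the names are illustrative:

```go
package main

import "fmt"

// field is a toy per-expression field with its own local row numbering.
type field struct {
	next   uint64
	rowIDs map[string]uint64          // value -> row ID
	bits   map[uint64]map[uint64]bool // row ID -> columns with a one
}

func newField() *field {
	return &field{rowIDs: map[string]uint64{}, bits: map[uint64]map[uint64]bool{}}
}

// getRowID returns the existing row for a value, or assigns the next free one.
func (f *field) getRowID(value string) uint64 {
	if id, ok := f.rowIDs[value]; ok {
		return id
	}
	id := f.next
	f.next++
	f.rowIDs[value] = id
	return id
}

func (f *field) setBit(row, col uint64) {
	if f.bits[row] == nil {
		f.bits[row] = map[uint64]bool{}
	}
	f.bits[row][col] = true
}

func main() {
	fa := newField()
	// Column i holds record i's value for expression A.
	records := []string{"x", "y", "x"}
	for col, v := range records {
		fa.setBit(fa.getRowID(v), uint64(col))
	}
	// "x" appears in records 0 and 2, so its row has bits in columns 0 and 2.
	fmt.Println(fa.bits[fa.getRowID("x")])
}
```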
So now you want to query. When you query, you apply some filter, like: you want to get all data where A is two and B is four. So once again you iterate through the expressions, go through the fields, and then you have to find the row IDs where the value two and the value four are. For simplicity, let's say that value two is assigned to row two in A, and value four is assigned to row four in B. At the end, when you have collected all the rows which are assigned to your values, because it's an AND, you have to do an intersection. So you intersect the rows, and as a result you get the columns where a one was in the same position in both, because it was an intersection. So you get columns three and five as a result, and you go to the mapping once again to get the locations of your records in your tables which are assigned to columns three and five. This is how the magic works.

But what's worth mentioning, in my opinion, what's the good thing about bitmap indexes, is this matching. With bitmap indexes, it doesn't make sense to create indexes on multiple columns; it's better to create one index per column. The reason is that if I create the index on columns A and B, on two columns, then the index will be used only if I do intersections, and the intersection will be an internal operation. But if I create the indexes independently, one index on A and a second index on B, then I will get the same result; the intersection will not be an internal operation but an external operation done later, and it will not have any bad impact on performance. And the good thing is that if you replace AND by OR, it will do exactly the same thing: instead of an intersection it will just call a union, and that's it. So creating the index on multiple columns doesn't make much sense.
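The AND query walked through above (A = 2 in row two, B = 4 in row four, then intersect the two rows) can be sketched with plain column sets. A real bitmap engine would do this with bitwise AND over compressed bitmaps; this sketch only shows the logic:

```go
package main

import (
	"fmt"
	"sort"
)

// intersect returns the columns present in both row bitmaps, sorted.
func intersect(a, b map[uint64]bool) []uint64 {
	var out []uint64
	for col := range a {
		if b[col] {
			out = append(out, col)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	rowA2 := map[uint64]bool{1: true, 3: true, 5: true} // columns where A = 2
	rowB4 := map[uint64]bool{3: true, 5: true, 7: true} // columns where B = 4
	// The AND filter becomes an intersection; OR would be a union instead.
	fmt.Println(intersect(rowA2, rowB4))
}
```

The result is columns 3 and 5, the records that are then resolved back to table locations through the mapping.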
An index on multiple columns will only satisfy AND filters anyway.

So how was it done? I mentioned that we have the pilosa and pilosalib drivers, but it's also possible to implement your own index driver, because everything we did, we did through interfaces. First of all, you have an index driver interface, which just has basic functions that let you create, load, save, and delete an index. These functions return the index interface, which basically has two main functions, Has and Get, which can answer the questions: do I have this value, and, if I have the value, how do I get it? The Get function returns an IndexLookup, which is more or less a kind of wrapper over a value iterator. We also have some index interfaces which let us query by ranges, like the ascend index, so you can ask questions like greater than, less than, etc. But all of them return an IndexLookup. And the IndexLookup, as I mentioned, is a kind of wrapper over an iterator: it just gives you the iterator based on the partition. But the IndexLookup may implement the set operations interface, and if it implements the set operations interface, then you can match lookups by intersection, union, and difference; as a result, you get another lookup and can get the values. So this is how merging works.

Last but not least is the mapping. The mapping is important because with bitmap indexes you have just rows and columns, so you have to map values to row IDs and records to columns. So the mapping works like this. You want to get the row ID for your value. Rows are local per field, so you have to specify in which field. You pass the value; first we encode the value with the gob encoder, and then we get the bucket in BoltDB which is assigned to the field, and we try to get the value from the bucket. If it exists, then we can immediately return the row ID. If it doesn't exist,
we take the next available row ID by calling NextSequence, we once again encode it, to little-endian format this time, we put it in the BoltDB bucket, and of course return it from the function. This is basically how it works. A similar function gets the location from the column ID; instead of encoding, it decodes the column IDs to get the location. And this is basically how it works.

So maybe I can demo some stuff if I have time; there are like five minutes left. Or maybe it's a question of whether you have any questions, or I can demo how the magic works. Because what can go wrong with a live demo with hundreds of people in front of you and live streaming? Nothing bad can happen. So, demo or questions? Okay.

Okay, I think I already compiled it, so let me start the gitbase server. I have a small repo here in repos, and the indexes will go here too. Okay. I put in some debug functions so we can see how it works. Basically, I'm still debugging with print lines; I know how many of you like debugging with print lines. We are passing almost the first quarter of the 21st century, but we are still debugging with print lines. I really like that, and I hope Derek will tell me about improvements on that. But so far, print lines have also saved my day.
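Before the demo, the value-to-row mapping just described might look like this in Go. An in-memory map stands in for the per-field BoltDB bucket, and nextSequence mimics Bolt's bucket sequence; the structure and names are my illustration, not gitbase's actual code:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/gob"
	"fmt"
)

// mapping resolves values to per-field row IDs, as described in the talk.
type mapping struct {
	seq    uint64
	bucket map[string][]byte // gob(value) -> little-endian row ID
}

func newMapping() *mapping { return &mapping{bucket: map[string][]byte{}} }

// nextSequence mimics a BoltDB bucket's NextSequence: a monotonic counter.
func (m *mapping) nextSequence() uint64 { m.seq++; return m.seq }

// rowID gob-encodes the value, returns the stored row ID if the key exists,
// and otherwise assigns the next sequence number and stores it little-endian.
func (m *mapping) rowID(value interface{}) (uint64, error) {
	var key bytes.Buffer
	if err := gob.NewEncoder(&key).Encode(value); err != nil {
		return 0, err
	}
	if b, ok := m.bucket[key.String()]; ok {
		return binary.LittleEndian.Uint64(b), nil // value already has a row
	}
	id := m.nextSequence()
	b := make([]byte, 8)
	binary.LittleEndian.PutUint64(b, id)
	m.bucket[key.String()] = b
	return id, nil
}

func main() {
	m := newMapping()
	a, _ := m.rowID("foo")
	b, _ := m.rowID("bar")
	c, _ := m.rowID("foo") // same value, so the same row comes back
	fmt.Println(a, b, c)
}
```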
So, okay. Let me open this. First of all, I'd like to show you one of the tables that I will test against, which is called commit_files. It contains columns like the repository ID, commit hash, and file path, and you have just one repo, called basic, and it has tons of stuff.

So now I will create the indexes. Okay, three indexes: AB on two columns, A on one column, and B on the second column. Okay, this is how this debugging works. Basically, it creates the fields in your index and populates them with data, and these numbers: every row has these numbers, and they say in which columns the one was set. I increased the font so much that I don't see everything I want to show. Here, for example, we have a field which is assigned to the file path in index AB, but if you go to the field which is also file path, but in index A or B, it will have the same data. So basically, having indexes on two columns and also on just one column duplicates data. So it's better, instead of duplicating, to create one index per column.

So let's say we want to use this SELECT to get everything where the commit hash is this and the file path is that. We call EXPLAIN to see how the query will work, and it says it will use index AB. If I replace AND by OR, it will use just indexes A and B. The reason why I said that it's better to have independent indexes: I drop the index on AB, and once again call EXPLAIN, and this time two indexes will be used instead of one, because I dropped AB. So now we can use indexes A and B independently, and it will also work with OR.

Okay, the last thing: let me run the query. The query gave a result, and this is how it worked: we got two rows from the bitmaps, and we intersected them.
So there is only one position where a one was set in both bitmaps, so we get just that one column, and as a result it should be just one record. And this is this one.

I'm done; I'm running out of time. I'm sorry. That's it. Thank you, thank you.