Okay. Thank you. I'm going to assume that almost everybody in the room is familiar with Lucene, or at least knows a bit about the Lucene ecosystem, so you probably know Lucene, Solr, and Elasticsearch. Within this family, Tantivy is closest to Lucene; it is very, very close to Lucene in concept. By which I mean that if you want a search engine built with Tantivy, you are going to write a program that uses Tantivy as a library: you assemble the building blocks Tantivy gives you to actually get a search engine. It is not an off-the-shelf search server. It is written in Rust. I don't know if everyone is familiar with Rust; if you are, can you raise your hands? Oh, wow. It is really getting popular, I guess. For those who are not familiar with Rust, it is a programming language that is still quite young. To give you an idea of what it is, and Rust people will forgive me for this, you can consider it a safer C++: the code that you write is basically memory safe, and the performance is very similar. If you compare Rust code with equivalent C++ code, you will get essentially the same performance. So you get memory safety, you get a great ecosystem, you get great tooling, and the error messages are better. It is also simpler to learn than C++. Rust also lets you target platforms that C++ does not reach easily; Tantivy even compiles to WebAssembly, for instance, though that is not something I will talk about too much today.
It is very important to me to keep the codebase quite small: around 40,000 lines of code, even if you include all of the different crates. It is fast; I put a link to a benchmark here. I think we won't have time to go through the benchmark, but it is interesting: it shows, for example, what the per-query performance of Tantivy looks like for different kinds of queries. Right now you can compare different versions of Tantivy against Lucene. Usually Tantivy wins, but what is really interesting is that you can identify that, for a particular query, Lucene is actually faster. What is going on there? That is a great way for me to find optimizations that exist in Lucene and that Tantivy does not have yet, so I know what the next step for Tantivy could be. And it works the other way around: for some queries Tantivy is abnormally faster than Lucene, and maybe that is a useful hint for Lucene too. I suspect that everybody in the room is here for search, so I am not going to talk too much about Rust. From now on, we are going to build a search engine together, and as a matter of fact, what I describe largely applies to Lucene as well: all of the techniques that we are going to talk about are also valid for Lucene, since Tantivy follows the same design. So hopefully, at the end of the talk, you will understand Lucene a little better too. So yes, let's build a search engine together. Our objective is a search engine that scales; I am trying really hard not to use that word, because we are still talking about a search engine that sits on one machine. But we want to be able to search and index terabytes of data on a machine that only has gigabytes of RAM.
So we want to make sure that all of the data structures we are using are solid and do not require a gigantic amount of RAM. Hopefully, we can do that. I actually tested it. Right now we are at Tantivy 0.8, and I ran this test with Tantivy 0.6. There is a public dataset called Common Crawl, which is a kind of snapshot of a chunk of the web, and it is a pretty large one, a decent one: around 3 billion pages. I took this snapshot and downloaded it; to be accurate, I downloaded half of it, using my own personal internet connection at home. It took around one month, full time. I live in Japan, so we have optical fiber and such, so that is pretty good in terms of bandwidth. And I wanted to test whether I could index it and see what happens, because it is a very good stress test: at my house in Japan, the electrical network is barely adequate for my usage. I am French, so I make coffee and toast in the morning, and if I switched both on at the same time, it would trip the breaker at home. So it is a great test of surviving power failures in the middle of indexing. I then put a small Python library on top of the index, whose goal is to extract completions: you give it a query with a placeholder, so for instance you would say "French people are ...", and you ask for all of the adjectives that follow that phrase. The two libraries together, Tantivy and this small NLP library, made it possible to look for the most common adjectives following "French people are". I think the result is pretty accurate. Let's move on: how do we build this?
Okay, so usually that means that you are going to want to build an inverted index, and an inverted index is simply something that associates, to each term, a sorted list of the documents containing it. This simple data structure is sufficient to compute any kind of Boolean query. I am going to explain rapidly how intersection works. Assume we are trying to find all of the documents matching the query "Ronay AND Magritte"; we are going to have these two posting lists. We take the first one. The first document of the first posting list is 2, and we seek into the second posting list for document 2. By seeking, I mean we do exactly as if we were scanning through the second posting list, and we stop the moment we find a document that is greater than or equal to our target. If it is equal, then 2 is in the second list; if it is greater, it is not. Here it is not: we found document 5. So now we do the symmetric thing and seek for 5 in the first posting list. Here we go: 5 was found, so it goes into our output. We advance in the first posting list, and now we are seeking for 7 in the second list, and so on and so forth, until we reach the end of one of the posting lists. The only takeaway here is that if you want fast intersection, you need a fast way to seek through a posting list. For union, we could use the same kind of strategy: basically a merge. Everybody has a different name for this algorithm, but we could merge those two posting lists and get the union. That is actually inefficient, though, so I am using the same trick as Lucene does.
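The seek-based intersection described above can be sketched in a few lines of Rust. This is a minimal illustration with my own function names, not Tantivy's actual code: the real implementation seeks using skip data over compressed blocks, while here `seek` is a plain linear scan over an in-memory slice.

```rust
// Advance `i` until postings[i] >= target (the "seek" from the talk).
// A real index would use skip data instead of a linear scan.
fn seek(postings: &[u32], mut i: usize, target: u32) -> usize {
    while i < postings.len() && postings[i] < target {
        i += 1;
    }
    i
}

// Leapfrog intersection of two sorted posting lists of doc ids.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            out.push(a[i]); // doc appears in both lists
            i += 1;
            j += 1;
        } else if a[i] < b[j] {
            i = seek(a, i, b[j]); // seek for b[j] in a
        } else {
            j = seek(b, j, a[i]); // seek for a[i] in b
        }
    }
    out
}

fn main() {
    // Mirrors the walk-through: 2 is absent from list b, 5 and 7 match.
    let a = [2, 5, 7, 11];
    let b = [5, 7, 8, 12];
    assert_eq!(intersect(&a, &b), vec![5, 7]);
    println!("{:?}", intersect(&a, &b));
}
```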
The idea is that you prepare in advance some kind of bit set. Just for the sake of this slide, the bit set is eight bits long, but in Tantivy we are using 4,096 bits, so it is much larger. What we do is take the first posting list and append all of the docs that fall between 2 and 2 plus this horizon, so 8 in this case, but 4,096 in Tantivy. We append those documents to the bit set: the first bit means 2, the second bit means 3, and so on and so forth. We do that for the first posting list, then a second time for the second posting list. Here is the bit set we get, and now we can flush it: we transform the bit set back into the documents it represents. That can be done quite fast, because your CPU usually has an instruction to pop the least significant bit out of a 64-bit word, so you just pop bits one after another. All right, so that is how an inverted index is used to compute Boolean queries. Now, the question is how we represent that on disk. Tantivy relies on mmap for all of its I/O, and when you start Tantivy, the only thing it does is mmap all of the files of the index, so it starts really, really fast and is ready to go. We do not have to load any data structure into anonymous memory, or keep some kind of hash map in anonymous memory; everything goes through mmap, so the startup time is very, very nice. But it means that we need the data on disk to be usable as-is. For the Java people: anonymous memory is more or less like the heap.
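The bit-set union with the "pop the least significant bit" flush described above can be sketched like this. A hedged illustration: I shrink the horizon to 64 documents so a single `u64` word serves as the bit set (Tantivy uses 4,096-bit blocks), and the names are mine.

```rust
// Union of several posting lists over one window of `HORIZON` doc ids
// starting at `base`, via a bit set (here a single 64-bit word).
const HORIZON: u32 = 64; // Tantivy uses 4096

fn union_block(base: u32, lists: &[&[u32]]) -> Vec<u32> {
    let mut bits: u64 = 0;
    // Phase 1: append every doc in the window into the bit set.
    for list in lists {
        for &doc in list.iter() {
            if doc >= base && doc < base + HORIZON {
                bits |= 1u64 << (doc - base); // bit k means doc base+k
            }
        }
    }
    // Phase 2: flush — repeatedly pop the least significant set bit.
    let mut out = Vec::new();
    while bits != 0 {
        out.push(base + bits.trailing_zeros()); // index of lowest set bit
        bits &= bits - 1; // clear the lowest set bit
    }
    out
}

fn main() {
    // Two posting lists restricted to the window starting at doc 2.
    let docs = union_block(2, &[&[2, 5, 7], &[3, 5, 8]]);
    assert_eq!(docs, vec![2, 3, 5, 7, 8]); // sorted, deduplicated union
    println!("{:?}", docs);
}
```

Because the flush walks set bits from lowest to highest, the output comes out sorted and deduplicated for free, which is exactly what the next stage of query evaluation wants.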
So the first data structure that we are going to need in our index is the term dictionary. Our term dictionary will be broken down into two parts: one part associates each term, that is, a sequence of bytes, with some kind of term ID, and another data structure associates the term ID with some kind of term info structure, which basically holds pointers into files, like the start of the posting list. I am not going to talk about the second data structure because it is boring, but let's talk about the first one: how do we go from a term, a sequence of bytes, to a term ID? If you are building your own search engine, you have two broad families of solutions. One is hash-based; maybe you go fancy and use perfect hashing or something like that. It is a very nice solution in the sense that you get very fast lookups, and especially if you do not have much RAM and everything sits on your hard disk, you will require very little I/O: your hash map can send you directly to the right place on disk. The other solution is tree-based, like a trie. This uses slightly more CPU and a lot more random I/O; depending on the layout of your data, you tend to jump from one node of your trie to another, which can mean a lot of seeks when the data is on disk. But it has a lot of benefits. One benefit is that you can iterate through a range of keys, so naturally you will probably use term ordinals. By term ordinals I mean: if "arabica" is the first word in your dictionary when sorted lexicographically, it gets term ID 0; the second word, maybe something starting with "ab", gets the next term ordinal; and so on. With term ordinals, your term IDs are sorted exactly the same as your terms, which is a very nice property. But more importantly, you can compute the intersection of your trie, or trie-like structure, with a DFA. Are you guys familiar with what a DFA is? Okay, I will still explain it a little. I actually made a mistake in the DFA on this slide, but anyway. DFA stands for deterministic finite automaton; it is one way to implement regular expressions. You can transform any regular expression into an automaton like this one, and the way it works is: once it is in this shape, matching a string against the automaton means consuming every character of your string. You start in the start state, the white one over there. I was hoping I could point at things, but the pointer doesn't work, so bear with me. Let's say you are trying to match "carousel". Your first character is "c"; you look at the outbound arrows emitted by the state you are in; one is labeled "c", so after consuming the "c" I am in that state, the yellow one. The second letter of "carousel" is "a"; I follow that arrow and end up in the blue state. I keep consuming characters, following the arrow with a star for the middle of the word, and the last "l" brings me to the final state. It is really nice because I advance one character at a time, and at the end I just look at which state I am in to say whether the string matched or not; the matching states are marked with a double circle here. So our trie looks like this, and it is actually possible to match this DFA over the trie very efficiently. I am going to show how it is done. Here, if we consume the "c", we end up in the yellow state, just like for "carousel". If I consume the "a", I end up in the blue state. If I get a "b", I end up in this grey state, which happens to be a sink. A sink is a dead end: we will never match once we reach this state, so we do not need to look at any following characters. And that is great, because that means that all of this
subtree does not matter anymore. For the purpose of fitting on the slide, this trie is very simple, but you can imagine a gigantic subtree hanging off this node, and we just cut it away; that makes things much, much faster. Now we want to match "m": we do not have to recompute the state reached for "a"; we just look at the state we stored for "a", which is blue, and we see that we end up in the sink state again, and so on and so forth. So we are just painting the nodes of our trie with the states of the DFA to work out which terms in our dictionary match our deterministic finite automaton. That means I can enumerate all of the terms that match a regular expression, but also, very rapidly, all of the terms within an edit distance of one or two. So if you ask me, "please stream me all of the terms that are one typo away from what I typed," I can do that very, very efficiently. Tantivy uses a tree-based solution: it actually uses a finite state transducer for exactly those two use cases, and I did not have to code anything; the Rust ecosystem is nice enough that somebody already wrote a very nice implementation of finite state transducers. I will not explain how it works, because I did not code it, but one thing you might want to know is that it is pretty much like a trie, except that nodes can also share suffixes. A trie will never have an edge that loops back like this; because an FST can share suffixes, you end up with something slightly more compact than a trie, and that is always a very nice feature when you are short on RAM. Now let's talk a little bit about how we encode posting lists. Posting lists are a lot of integers; they take up a lot of the size of your index, and you are going to want to compress them. Integer compression is a field that has been well studied, and there are a lot of solutions. I put a chart over there, but basically my point is that you have a tradeoff between something
that compresses a lot and something that is very fast; it is always like that, you have to choose. Another thing I need to point out is that basically all of the best algorithms over there use SIMD instructions. SIMD instructions are instructions on your CPU that make it possible to process 4 or 8 integers at a time, and that is something Tantivy uses a lot. Tantivy actually uses this one over there: a compression scheme designed by Daniel Lemire. I recommend reading his blog; it is always very interesting if you care about search or data structures. I used to depend on his library, which is in C, and I removed it because I prefer to be entirely in Rust, so I re-implemented it entirely in Rust. The gist of it, like many of these schemes, is this: your posting lists are increasing, so it starts by doing something called delta coding. Instead of compressing your integers directly, you compress the difference between consecutive integers, and you get something much smaller. You take a pack of those; in my case, blocks are 128 integers long. Out of those 128 deltas, you look at the largest one; in this example, say, it can be represented using five bits. What you do then is represent all of your deltas over five bits and just concatenate these five-bit values. That is called bit packing. That was the scalar version of bit packing; what I actually do is use SIMD instructions, processing four integers at a time, and the algorithm really looks like that trick we used at school to avoid writing lines when we were punished by the teacher: it is exactly the same algorithm as the scalar solution, you just use SIMD instructions in place of the scalar instructions.
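The scalar version of delta coding plus bit packing described above can be sketched as follows. This is a toy illustration under my own names, with a tiny block instead of 128 integers, and none of the SIMD tricks the talk goes on to describe.

```rust
// Delta-code an increasing posting list: store differences, not values.
fn delta_encode(docs: &[u32]) -> Vec<u32> {
    let mut prev = 0u32;
    docs.iter()
        .map(|&d| {
            let delta = d - prev; // lists are increasing, so this is >= 0
            prev = d;
            delta
        })
        .collect()
}

// Number of bits needed for the largest delta in the block.
fn num_bits(deltas: &[u32]) -> u32 {
    deltas.iter().map(|&d| 32 - d.leading_zeros()).max().unwrap_or(0)
}

// Bit packing: concatenate each delta over `bits` bits (little-endian).
fn bit_pack(deltas: &[u32], bits: u32) -> Vec<u8> {
    let mut out = Vec::new();
    let (mut acc, mut filled) = (0u64, 0u32);
    for &d in deltas {
        acc |= (d as u64) << filled; // append `bits` bits to the accumulator
        filled += bits;
        while filled >= 8 {
            out.push((acc & 0xff) as u8); // emit a full byte
            acc >>= 8;
            filled -= 8;
        }
    }
    if filled > 0 {
        out.push((acc & 0xff) as u8); // trailing partial byte
    }
    out
}

fn main() {
    let docs = [2, 5, 7, 11, 18];
    let deltas = delta_encode(&docs);
    assert_eq!(deltas, vec![2, 3, 2, 4, 7]);
    let bits = num_bits(&deltas); // largest delta is 7, so 3 bits
    assert_eq!(bits, 3);
    // 5 deltas x 3 bits = 15 bits: 2 bytes instead of the raw 20.
    assert_eq!(bit_pack(&deltas, bits).len(), 2);
}
```

The per-block bit width (here 3) is the piece of metadata the index must store alongside each block so the decoder knows how to unpack it.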
It is very, very simple, and there is an interesting improvement compared to the methods that existed before Daniel Lemire's paper: the algorithm also takes advantage of the fact that, as we are decoding those integers, we reach a point where one register holds the deltas we are decoding, and we can undo the delta coding using SIMD instructions as well, right at the point where the values are already in the registers. It goes really, really fast. To give you an idea, when I say fast, I am talking about four billion integers per second, so we are flirting with the bandwidth of your RAM. I told you that Rust was nice and compiled to very efficient code, very close to C++. I wrote this in Rust, and this is the assembly that was generated. I am not going to tell you that I understand everything in there, but the important point is that all of the long-looking instructions over there, every single one of them, works on four integers at a time, and you will only see scalar instructions maybe here and here. The generated assembly is really as good as possible; it is probably exactly the same as what we had in C++. You were told earlier this morning about BM25 and TF-IDF: those are the default scoring functions of Lucene, and Tantivy ships with BM25 as well. This requires access to the term frequency as we search: when we match a document, we need the number of times a term appears in that document, and it is nice to have good locality for that. So we actually interleave blocks of doc IDs with blocks of term frequencies, so that the frequencies are likely to be in cache at the moment we read them, as we are running the query. We have one block of doc IDs, then one block of term frequencies, and so on and so forth; each block represents 128 documents, so
unfortunately postings are not always an exact multiple of 128 documents, so the last block uses a different compression scheme, which is not interesting. We also need to store the information of how many bits are used to encode each block, and we would like some way to avoid decompressing a block entirely: if we are running an intersection, for instance, maybe the document we seek is not in this block, and in that case we would like to skip it altogether. For that we have another structure on top, which allows us to skip precisely and to decompress only the blocks that might be interesting. Now let's talk about something else. I said at the beginning of the talk that we wanted to index an arbitrarily large amount of data, terabytes of it, and that is a big problem: if we have a limited amount of RAM, how are we going to build these nice data structures? We have another problem too: people create an index and then want to add new documents. We want something dynamic; we want to be able to add documents to an existing index. It happens that the solution is the same for both problems. Just to refine the idea of why it is difficult to add new documents to an index like this: the on-disk data structure I described to you is extremely compact; everything is super compressed and laid out on disk one thing after another. If it were a bookshelf, it would look like this, and if you are standing in front of a bookshelf like that and somebody hands you a book and tells you to shelve it, you are going to suffer: you have to move everything. It is a nightmare. What you want, if you update your bookshelf all the time, is something that looks like that. Generally speaking, there is a trade-off between being dynamic and being compact. There is a nice trick that exists in many databases, called the logarithmic method. The idea is that we are going
to have compact bookshelves like that, but many of them. The way it works is that the user tells Tantivy, or tells Lucene, "I give you a budget of, let's say, 300 megabytes," and those 300 megabytes are used for the dynamic bookshelf. There is one big dynamic bookshelf, and people add books to it until it is full. Once it is full, we transform it into a very compact bookshelf; to drop the metaphor, we serialize our dynamic data structure into the static data structures I described before. This piece of very compact index is called a segment. That is a word you have probably heard around Lucene, and both in Lucene and in Tantivy, a segment is basically an independent index: you could literally pick a segment and copy it into another index, and it would work just the same, as long as the schemas are the same. That is a very interesting property, actually. If we keep doing that, we end up with a lot of compact segments, and maybe that is not optimal: with 1 terabyte of data, that would be a lot of small segments, and at search time we would have to do as many dictionary lookups, which is very inefficient, right? So in the background we also have a process that merges those segments together. You can define your own strategy for merging segments, but Tantivy comes with a default strategy called LogMerge, and the heuristic behind it is simply that, by default, we try to merge 8 segments that have about the same size. This technique is also really nice because it means we can do multi-threaded indexing very, very easily. Tantivy asks you how much memory you give as a budget and how many indexing threads you want to use, and when you add a document, what you are really doing is pushing the document onto a queue; and in the background,
you do not control it, but your given number of indexing threads consume this queue and populate the dynamic structures I was talking about. Once they reach their capacity, they serialize segments, the very compact data structure we use in the end, and some merging thread merges those, all very transparently. Yes, thank you. There is a bunch of pros and cons to this. One problem that is very evident when you use Tantivy, and I think is still true to some extent with Lucene: when you add a document, it is not searchable right away, which is quite puzzling for people who are used to SQL databases. And because we decided to make our segments independent indexes, they do not share any dictionary, and that sometimes makes specific uses of search difficult: you cannot easily merge results from the different segments; there is not one dictionary. I did not talk about deletes and updates, but those are a natural nightmare; it is very complicated. That is pretty much all I have for the cons. The pros: the indexing throughput is actually excellent if you can batch things, if you do not have to commit all of the time, so you can re-index a gigantic amount of data. On my laptop I index English Wikipedia in, I think, 2 or 3 minutes; that is the kind of speed we are talking about, although Wikipedia is a tiny benchmark in search, not a big dataset. And since segments are independent indexes, you could decide to do your indexing on Hadoop and just copy the segment files into the same place, and everything would work just the same; or you could distribute search by dispatching segments to different machines. It is really just copying files; it is very transparent. And because we write files and then never touch them again, we just write very large files that are read-only, so there
is no locking problem or anything: readers just read the files and nobody is touching them, and that simplifies my work a lot. I have a bunch of slides left, but let's stop here so that we have a bit of time for questions. Do you guys have questions? [Audience] What I would say about Lucene is that it is a bit of a corner case for Java: it is basically a library that is difficult to write in Java because there is a lot of low-level work. So, do your project and Rust fit together, or did you find trouble using Rust? [Speaker] So the question is: is Rust a good fit, I guess especially in contrast with Lucene and Java. I would say yes. The main benefits I get from Rust compared to Java: first, I can access SIMD instructions. Second, I have a lot of control; when I write code, I know whether I am going to get static dispatch or dynamic dispatch, something that is extremely hard in Java, where you never know what will happen. Also, things like mmap being a nightmare in Java are simply not a problem for me. I do not know if everyone knows what I am talking about: basically, when you are working with off-heap memory in Java, if you mmap a huge amount of data while very little data sits on your heap, your mapping will be released only when the object holding it is garbage collected, and garbage collection happens only when your heap is actually getting full. I think that is a big problem; some JVMs give you a non-official Java API to unmap things, and some people just use JNI code to do the mmap. I never have that kind of problem with Rust; everything works as if I were working in C or C++. When I started this project, I did not know Rust at all; I started Tantivy to actually learn Rust, which is a bit stupid. I already knew C++ and had done a lot of Java as well, and after around 2 weeks, you do not have to believe me, but I was more productive in Rust
and I felt safer writing code in Rust than I was in both Java and C++. [Audience] Can you explain the name and the logo? [Speaker] So "tantivy" is an English word. I often do that: I like choosing English words that nobody knows as project names. They are actually not bad at all for SEO, and people learn a new word, so everybody is happy. "Tantivy" means "at full gallop", hence the horse, and I drew the horse myself. I think that is one of the best achievements of the project. Yes.