Welcome, everyone. We now have Guillaume and his group from the Sciences Po médialab talking about alternative indexing structures for graphs. Starting from a simple index, they had a requirement for a really scalable and very capable index, and today they want to talk about how to develop and design such a multi-level index. So welcome.

Thank you very much. Hello, everyone. Just before we start: if some slides are not very clear from the end of the room, we have put the slides online, for your convenience. I'm going to pose the problem we tried to solve. First I need to explain what Hyphe is, so I will try to make a one-minute demo, and then I will explain why we had this problem in the first place.

Hyphe is a web crawler. It's about downloading web pages and looking at the network of hyperlinks. It's made for research: researchers will typically download a network of web pages and study it. Let me show you how it looks. Oops, sorry, not a Mac user. OK, perfect. I cannot make a live demo, of course, but you can look at this recorded one; you have the links, and it's easily findable on Google. This morning I tried a corpus about FOSDEM: I crawled the FOSDEM website, and then the websites around it, as long as they contained the word "libre" or "open" or "free". So basically you take https://fosdem.org, the FOSDEM website, declare it as a web entity (I will explain what that is), and just crawl it. It runs on a server at Sciences Po, but the interface is an HTML5 interface. It crawls and crawls, and since I did that this morning, I can show you the network we obtained. Let's put the colors on. Here I have tagged the websites: the ones with "libre" are in purple, the ones with "open" in blue, and the ones with "free" in green. It's not very, very interesting, but at least you get an idea of what we are trying to do.

Can you put me back on there? Yeah. OK, so the main issue here is that we have a huge amount of web pages, but web pages are not very usable on their own, because we have so many of them. What we intuitively reason on as researchers is something like a website. But not always a website, unfortunately. So we have to work on the structure of the URL so that we can have an intuitive notion of a website, which we will call a web entity. Basically, we tokenize the URL into its different parts and reorder it, which is pretty common, into what we call the LRU, a reversed URL if you want. It's not completely reversed: it goes from the most generic part to the most specific part, so only the part before the path is reversed. Actually, LRUs look a bit more complicated than that, but for the sake of simplification, in my slides I will stick to something a little easier to understand.
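To make the LRU idea concrete, here is a minimal sketch of the tokenization in Python. It is only an illustration of the principle: Hyphe's real stems are typed (scheme, host, port, path, and so on), and the separator here is made up for readability.

```python
from urllib.parse import urlparse

def url_to_lru(url):
    """Toy LRU conversion: reorder a URL from its most generic part
    to its most specific part. Only the part before the path (the
    hostname) gets reversed; the path is already generic-to-specific."""
    parts = urlparse(url)
    stems = [parts.scheme]                      # e.g. 'http'
    stems += reversed(parts.netloc.split('.'))  # en.wikipedia.org -> org, wikipedia, en
    stems += [p for p in parts.path.split('/') if p]
    return '|'.join(stems)

print(url_to_lru('http://en.wikipedia.org/wiki/Bird'))
# -> http|org|wikipedia|en|wiki|Bird
```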
If you do that, LRUs have a tree structure. Here we have different pages on Wikipedia, three articles; and this one is what you get if you click on "history" on the article about, I believe, cats. It doesn't have quite the same structure, but at least the first part is shared between the different URLs. And then you have the "/wiki/article" side, if you want, which is where we flag web entities. If you put a flag here, you have the Wikipedia entity. If you put it after "en", intuitively en.wikipedia.org, you have the English Wikipedia. If you put it on "Bird", you have just the bird page, which is then the entity that is relevant to the analysis. And you can, of course, have multiple flags: doing that, you get Wikipedia and also different articles as different entities.

Let's flesh that out with some typical users. Our first user is Audrey. She wants to study the web in the sense of actors: she sees websites connecting to each other. What she would have in this case is different domains, and possibly Facebook profiles. In another use case, Bernard wants to study, er, animals as represented on Wikipedia. In this situation it would be about different Wikipedia articles as different entities, and we can have a sort of meta-flag spawning flags at the next level, so that all Wikipedia article pages become different entities. And finally, an even more common use case is someone who wants to study actors and documents at the same time: some of the entities are documents and other entities are actors. In this situation, some websites are considered actors, but this researcher also wants to look into some articles of The Guardian and pick them as relevant. She's studying migrations, so some articles about migrations are relevant; the others are not, and should just be considered part of the rest of The Guardian website. If I put colors on it: with a flag, you declare sub-web-entities, if you want, but they are considered different entities from the rest. So you have this differentiation of entities. This is necessary for the kind of analysis we want to do in the social sciences, and it brings a lot of complexity to the way we can structure a memory structure.

So here is what we need. We have to handle tens of millions of LRUs: that's the kind of volume we want to address. Hundreds of millions of links between web pages. We want to edit the web entities afterwards without re-indexing every time, because users want to change their entities when they don't agree with what the software does by itself; in a sense, researchers always want to tweak their results. We want to be able to get all the pages in a web entity, to know in which web entity a given page is, and to get the graph of web entities. Because this is the purpose: we want to know, at each point, the graph of web entities, even as we move the flags and redefine the boundaries of the web entities. So it's a tree, because of the LRU structure, and it's a graph, because of all the links between the pages that we use to infer links between the entities. How do we implement that? OK, I'm going to give the mic to Benjamin.

So I'm going to show you the first implementation we tried, which was started at the beginning of the project, seven years ago. It simply uses Lucene, as you may know it. We want a tree, and Lucene allows you to create an index of elements that are searchable by prefix. So when you have a tree and you just want to collect your pages, the tree of URLs is just simple prefixes that you can query. You get a graph by indexing pairs of pages, (source, target): then you can query all links to a target by querying the prefix, or all links from a source.
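The prefix trick can be sketched outside Lucene: here is a toy version using a sorted list and binary search, just to illustrate the principle. The LRU strings and separator are the simplified ones from before, not Hyphe's actual format.

```python
from bisect import bisect_left

# A tiny 'index' of pages, kept sorted so that prefix queries are cheap.
lrus = sorted([
    'http|org|fosdem|2017',
    'http|org|wikipedia|en|wiki|Bird',
    'http|org|wikipedia|en|wiki|Cat',
    'http|org|wikipedia|fr|wiki|Chat',
])

def pages_with_prefix(prefix):
    """All pages inside the web entity flagged at `prefix`: jump to the
    first candidate by binary search, then scan while the prefix holds."""
    i = bisect_left(lrus, prefix)
    while i < len(lrus) and lrus[i].startswith(prefix):
        yield lrus[i]
        i += 1

# Links are indexed the same way, as (source, target) pairs, so
# "all links from this source" is again just a prefix query.
print(list(pages_with_prefix('http|org|wikipedia|en|')))
```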
But the problem is not only that web entities are dynamic, as Mathieu explained, because users want to put flags at different places at different moments. On top of that, links have to be aggregated: we have links between pages, and the links between two web entities are all the links between the prefixes of those web entities, aggregated together. So, as Mathieu said, we cannot just store them; we have to recompute them regularly, whenever things change. Also, do you remember Bernard? He has web entities here, but also here. So there are web entities that are sub-web-entities: their prefixes are included in the prefixes of other web entities. When you query, you find the same prefixes embedded in each other. This creates another limit: querying with Lucene becomes a lot more complex and a lot slower, because as soon as you have sub-web-entities and sub-prefixes, you need to add NOT clauses to your queries to say "I want to follow this prefix, but not those below it" (the sketch after this section makes this concrete). So the queries get more complex and slower.

As workarounds, we started caching a lot of things: caching in RAM, but also caching in Lucene itself, by adding an intermediate level to the index, links between pages and web entities. We would query a first time, store those intermediate objects temporarily, and then query again over those to get the links between source and target web entities. The result: it worked for the first four years of the project, but as soon as we had huge corpuses, it just got slow. Indexing was taking much more time than actually crawling. The crawlers were finding many pages, and then when we wanted to index, everything would stall, and sometimes we would just tell the researchers: well, start the crawls, come back in two days, and then you can work on your corpus.

This is why we decided to go to Copenhagen for a week, where Mathieu was spending some time as a visiting researcher. We went to see him and worked together for one week on this problem. Four brains, at a lab called TANTLab in Copenhagen, working simultaneously on two concurrent prototypes to try to solve it: one using Neo4j, and one trying to rebuild another structure from scratch in Java, something more tree-shaped, because we had this feeling that the data we were manipulating is really trees, so we should probably store things as trees. That was one week of intense work, all of us sitting together amid that clean Danish furniture. Here are all the commits made on the two prototypes; I think the top one is Neo4j, while the bottom one is Java, and you can see at which hours we worked. And this is a visualization of all the beers we had, multiple times, because in Denmark, maybe you know, there is this brand called Mikkeller, and you have to try it.
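To make the sub-entity problem concrete: the entity owning a page is the one with the longest matching prefix, which is what all those stacked AND NOT clauses were emulating in Lucene. A minimal sketch, with hypothetical entity names and the simplified LRU format from before:

```python
entities = {
    'http|org|wikipedia|': 'Wikipedia',
    'http|org|wikipedia|en|wiki|Bird': 'Bird article',  # a sub-web-entity
}

def owning_entity(lru):
    """A page belongs to the entity whose prefix matches and is the
    longest: sub-entities must win over the entities containing them.
    With plain prefix queries, every sub-prefix becomes a NOT clause."""
    best = None
    for prefix, name in entities.items():
        if lru.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, name)
    return best[1] if best else None

print(owning_entity('http|org|wikipedia|en|wiki|Bird|history'))  # Bird article
print(owning_entity('http|org|wikipedia|en|wiki|Cat'))           # Wikipedia
```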
OK, sorry for the intermission. So, as Benjamin said, we tried two prototypes, and the first one used Neo4j. Why? Because, as you saw, we have a graph, a rather complex graph, and Neo4j is a graph database, so we told ourselves this could be a good idea. It's a tree, and it's a graph. This is the schema of the Neo4j database we devised. Each of those nodes is a stem of an LRU, the reversed URL: for instance, here you have "http", here you have something like "google" or "www", and down here you have actual pages. So you have the tree structure, and between the pages you also have links, because this is the web. This is quite difficult and a bit cumbersome. The challenge was being able to insert the data into the database and to query it, and we had to rely on really complex queries. Some takeaways, as a bonus: UNWIND is really your friend; if you have to express conditions in Cypher, REDUCE, CASE and COALESCE are your friends; and we even tried stored procedures. For instance, this is a simplified version of the query we use to insert pages, and it's a bit cumbersome. Here are two excerpts of the Cypher queries we used to compute the graph of web entities sitting on top of the pages. You can see we even tried stored procedures, developed by Benoit, who is somewhere in the room here. So, as it turns out, it's not as straightforward as it seems to traverse trees in Neo4j. We have a tree, we have a graph, and Neo4j is really useful for graph pattern queries; but if you have to query a tree, go down the tree, go up the tree, do a depth-first search, it's a bit cumbersome.

So we went on to the second prototype, which is called the Traph. For the Traph, we designed our own on-file index to store this complicated multi-level graph. People told us not to do it, of course. And yes, it seems crazy: building an on-file structure from scratch is not easy, we are not experts on the subject, there are plenty of existing solutions already, and what if the structure crashes, what if the server just shuts down and your data is lost? But it's not so crazy. Why? Because you cannot get much faster than a data structure tailored to your problem. We don't need deletions, which is a huge win, because a lot of the complexity of custom structures is related to deletions. And we don't need an ACID database; for us, that's total overkill. We just need an index, but a custom one. And an index does not store any original data: in our case, a MongoDB stores all the actual data in a reliable way. This means the index can be completely lost, utterly destroyed, and we don't lose any data; we can always recompute the index. No biggie.

So, what's a Traph? It's a subtle mix between a tree (I know, some people would call it a trie; I will say tree) and a graph. Hence the incredibly innovative name. As you remember from earlier, we have a tree of LRUs, the reversed-URL things. In this first version, one character is one node in the tree. That's the basic design; it's not optimal, but we will get to that later. And on those nodes we can plant flags marking the web entity boundaries. So basically, what we are going to build is a tree storing the LRUs of our graph at character level. That's all.
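Here is a toy in-memory version of that idea: a character-level trie whose nodes can carry an entity flag. The names and API are invented for the example, and the real Traph lives in a file, not in Python objects.

```python
class TrieNode:
    __slots__ = ('children', 'entity_flag')
    def __init__(self):
        self.children = {}        # one character -> one child node
        self.entity_flag = None   # set when a web entity starts here

root = TrieNode()

def insert(lru, flag=None):
    """Add an LRU to the trie, one node per character, optionally
    planting a web entity flag on its last node."""
    node = root
    for char in lru:
        node = node.children.setdefault(char, TrieNode())
    if flag:
        node.entity_flag = flag

def owning_entity(lru):
    """Walk down the trie remembering the last flag seen: that flag
    is the web entity the page belongs to."""
    node, flag = root, None
    for char in lru:
        node = node.children.get(char)
        if node is None:
            return flag
        if node.entity_flag:
            flag = node.entity_flag
    return flag

insert('http|org|wikipedia|', flag='Wikipedia')
insert('http|org|wikipedia|en|wiki|Bird', flag='Bird article')
insert('http|org|wikipedia|en|wiki|Bird|history')
print(owning_entity('http|org|wikipedia|en|wiki|Bird|history'))  # Bird article
```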
How do we do that on file? It's quite simple: we use fixed-size blocks of binary data, for instance 10 bytes or 80 bytes, whatever we decide. Because the blocks have a fixed size, among other useful properties, we can access them in the file in random-access fashion, which is quite fast, especially on SSDs. Accessing a specific page's node is done in O(m), m being the length of the LRU. This is an example of the binary layout we created for the tree nodes: we have the character data; some flags, as bits in a byte; and some pointers: the next sibling, the first child, the parent, and one more thing, a pointer to the outlinks and inlinks, which we will see now.

Because we have a tree, but we also have a graph of pages. How are we going to store this graph? The second part of the structure is a distinct file which stores a bunch of linked lists between pages. That's all. We just need to be able to store links, out and in: this page points to that one, and that one is pointed at by this one. We store this the same way, using fixed-size blocks of binary data, and it's even simpler: we have a target, a weight and a next pointer. Such a block is really just a stub: it only holds the target, not the source of the link, because the source is the tree node pointing into this file. So we have a pointer in the tree file pointing into the link file, and there we have a linked list of (target, weight), (target, weight), and so on. Quite simple. So now we can store our links: we have the tree of LRUs, we have the graph of pages, so we have a Traph.

Now, what about the multi-level graph? Because what we want, if you remember correctly, is not to query the graph of web pages, but the graph of web entities aggregating and sitting on top of the graph of pages. What we do is select some nodes in the tree structure and flag them, telling the tree that this is the beginning of a specific web entity's realm. It looks like this: we plant a flag here, here, here, and so on, and when we are at some node, we can determine which web entity we belong to. Finding the web entity to which a page belongs is obvious when we traverse downward. What's more, if we land somewhere in the middle, for instance by following a link to its target, we just have to bubble up to the nearest flag, and we know our web entity. Actually, we don't do that, because we don't need to, but it's an efficient possibility. And of course this can also be cached in RAM if we need to speed up computations. Finally, if you want to compute the web entity graph sitting on top of the web page graph, you just perform a simple depth-first search on the tree. This seems costly, and usually it is, but here there is no way around it, because you have to scan the whole database at least once; and since the structure is quite lean and light, you won't need to read that much.
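As an illustration of those fixed-size blocks, here is a sketch in Python using the struct module. The field widths and ordering are invented for the example, the real Traph layout differs, but the random-access arithmetic, block index times block size, is the point.

```python
import struct

# One illustrative node block: a 1-byte character, a 1-byte flag field,
# and four 8-byte pointers (next sibling, first child, parent, first
# outlink block in the separate link file). 34 bytes per block here.
NODE = struct.Struct('<cBQQQQ')
FLAG_WEBENTITY = 0b0000_0001

def write_node(f, index, char, flags, nxt, child, parent, links):
    f.seek(index * NODE.size)   # fixed size => random access by index
    f.write(NODE.pack(char, flags, nxt, child, parent, links))

def read_node(f, index):
    f.seek(index * NODE.size)
    return NODE.unpack(f.read(NODE.size))

with open('traph.dat', 'w+b') as f:
    write_node(f, 0, b'h', FLAG_WEBENTITY, 0, 1, 0, 0)
    print(read_node(f, 0))      # (b'h', 1, 0, 1, 0, 0)
```

The link file works the same way, except its blocks are even simpler: just (target, weight, next), forming on-file linked lists reached through the node's outlinks/inlinks pointers.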
So, the question is: was it worth it? We did a benchmark on a small corpus, a 10% sample of a sizeable corpus about privacy and data privacy: one million pages, five million links, 20,000 entities and 30,000 links between those entities. We ran the benchmark, and Mathieu presented the results. To index the thing, to insert the pages in the database: Lucene took one hour and 55 minutes, Neo4j took one hour and four minutes, and the Traph took 20 minutes. To process the graph, that is, to be able to query it and get the aggregated entity graph: Lucene took 45 minutes, Neo4j six minutes, and the Traph two minutes and 35 seconds. So far, we are winning. On disk space: Lucene took 700 megabytes, Neo4j 1.5 gigabytes, and the Traph one gigabyte. So Lucene seems to win here, but not for long.

That is what we did in Copenhagen. After Copenhagen, we came back and decided to redevelop the whole structure in Python, because we wanted to limit the number of languages used by the crawler's core. And by doing that we made some new discoveries along the way and improved the performance of the Traph even more. You have the source code if you want to check it.

Here comes a bonus section; I will go fast because we don't have time. What we discovered is that a single-character tree is slow, and that storing at stem level is better: if you store ".com" or "google" as one node instead of single characters, you save a lot of space and you go faster, because the structure is even lighter. But the issue is that we had to find a way to store variable-length stems: a character is easy to hold, it's just a byte, whereas a stem can be 16 bytes or even more. So we designed the structure to handle that. At the beginning, the results were bad, because the children of a node were organized as linked lists; with single characters you cannot have more than 256 children, but with stems you can have a million. So we tried something else: we organized the children as binary search trees, which gives you another known structure called a ternary search tree; look it up. After that, we tried to auto-balance those binary search trees, because notoriously such trees can degrade into linked lists; but it didn't buy us anything: reads were slower, writes were slower, because in fact the order in which things are inserted into the tree generates enough entropy not to degrade the structure. And finally, we switched to using varchars, reserving one byte for the length of the string so we don't have to trim trailing null bytes, and that doubled performance.

So here we are now: we went from 45 minutes to 20 seconds to compute the graph. And now the web is the bottleneck again, which is normal, and should be the case in almost every setup like this. The current version of the Hyphe crawler uses this index in production today. So it works.
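To show what "children organized as binary search trees" means, here is a small sketch of a ternary-search-tree insert at stem level, in memory; the names are invented, and the on-file version differs.

```python
class TSTNode:
    """One stem plus three pointers: 'lo'/'hi' order the siblings as a
    binary search tree, while 'eq' descends to the next stem of the LRU."""
    __slots__ = ('stem', 'lo', 'eq', 'hi', 'flag')
    def __init__(self, stem):
        self.stem = stem
        self.lo = self.eq = self.hi = None
        self.flag = None  # web entity flag, when one starts here

def insert(node, stems, flag=None):
    head, rest = stems[0], stems[1:]
    if node is None:
        node = TSTNode(head)
    if head < node.stem:                       # sibling, to the left
        node.lo = insert(node.lo, stems, flag)
    elif head > node.stem:                     # sibling, to the right
        node.hi = insert(node.hi, stems, flag)
    elif rest:                                 # matched: descend a level
        node.eq = insert(node.eq, rest, flag)
    else:
        node.flag = flag or node.flag
    return node

root = None
root = insert(root, ['http', 'org', 'wikipedia'], 'Wikipedia')
root = insert(root, ['http', 'org', 'fosdem'], 'FOSDEM')
```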
And the final mea culpa: yes, we probably used Lucene badly; yes, we probably used Neo4j badly. But when you have to twist a system that much, meaning you have to tweak the internals, or add stored procedures, and so on, aren't you in fact developing something new? This is just to say that you can develop a custom index, and it might be a good thing to do. But of course we are not experts on this subject, so we are confident the structure can be improved further, and we are confident that the people in this very room can help us do so. Please bash our ideas and tell us we can do better, and differently. Thanks for your attention.

Cool, any questions?

Question: do you store this index in a file, writing your binary data directly, or do you use something intermediate to store the byte structures? Answer: we just use two raw binary files, and we write into them using the file system APIs; that's all. Per corpus: we have one such pair for each corpus. And since the disk space used is really, really low, we don't have to shard or split things yet. Maybe in the future.

Question: why not Berkeley DB? It can store arbitrary bytes, it's quite fast, it's thirty years old, and it has a C API for indexing files. Answer: simply because we didn't know about it, so we did it from scratch.

Question: in Java? Answer: no, the original prototype was in Java, but our actual implementation now is in Python.

Question: your files are completely binary; are they platform-dependent? Answer: binary, but not platform-dependent, I believe. You mean endianness? Probably, yes. I'm not sure, though.

Question (inaudible). Can you speak louder? I didn't understand everything. If I understand correctly, the question is: does the whole software run on a single machine? Currently, yes. We now have an easy install with Docker, which sets everything up in containers. If you do the manual install, you can configure the crawler to run on a different machine; but the data itself will be stored on the same machine where the indexing process runs. So yes, the crawling can be separated. I would also add that, for now, we haven't had corpuses that would exceed the storage of a single machine, because the kind of crawls performed here are quite qualitative: you have a researcher who, in the end, selects the things he wants, and since a human is involved, you won't get too much data. So for now it's not possible to shard the structure and get a really, really scalable crawler, but that's not really the point, I guess.

Any more questions? Well, thank you very much.