Okay, welcome everyone. I'm really looking forward to Paul's presentation on using graph databases for history analysis and visualization. The stage is yours, Paul.

Thank you. Well, thank you for having me again this year in the Graph Dev Room. I'm really happy to share with you our experience in using networks, in many different ways, to study 18th century French trade. I am Paul Girard, part of the médialab of Sciences Po, which is, to make it short, a research lab dedicated to digital methods in the social sciences. I worked with Guillaume Plique, who is in the room and who will be talking tomorrow in the JavaScript room about memory structures. We did this work inside a research program called TOFLIT18, which has been financed by the French National Research Agency, the ANR.

First I will tell you what 18th century French trade is and why it is interesting. Basically, France, as a state, started to compile statistics about its trade in 1716. The Bureau de la Balance du Commerce was the organization which wrote paper reports about the commodities imported into or exported from France, to and from foreign regions. So in many different French archives you can find those volumes, archival volumes which describe trade at that time. This is what those volumes look like: handwritten paper. Basically it's an account book where goods and products are listed one after another, with information such as volumes and prices.

I'm going to tell you how and why we used network technologies to create a research instrument based on the transcriptions of those archives I just showed you a picture of. What we mean by a research instrument is a tool to explore visually what French trade looked like at that time. Our corpus, our data, is composed of more than 500,000 yearly trade transactions, each one recording one commodity traded between a French local tax district and a foreign country or region. We have these data along the whole 18th century, so our resolution in time is yearly, and we have 547,000 of those flows between French cities and foreign countries.

Then we had to design a classification system to reduce the heterogeneity of the commodity names. What do I mean by that? This is the top 50 product names we can find in the source volumes, and it's only the top 50 out of 55,000 different names. If you look at them, we have two issues. The first one is orthographic clustering: we have "huile d'olive", olive oil, written normally and then written with a semicolon right in the middle. I will talk about those semicolons later in the presentation. When we analyze the data, we would like those to end up in the same category, please. The second kind of category we would like to create comes from thematic clustering: we have "eau-de-vie, liqueurs et bière" — sorry, it's in French, and in old French — which is basically alcoholic drinks, and we would like to create that category. So that is a more thematic issue.

To tackle those two issues, we had to design a classification tree. It's not one classification, it's a tree of classifications. Why a tree? Because we need a hierarchy of classifications, because we need progressive aggregation.
We have the sources, and then we want to aggregate a little bit, then a little bit more, then a little bit more, until maybe we end up with something like five categories — products you can eat, products you can burn, that kind of thing. But we also want concurrent classifications, because we want to be able to work with alternative ways to aggregate this information.

So what I just showed you is our classification tree. On top of it we have the sources. Then we built an orthographic normalization classification, whose rule is really simple: same word, different spelling. Then we have the simplification one, which is: different words, same meaning — we put them together. And then we have a long list of sister classifications. They are all based on the same root, the same parent: the simplification one. Simplification has been designed to be generic, but all the other ones are thematic, so they are really bound to a research question. The medicinal products classification, for instance: we want a really fine-grained description of the medicinal products, and everything else can go into "not medicinal products" — a classification finely tailored to one specific theme or subject. So we have medicinal products; we have the Hamburg classification, which has been designed to make a join with another classification from another research team in Hamburg, targeting Hamburg trade; the Canada one — I will speak about it later — is about which goods were traded between France and Canada; and so on and so forth.
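To make this tree idea a bit more concrete, here is a minimal JavaScript sketch of progressive aggregation through such a tree. The product names and groups are invented for the example, and the real TOFLIT18 classifications live in the database, not in plain objects like these.

```js
// A classification maps each item of its parent level to a group (a minimal
// sketch; real TOFLIT18 classifications are stored in the database).
const orthographic = {            // "same word, different spelling"
  'morue seche': 'morue sèche',
  'morue sêche': 'morue sèche'
};

const simplification = {          // "different words, same meaning"
  'morue sèche': 'morue'
};

const canada = {                  // a thematic sister classification
  'morue': 'definitely Canadian'
};

// Progressive aggregation: walk a source name up the chosen branch of the tree.
function aggregate(sourceName, levels) {
  return levels.reduce((item, level) => level[item] || item, sourceName);
}

console.log(aggregate('morue sêche', [orthographic, simplification, canada]));
// -> 'definitely Canadian'
```

Each level only knows its direct parent level, which is what makes concurrent, sister classifications possible: several thematic classifications can share the same simplification parent.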
Our goal with this research instrument is to provide an exploratory data analysis tool to economic historians. Economic historians are scholars who are either historians studying the economy or economists studying history.

So, demo time. This is the tool online — well, actually this is my local version, because I'm not crazy. Let me show you the classifications first. If you go into classifications you can choose products, and from there I'll start with the orthographic normalization. Here you can see "cuir de boeuf" — maybe you can't see it; yes, you can, maybe — with different spellings. "Morue sèche", "morue sêche": different spellings. So these are the sources, and we aggregate them: 65 different spellings of "cuir de boeuf" found in the sources have been aggregated into one group in the orthographic normalization. Then we can also have a look at simplification, and here you can see, for instance, 23 different items from the orthographic normalization aggregated in one step — which gathers many more different ways of talking about the same thing. It's no longer only orthographic differences within the same words, it's different words referring to the same thing. If the question on your mind is how we did this clustering, the answer is: by hand, thanks to Pierre Gervais.

Then we can also have a look at the Canada one, just for fun. Oops — Canada. So basically we have four categories: it's not a Canadian product, it's definitely a Canadian product, it's maybe a Canadian product, and we don't know. And the funny part is that you can also use this classification system to check whether the person who wrote the classification actually did it right. You can do that by choosing to project the four Canada categories not onto the simplification classification, which is just its parent, but down to the root, to the sources. So here I have all the different versions, in the sources, of what is definitely Canadian, and you can go back to the source to check how the aggregation has been made — which for historians is very important; it's mandatory to do research, in a way.

Based on this system of sources plus classifications, we can then leverage a lot of different exploratory data analysis tools, like this one. Here, for instance, is simply how many trade flows we have per year for each direction, a direction being a point of trade in France. We see that Marseille is on top, but the sorting is wrong — which matters; there is a GitHub issue about that. But let me show you what happens if I filter: I project only the definitely Canadian products, and now you can see that the order has changed. The most important direction, the place in France trading Canadian products, is actually La Rochelle in number of trade flows.

This is a hint, but we can go further, because we have time series. Oh, and the application keeps its state, so I still have the choices I made in the filters while preparing the demo, which is quite nice. Here I use those filters to show you two curves. One is only Canadian products going through La Rochelle — this is the black curve — and the purple one is the same thing, only Canadian products, but going through Marseille, which is another really important port in France. As you can see, there is a date at which the curves flip: La Rochelle is above Marseille for the whole first period and then drops below Marseille, both in number of flows and in value of flows. And that year is precisely the year when France lost its Canadian colonies to the British Empire after the Seven Years' War. This is the kind of exploratory data analysis we want to provide to economic historians.

So that was the demo — sorry again for my voice. One important point: what I just told you about Canada is told much more precisely in a paper I submitted to a conference called Digital Humanities, which will happen in Rotterdam in 2019. The paper is under review, so I hope to tell you the full story later.

To do all this we used the graph database Neo4j, to model our data as a trade network where trade flows are edges between trade partners. Why? Because trade flows naturally form a network; because we want to be able to dynamically aggregate flows by any classification, as I did in the demo; because I want to be able to change a classification without having to re-index my data; and, last point, because we had used Neo4j before, with pleasure — you can see a previous talk we gave about that three years ago.

This is our data model. We have the flow node in the center; this branch is how we aggregate country names through our classification system, this branch is how we aggregate products through our classification system, and then we have the sources, the operator and the direction.
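As a rough illustration of that model, here is what creating one flow and its surroundings could look like, written as Cypher inside a JavaScript string. All the labels and relationship names below are assumptions made up for the example, not the actual TOFLIT18 schema.

```js
// A sketch of the data model: the Flow node sits in the middle, linked to a
// product, a country, a direction and a source; products (and countries) are
// then aggregated upwards, level by level, by classification group nodes.
// Every label and relationship name here is an assumption for illustration.
const createOneFlow = `
  CREATE (f:Flow {year: 1720, value: 1500})
  CREATE (p:Product {name: 'morue sêche'})
  CREATE (c:Country {name: 'Canada'})
  CREATE (d:Direction {name: 'La Rochelle'})
  CREATE (s:Source {name: 'some archival volume'})
  CREATE (f)-[:OF]->(p),
         (f)-[:FROM]->(c),
         (f)-[:TRADED_AT]->(d),
         (f)-[:TRANSCRIBED_FROM]->(s)
  // classification branch: group nodes aggregate products level by level
  CREATE (o:ClassifiedItem {name: 'morue sèche'})-[:AGGREGATES]->(p),
         (m:ClassifiedItem {name: 'morue'})-[:AGGREGATES]->(o)
`;
```

The important part is that the flow node is the meeting point of all the branches — which is exactly what caused us trouble, as I explain next.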
So, until somewhere in 2015, 2016, we stumbled upon a lot of Cartesian products while trying to query our Neo4j graph. We were not very clever, and we didn't have that really nice feature in the Neo4j tooling which now warns you when you are doing something stupid. But we finally stumbled upon a solution, on November 23rd, 2016.

This Cypher query is basically very bizarre. What I do here, in the first match, is take every product which is Canadian; then I want every flow with those countries. And then I'm using indexes on the nodes to select my flows: I collect products, I collect countries, and then, as a developer, I use an index inside the flow nodes. That's bad, because our specification was to be able to change a classification without having to re-index — and here I'm using indices stored in the node properties. This not-that-good solution implies indexing the product names, the product node values, inside the flow nodes. In other words, I'm using indices in the nodes so as not to have to do the traversal. It only works at the source level, but still, it's an issue. It's a bad workaround, as I just said: leveraging Lucene indices hidden inside Neo4j instead of using graph traversals. That's not good.

And what was the problem? The problem was our flow node. We couldn't go through the flow node: when we tried, we got Cartesian products. That's bad. This flow node is the top-degree node in this schema, the central point: if you want to use one classification to reach the flow from one side and another classification from the other side, you have to go through it, and we didn't find a way. Until I realized what this node actually is: it's a hyperedge. Once you have this keyword and you look for information about Neo4j and hyperedges on the internet — you have to really want to find it, but you can find it. If you open the documentation, there is a hyperedge page in the Neo4j manual. I read it, at that time, and when I read it I said: actually, I'm just freaking stupid. I learned how to use it on November 23rd, 2016. Why? Because I was preparing a talk proposal for the 2017 graph dev room, which I finally didn't submit because I felt really stupid. And this commit archives that moment of glory, of revelation, because we changed all our queries to use the hyperedge pattern.

What is the hyperedge pattern? You just have to do it right, basically. You do a match, and your match has to declare the whole route to the central hyperedge flow node, and you pipe the patterns together: I want a specific direction in France, and all the flows attached to it; I want one specific classification of products, one specific item — exclusively Canada — and all the flows that match it; I want all the countries that are in this classification; and I want only certain sources. And in the WHERE you just put the params of your match. Once you've done it, you're like: of course.
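To give a feel for the shape of such a query, here is a minimal sketch of a "piped through the hyperedge" match — again with invented labels, relationship names and properties rather than our actual schema, so treat it as an illustration of the pattern only.

```js
// One single MATCH declares every branch going through the central Flow node,
// so the planner traverses the hyperedge instead of building a Cartesian
// product between the branches. Names are assumptions, not the real schema.
const canadianFlowsByDirection = `
  MATCH
    (d:Direction)<-[:TRADED_AT]-(f:Flow),
    (f)-[:TRANSCRIBED_FROM]->(s:Source),
    (f)-[:OF]->(:Product)<-[:AGGREGATES*1..]-(group:ClassifiedItem),
    (f)-[:FROM]->(:Country)
  WHERE group.name = $group AND s.type = $sourceType
  RETURN d.name AS direction, count(f) AS flows, sum(f.value) AS value
  ORDER BY flows DESC
`;

// Example parameters for that query.
const params = {group: 'Canada', sourceType: 'Local'};
```

The point is simply that the flow node appears in every pattern of the single MATCH, so every branch is traversed through it.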
Okay, actually, an index-based engine like Elasticsearch would have done the job in the end, because we didn't have time to implement the classification-modification user interface. That main feature — being able to change classifications without re-indexing — was so important because we wanted users to modify the classifications through our web application, and we didn't want to re-index because of users' actions. What you can do with the Neo4j database is just add or remove nodes in the classification part, not in the flow part, which is the source. We didn't code this user interface, but we should, one day. And there is a second thing: modifying a parent classification needs to update its children. I have a tree of classifications, so if I modify the root, everything under it has to be re-wired, because there can be new nodes, new names, renamed groups — it's a mess. So together with Guillaume we wrote an algorithm — which would be very difficult to explain here, so I will not — designed to re-wire the tree using set theory. Voilà.

Okay, last part of my talk, and I still have a few minutes to go, that's good. We did all of this because we wanted exploratory data analysis, so let's talk about that. We are using JavaScript technologies. Decypher — I will come back to it in a moment — is a piece of software from Guillaume which allows you to build Cypher queries in JavaScript. We use the Express web application framework for Node, but we surround it with dolman, another library from Guillaume. Graphology is a JavaScript network library with statistics — we heard about it two or three talks ago. React for the UI, but React with Baobab as the state tree — Baobab is like Redux, but before Redux, by Guillaume again. And Sigma, a JavaScript library for network visualization — we talked about it three talks ago. So thanks to Guillaume and Alexis, who actually wrote all the Baobab and React parts.

Decypher is for when you need to create, in JavaScript, a complicated and dynamically defined Cypher query. Without it you concatenate strings: you add a clause, and another, and another, and you end up with long code you really don't want to write — but sometimes you have to. With decypher you have query objects, you can push clauses, set params, and then you build the query and send it to the database. That's really cool if you are in JavaScript and you need to use Cypher.
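As a tiny illustration, a decypher-style query build might look like the sketch below. I am writing the method names from memory of decypher's README, so take the exact API, as well as the labels in the Cypher fragments, as assumptions rather than the project's actual code.

```js
// Dynamically building a Cypher query in JavaScript (decypher-style sketch;
// method names and labels are assumptions for illustration).
const {Query} = require('decypher');

const query = new Query();
query.match('(d:Direction)<-[:TRADED_AT]-(f:Flow)');
query.where('d.name = {direction}', {direction: 'La Rochelle'});

// The whole point of the builder: clauses can be pushed conditionally.
const onlyCanadianProducts = true;
if (onlyCanadianProducts)
  query.match('(f)-[:OF]->(:Product)<-[:AGGREGATES*1..]-({name: "Canada"})');

query.return('count(f) AS flows');

query.compile(); // the assembled Cypher string
query.params();  // the collected parameters, e.g. {direction: 'La Rochelle'}
```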
To sum up the technological part of the talk: we basically homebrewed our open source data science tools, thanks to Guillaume, and they are on our GitHub. Soon we will also release the data in a data package format.

The graph model is not only a convenient way to store and query our data, it is also a powerful visual object to explore the geographical structure of French trade. So now I'll tell you how we used networks not only to store the data but also to analyze them, visually. I will not have time to do the demo, so this is a picture of what you can get if you select these filters on our website. This is the locations network: a bipartite network with two types of nodes, points of trade in France and points of trade outside of France. The question here is: why didn't we use a geographical layout? After all, all those nodes are actual places. First, because then you need to geolocalize everything, and we have "places" like "the North" — that's an issue, because in history we don't necessarily have a precise or stable level of precision in the data. Second, it would also have required much more work, which is another important point. But the last point is that those two first reasons were really just the excuse.

The real reason is that when you have a network of flows on a geographical map and you place the nodes precisely on the map, you don't see the network structure any more; you only see the geographical spread of your points. That's important information, but you see the geographical layout, not the structure. And what we wanted to study — well, to let our users study — is rather how the trade flows distort the geographical layout, to show what the trade structure is, geographically speaking. For instance here: okay, Marseille is close to Italy, yes, of course; but then you can see that Holland is a very important partner for pretty much all of France, not only the northern part. This is something I came across on Twitter recently, and it illustrates my point exactly: a geographical map with a network, where the map is distorted by the network structure. That's exactly what we want to do — to see how the geographical layout is modified by the links between the places. It's not from our team; it's a shifted map from those people, Till Nagel and colleagues, and there is a link to the paper here.

The second network analysis we've done is about trade product specialization patterns. Okay, I have to be quick. If you have a look at the sources you see a lot of semicolons — I showed you one before. Those semicolons actually represent handwritten curly brackets, accolades, that were used in the sources because the people writing those reports wanted to save time and ink by using a generic-to-specific aggregation system. They would write "wine", a huge bracket, and then "from Burgundy", "from Bordeaux" — from generic to specific. But there are two layers of issues: the manual transcriptions we did, and the 18th century writing practices of the clerks at the time; and those writing practices were not applied consistently — writing practices never are, in my experience. So we decided to replace the semicolons with glue words when aggregating product names in our orthographic normalization classification.

But we also compute a product-terms co-occurrence network. We take the name of a good, like "vin de Bordeaux de très bonne qualité", and we put a link between the first word and the second one, between the second and the third one, and so on. So it's not exactly a co-occurrence network, but almost. The code is here — this is where we use graphology. And this is what you get when you look at it: the network of co-occurring product terms for exports from La Rochelle between 1720 and 1729. What does this network tell you? Even without using our hand-made classifications, by building this network we provide you with a clustering of the terms which compose all this trade. You select a trade — La Rochelle, exports, sources, dates — we take all the names, we build the network, and the coloring comes from the Louvain community detection algorithm. So we get Louvain communities of product terms, and you can see that this is basically a thematic map. We have woods, planks of wood; we have metal stuff — fer, cuivre, iron and copper; we have coton and laine, cotton and wool, so fabrics basically; then we have animals that you can maybe eat; and then we have skins. So this is an automatic, bottom-up thematic ontology obtained through network analysis.
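A minimal sketch of this kind of construction, using graphology and the graphology-communities-louvain package, might look like the following; the product names and the exact options are assumptions for the example, and the real TOFLIT18 code differs.

```js
// Build the product-terms network: link consecutive terms of each product
// name, then colour the nodes with Louvain communities. A sketch only.
const Graph = require('graphology');
const louvain = require('graphology-communities-louvain');

const productNames = [
  'vin de Bordeaux de très bonne qualité',
  'vin de Bourgogne'
];

const graph = new Graph({type: 'undirected'});

productNames.forEach(name => {
  const terms = name.split(' ');
  terms.forEach((term, i) => {
    graph.mergeNode(term);
    if (i > 0) graph.mergeEdge(terms[i - 1], term);
  });
});

// Assigns a 'community' attribute to every node: our bottom-up clusters.
louvain.assign(graph);

graph.forEachNode((term, attributes) => {
  console.log(term, attributes.community);
});
```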
And I think — we think — this is cool. Actually, one of my wishes is to one day use stochastic block modeling, which we are using on other networks, as the clustering algorithm instead of Louvain, to analyze these bidirectional generic-to-specific term relationships — I don't have time to explain that here. There is a final demo I will not do, because that's the last slide and I don't have time any more, but you can actually compute those networks using a long list of filters to fit your needs. Okay, let me finish this first.

My takeaway, and then I'll demo until I run out of time. My takeaway: what we need is to be able to change classifications — because classifications are very important for social scientists; very important, this is how we analyze; by hand, yes, by hand, maybe with some machine learning algorithm, but by hand at some point — without having to re-index, because re-indexing takes a long time, it's heavy, and we don't want to do it. It's hard, and it bears both on the data modeling part — which database you are going to use and how you are going to traverse it, which we did — and on the user interface, which we haven't done. So even if you have a really nice graph database, if you don't have the nice UI that actually requires the no-re-indexing feature, maybe an index would have been sufficient. That's a very important point too. When you design a product, and especially a visual exploration product, you need to think about how your data is going to be mobilized, how it is going to be queried, how it is going to be written, at which frequencies and by whom, and which visual models you are going to use. Once you've done that, you know which database you have to use. It's hard. Of course, having a graph database with a documentation page about hyperedges can help — and Neo4j's actually helped me when I was in that despair.

Okay, this is my merci — my merci to you, but also to those people, the economic historians. Pierre Gervais, on the historian side, actually did the classifications by hand — not all of them, but the first and most important three. Loïc Charles and Guillaume are our colleagues on this project.

So if I have some demo time, let's demo — actually I was hoping not to have to do that, because I'm not sure about my data. So, the locations network; let me remove these filters first. Here I'm looking at the grouping classification, because country names are also classified — I glossed over that earlier. I only want local sources, I only want goods traded with Canada, and I want total value, imports plus exports. So this is what we have. Of course it's sigma, so you can zoom, you can pan, and if you're lost you can use this — that's really nice too. Visually, you also need to be able to choose a threshold on the labels — there are now much better ways to do this, which we haven't implemented — how many labels are going to be displayed and at which size, which is really important.
You can export the network to a file that you can then import into Gephi. And then there are the product networks. Here we are — I will do the Canadian one: I choose the Canada classification, local sources, only Canadian products, total, everything, please. And then we have one community which is all about morue — codfish — which is séchée, dried; but we also have skins. Skins were a really important trade material coming from the Canadian inland: we have squirrel skins — yes, this is "squirrels" in French — caribou, vison (mink), loutre (otter). And you can see the level of detail we have in these nodes. Well, I think I'm good with that.

On the website you have three different visualizations: time series, projecting through time; locations, projecting through space; and product terms, projecting through semantics. We have the metadata section, which helps you know which data we have and for when; the classifications I showed you; the complete list of sources we've used, for archive people; and a glossary, because, I mean, it's complicated — a lexicon of the old French product names explaining what they are. That's it, I'm done. If you have any questions we will be very happy to answer them.

Thank you so much. Any questions for Paul, please?

Question: I would like to know, how do you digitize all these documents?

Okay, digitization. There are two steps. Taking pictures first: researchers go into the archives with a camera, take pictures of all the pages and put them onto a hard drive, then go back to the lab. From there you have two paths. One path is hiring a transcription company which will do it for you: you send them pictures, they send you back a transcription. The other path is hiring interns to do the same, and this path is actually usually used after the subcontracting company, to check. Then we have Excel files. The Excel files are converted into CSVs, the CSVs are put into a GitHub repository, and from this GitHub repository we compile the data to build the Neo4j database. We have a continuous integration system both on code and on data, so if I push new data to a branch of the GitHub repository, the server will reload the new version of the data.

About OCR: that's a part I haven't worked on — my colleague Charles did — and I think maybe the transcription company used some, but I can't remember. My two cents about this: there are very interesting new technologies to do this kind of OCR based on machine learning, where the person transcribing needs to know how to read this handwriting — paleography, I think, is the name. You need to train a model to recognize, first, that the structure is column-based, that there are those brackets, that there are numbers — and we actually want some of the numbers to sum up, because sometimes they don't sum up when they should. There is one really nice piece of software to do that — I can't remember the name, but I can show you — we haven't used it.

Okay, so there are two things here. The first question is: what is the size of the co-occurrence network? The second is: do we extract embeddings? First, about the size. The worst you can do is to work directly with the sources, which I think doesn't work for some weird reason, but we can try — it gives you a huge network. Voilà... voilà, it doesn't work, I don't know why, it should. The second biggest network has around 27,000 product names,
and this one will end up appearing... eventually. About embeddings: I don't know exactly what you mean by extracting embeddings — I would be really happy to talk about it — but what is sure is that you can extract the network from here and then apply any fancy network-analysis technique in your preferred software. You can still see it here if you're patient enough, and that's a big network — I think we can say that's a big one. We should display the number of nodes and everything, but I don't know, it doesn't show.