Thank you, Victor. Thank you, Mohammed, for inviting us, or rather for inviting Thomas; I think he may have invited me along, but thanks for letting us present our project. What we'll do today is present some reflections on the nature and use of big data in a big interdisciplinary project. This project is the Communes project. It's funded by the ANR and it's led by Thomas, by Isabelle Séguy and by myself. And you couldn't have done better, really, as organisers in placing us, in organising the sequence of papers for this conference, because we opened the conference and we're closing it. Our project is clearly a child, or a sibling, of the British project. What Campop has done in its decades-long production of the English and Welsh spatial infrastructure is very closely related to what we're trying to do; it inspires us, and there's an interesting parallel between the two projects in seeing how starting from scratch, to a certain extent a bit later on, changes the way in which some of these things are done over time.

So what we're going to do today is, very simply, offer a set of general reflections on the nature and production of big data in the context of this experience. The first point we're going to start with is how we construct the data, and in particular the sources that we use, or at least a presentation of those sources. We won't have time to go into the details, but obviously any question can be answered later. The project has two major components: the first is the historical administrative units, and the second is the transport networks linking those units. We will briefly present how we construct both of these.

The administrative side of the project emerged from two existing datasets. The first is the administrative history of all French communes since 1801, which some people refer to as the HAC dataset; it was created by Isabelle Séguy, Claude Motte and Christine Théré. The second is an enhanced version of that, enriched with data for the revolutionary period and linked to points on the Cassini maps and to the modern administrative units; that gave birth to the Cassini website, which anyone interested in French history probably knows about. Neither of these has GIS boundaries linked to the municipal units, and that's really where our project started. What we wanted was to be able to represent every one of those units in space and see exactly where each of them was at any point in time. The second problem with those data is that they really see the history of communes from Paris. We also have, and unfortunately I didn't put it on the slide, Victor's work on the Third Republic, which is a first step in the territorialisation, that is, in making it possible to map different administrative units in the past. All this together leads to what we call the ANR Communes dataset. And what we are now doing is completing those data with archives collected from the 95 departmental archives in France, each time collecting textual and cartographic evidence in order to complete the data and reconstruct the boundaries.
So this is the structure, the very simplified structure, of the dataset; it's certainly not a complex UML scheme that you see here, but at least it makes clear what it is. And how is it done? Well, we have over 40,000 communes after 1801; around 20% of these change, so we have roughly 8,000 changes to document since 1801. We have a few thousand more in the revolutionary period, but for these we don't even know how many are waiting in the archives. Obviously, with the pandemic, we had to revise our plans, so we'll know the numbers soon, but not yet.

What do we do? Well, when communes were simply divided, we can use the modern quasi-metric precision of the data from the IGN to recreate past boundaries, and we get excellent accuracy. In the case of communes merging, we obviously need to find cartographic evidence of the past boundary of the commune, and that's where it gets trickier. For the period since the Second World War, we can use scans of IGN maps, and our partnership with the IGN makes this possible: we just have to obtain the right map and redraw the boundaries. For the nineteenth century, and especially the second part of the nineteenth century, we have military maps that we can obtain the same way. For the earlier period, we have to rely on either cadastral maps or material collected from the archives. This is what it looks like: we have a map, we georeference it and we redraw the boundaries. In most cases the accuracy is quite good, and we always measure it for each of our reconstructions. But not always, because sometimes there is no map and we only have a textual description of the confines of the commune, of its boundaries, when the commune was set up, and we have to work with this textual document in order to construct a boundary for which we could not find any cartographic evidence of its exact position.

The result is this. Here you have a commune in 2019, and you have the same commune 100 years earlier. So basically, for every year in the Communes dataset we have a set of communes, each linked to the polygon that represents its exact area at that point in time; a minimal illustration of this structure is sketched below. And what's the benefit of this? To a certain extent it has an almost antiquarian aspect: we want to have exact boundaries for each commune. Yes, that's true, there's an element of wanting this to be as correct as possible. But there's another element, which was mentioned in the earlier presentation: by having a set of administrative units at any point in the past, we can link to that set of units any data created with the units of that period, natively if you want, without having to fiddle with the units in the dataset other than matching the names. That's one thing, but more importantly, and that's also a point that was made earlier, it becomes possible to link any data, through that interface, to any other data. At a basic level, for example, it makes it possible to relate and assemble datasets that give different data for each commune. But it also allows two different kinds of connection, the first being diachronic.
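To make the year-by-year structure just described concrete, here is a minimal sketch. It assumes a purely hypothetical layout, with invented identifiers, field names and figures rather than the project's actual schema: each commune identifier is linked to boundary versions, each valid over an interval of years, so that the polygon for any commune can be looked up for any year.

```python
# Minimal sketch (illustrative only, not the project's schema): each commune is
# stored as a set of boundary versions, each valid over an interval of years,
# so the polygon for any commune can be looked up for any year.
from dataclasses import dataclass
from typing import Optional
from shapely.geometry import Polygon

@dataclass
class BoundaryVersion:
    valid_from: int      # first year this boundary applies
    valid_to: int        # last year this boundary applies (inclusive)
    geometry: Polygon    # reconstructed boundary polygon
    accuracy_m: float    # estimated positional accuracy in metres

# Hypothetical records keyed by an invented, INSEE-like commune identifier
communes = {
    "21231": [
        BoundaryVersion(1801, 1918, Polygon([(0, 0), (4, 0), (4, 3), (0, 3)]), 25.0),
        BoundaryVersion(1919, 2019, Polygon([(0, 0), (5, 0), (5, 3), (0, 3)]), 5.0),
    ],
}

def boundary_in_year(code: str, year: int) -> Optional[BoundaryVersion]:
    """Return the boundary version of a commune valid in a given year, if any."""
    for version in communes.get(code, []):
        if version.valid_from <= year <= version.valid_to:
            return version
    return None

print(boundary_in_year("21231", 1850).geometry.area)  # area under the 1801-1918 boundary
print(boundary_in_year("21231", 2019).accuracy_m)     # accuracy of the recent boundary
```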
So it makes possible the use of time series on those very large datasets, because we don't have the problem of aggregating different units: we can spatialise the data, so we can link, for example, census data for different years and always be consistent in our use of administrative units. We simply don't have that problem. But it is also possible in a synoptic way, because we can relate data in any cell to data in any other cell through the relationship between those points and units, through the transport network that we are building. That means that for each point in time we can examine the set of relations between every point in space, and we can then add a temporal layer to this and see how those relations evolve through time. So I'll let Thomas now explain how we're building the transport network, to give you an idea of what it looks like, before we come back to some reflections on what that means in terms of the scale and the use of the data.

Thank you, Alexis. The linkage is based on the connection of the municipalities through the transport networks over the period studied, so transportation is at the centre of my preoccupations. We are going to build a multimodal network from 1800 to 2020, including waterways, railways and roads. Next slide, Alexis. This multi-source approach will be based on existing data: we have the postal network, we have the Cassini network, and also the French railways developed in a past project. For the information that needs to be created from scratch, we are currently working on the waterways, using the literature and the reference website for waterways in France, and next year will be dedicated to the digitisation of the roads from the military maps, but also from the Michelin maps of the 1910s and 1930s series. These sources will also be used to assign average speeds; we have already done this for pedestrian trips, for the stagecoach and for railway travel, and we still have to make an assessment of speeds over time for cars, trucks and navigation. This geographical database will be made up of different unimodal graphs for pedestrians, navigation, rail and car, and we will then merge all these unimodal layers into a graph that we call the multimodal graph; a small sketch of this merging is given below. The goal of this master graph is to connect the different modes with each other for accessibility calculations over the whole period of study. Over to you, Alexis. Thanks.

I'm sorry, I was already on the wine, as you can see, so probably that's why I'm a bit slow. So, as you can see, because of those two aspects of our project we are collecting a really large amount of data: cartographic material in order to find the boundaries, textual material in order to find mentions of administrative changes concerning the communes themselves, and obviously all the elements about the transport network that Thomas has just described. So we very quickly encountered a problem: how to do this? Given the scale of the collection that we wanted to harness in this project, we really had to think about the best methodology for that purpose. And obviously, at first, we wanted to automate everything.
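As a small illustration of the merging of unimodal layers that Thomas described, here is a sketch using networkx. The places, distances, speeds and the interchange penalty are all invented, and the project's real graph model is certainly richer; this only shows the principle of tagging nodes by mode, composing the layers, and adding transfer edges so that a multimodal travel time can be computed.

```python
# Illustrative sketch (not the project's code): merge unimodal layers into one
# multimodal graph, connect modes at shared localities with transfer edges,
# then compute a travel time between two places. Names and speeds are invented.
import networkx as nx

def unimodal_layer(mode: str, edges, speed_kmh: float) -> nx.Graph:
    """Build one layer; nodes are tagged with the mode so layers stay distinct."""
    g = nx.Graph()
    for a, b, km in edges:
        g.add_edge((mode, a), (mode, b), hours=km / speed_kmh)
    return g

road = unimodal_layer("road", [("Dijon", "Beaune", 45), ("Beaune", "Chalon", 30)], 15)
rail = unimodal_layer("rail", [("Dijon", "Chalon", 68)], 40)

multigraph = nx.compose(road, rail)                 # merge the unimodal layers
for place in {"Dijon", "Beaune", "Chalon"}:         # transfer edges between modes
    modal_nodes = [n for n in multigraph if n[1] == place]
    for i, u in enumerate(modal_nodes):
        for v in modal_nodes[i + 1:]:
            multigraph.add_edge(u, v, hours=0.5)    # assumed interchange penalty

hours = nx.shortest_path_length(multigraph, ("road", "Beaune"), ("rail", "Chalon"),
                                weight="hours")
print(f"Beaune -> Chalon, best multimodal time: {hours:.2f} h")
```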
It sounded like the obvious way to do it, but it's not that simple, because there is a cost to automation that can sometimes offset, or even outstrip, the gains that come with it. So we represented this very simply in a diagram showing that relationship. Doing it manually is extraordinarily expensive and time-consuming; doing it automatically can be very quick, but the problem is that there is also a cost to validation that can be much higher, and if the data are poor, that can invalidate their entire use, making them useless and therefore very expensive too. So we were faced with that problem: using the existing data and linking it to what we have was very efficient, doing it manually was very inefficient, and doing it automatically was efficient but not very secure. So we were really thinking about how we could do this, and I'm going to give two examples in the next few minutes of how we are trying to. I'll first give an example of the kind of data I mean when I say we are relating data to our administrative units, showing how we extract textual data to do this, and then I'll give you an example concerning geographic features for the transport network.

What I'm going to present here is really the outcome of a project that I've been leading for a few years with my colleagues in Cambridge, including Oliver Dunn, to try to make the extraction of tabular historical data more efficient: a project, and a piece of software, for transcribing historical objects with tabular handwritten data. The point is that lots of historical data comes in the form of tables. Extracting the text from them, even when handwritten, is now a process that computer scientists have cracked to a certain extent; it's not always perfect, but it's possible. The problem is that when the text is in the shape of a table, it requires not only understanding the text but also the relationships between different bits of the text, and that, in computer-science terms, is actually more difficult.

I'm going to give you one example here, which I took from the draft military lottery, in French les listes du tirage au sort, which exist consistently between the 1830s and the First World War. Around 420,000 men aged 20 are listed each year, before any decision was made on whether or not they would join the army, so it is a very exhaustive list: even those who would never serve are listed in these documents. I'm not sure whether Lionel would call it a poor source, as he said the censuses were a poor source; this one contains lots of different variables: name, place of birth, place of residence, place of residence of the parents, height, medical history and injuries, whether the person can read and write. I would qualify it as a rich source, but I don't want to upset anyone. We were obviously interested in extracting data from this source, and we did two things: first we elaborated a process of segmentation of the data, and then we fed it into a process of text recognition.
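To make the idea of "segmentation, then text recognition" concrete before the walk-through, here is a minimal, self-contained sketch of the kind of output such a pipeline aims at: detected cells carrying pixel coordinates and recognised text, grouped back into spreadsheet-like rows and columns. The detection and HTR steps themselves are represented only by their assumed output; the names, coordinates and tolerance value are invented.

```python
# Minimal sketch: once cells have been detected (boxes with pixel coordinates)
# and their text recognised, the boxes can be grouped into rows and columns
# like a spreadsheet. The cell list below stands in for the models' real output.
from collections import defaultdict

detected_cells = [
    {"x": 40,  "y": 100, "text": "Durand"},   {"x": 300, "y": 102, "text": "Dijon"},
    {"x": 42,  "y": 180, "text": "Martin"},   {"x": 298, "y": 179, "text": "Beaune"},
]

def to_table(cells, row_tolerance=15):
    """Group cells into rows by vertical position, then sort each row left to right."""
    rows = defaultdict(list)
    for cell in cells:
        row_key = round(cell["y"] / row_tolerance)   # cells with close y fall in one row
        rows[row_key].append(cell)
    return [sorted(r, key=lambda c: c["x"]) for _, r in sorted(rows.items())]

for row in to_table(detected_cells):
    print([c["text"] for c in row])
# [['Durand', 'Dijon'], ['Martin', 'Beaune']]
```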
Just to show you quickly what it looks like: if you have a document like this, we start by using some basic line extraction to try to recognise the basic structure. Then we annotate those pages; generally we are now annotating around 40 pages for a regular table. Then we ask a convolutional neural network model to use those data to detect cells, and by repeating this operation many times and correcting itself, the model produces tables in which the cells retain their relationships to one another, and we can extract those fields exactly as you would work with an Excel spreadsheet: basically each box has its own coordinates, and within those boxes we can then extract the text using HTR (handwritten text recognition) techniques that already exist and are quite common. At the end, we have data that is quite reliable. I'm not going to go through everything, but you can see, if you look at the endpoint, which is the corrected data, that the accuracy rate is quite good. It will obviously be lower for last names, because there is more variety, and it can be really high for fields like places, which are much more limited in number.

So this is one way we can create data, and obviously we can spatialise it, that is, project all that data immediately onto the communes in the context in which they were produced. This becomes very easy because we have all the data we've produced with the ANR Communes dataset.

The second example I'm going to give you concerns the problem of creating data from cartographic material. At the moment what we want is the roads, and we want to work with the état-major maps, these military maps. Extracting the roads as segments is challenging in itself, but when you look at the maps themselves, there are other layers of complexity. Those maps are tiles, and generally they're not stitched together properly, so the continuity of segments that we might rely on can be broken. The tiles also vary a great deal, there are lots of different types of line, and it's very easy for a computer to confuse one with another. So you see, what I was saying before about the relationship between automation and manual acquisition of data, and their costs and benefits, is something that obviously needs to be thought about carefully.

So what are we doing? We have adopted, or rather we are adopting, because this is obviously work in progress, a progressive approach. We started by digitising some roads ourselves; that's what you see in the bottom left corner. Then we harnessed the power of the crowd: we asked lots of people to scale up our capacity and do it with us, a kind of selected crowdsourced data acquisition. And then comes the point where we ask whether we can use this to feed a machine-learning model. We found two ways to do it: the first, a simple way, where we just use that data as annotations and train different machine-learning models, neural networks, to extract the roads.
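As a purely illustrative sketch of what "using that data as annotations and training a neural network" could look like, here is a tiny PyTorch example of a pixel-wise road/not-road classifier. The architecture, sizes and the random tensors standing in for map tiles and digitised road masks are all assumptions, not the project's actual model.

```python
# Minimal sketch (assumptions throughout): a tiny convolutional network trained to
# label "road" pixels on map tiles, using manually digitised / crowdsourced
# segments as binary masks. Real tiles and annotations are replaced by random tensors.
import torch
import torch.nn as nn

model = nn.Sequential(                       # deliberately small stand-in for a real CNN
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 1),                     # one logit per pixel: road / not road
)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

tiles = torch.rand(8, 3, 64, 64)                    # stand-in for georeferenced map tiles
masks = (torch.rand(8, 1, 64, 64) > 0.9).float()    # stand-in for digitised road masks

for epoch in range(3):                              # a few illustrative training steps
    optimiser.zero_grad()
    loss = loss_fn(model(tiles), masks)
    loss.backward()
    optimiser.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```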
What we found, and this is a big problem for us, is how to relate any segment that we've extracted from one set of maps to a segment extracted from another set of maps, because we are historians after all, and we're interested in understanding how those segments are related across maps rather than treating this as a black box. You can see here how it works: Claire Lagesse and colleagues have created a game that we're using, which allows us to ask people to pair segments from two different sets of maps. That gives us a second set of annotations, which allows us to use a temporal element of the dataset. So our data at the moment has two aspects: one is, for any set of maps, a series of segments we use as annotations, which we can feed into a convolutional network to try to get better at extracting them; the second is a pairing mechanism, which we could also use for machine learning in the future, to relate those segments to each other, so that we could potentially create time series of transport networks in the same way we are thinking about administrative units.

Obviously that has implications in terms of the size of the data. So we have a big geospatial data infrastructure, and this infrastructure is composed of heterogeneous data that we can formalise in a space-time cube: on the vertical axis you have space, from the local to the global, with the localities at the bottom and the European countries at the top of the cube; on the horizontal axis, the time scale extends from one year to several centuries. These space-time dimensions are determined by the nature of the sources. Next slide, please. These axes make it possible to define the three components of this data infrastructure: a data warehouse for the primary sources; the HGIS linked to a DBMS for the structured components; and intensive computing for the calculation of accessibility measures, together with simulation, exploration and visualisation for the analytical part.

This geospatial infrastructure is a gateway to even bigger data. We already have 50,000 spatial units, and we can generate 2.5 billion pairs. Soon, with the multimodal graph, we will have 16 different possibilities, and 40 billion pairs will be generated. Then, taking each year over the whole period, around 6 trillion pairs will be generated with our infrastructure, as the short calculation below illustrates. At the end of the project we will combine all these measures with economic and demographic data, which will of course represent a lot of data. And in the future, through URI identifiers, it can also be integrated with any other type of linked data, for example Wikidata and historical gazetteers. So we should reach the critical size of big data with this infrastructure.

Our big data requires an interdisciplinary approach based on the different disciplines represented in this workshop: economics, history, geography, of course, but we also need the help of computer scientists to manage this mass of information. We have three main domains to tackle. The first is data collection: we will need to add transportation data, and industrial, social and agricultural series. The second is data processing, with high-performance computing, text mining, machine learning and the semantic web. The third is data analysis, for example cliometrics, space-time econometrics, or the morphogenesis of networks, together with the computer scientists.
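The orders of magnitude quoted for the pairs can be recovered with simple arithmetic; in the sketch below, only the number of yearly snapshots is an assumption introduced to reach the final figure.

```python
# Rough arithmetic behind the figures quoted above (the year span is an assumption).
spatial_units = 50_000
pairs = spatial_units ** 2                       # origin-destination pairs
print(f"{pairs:,}")                              # 2,500,000,000  (~2.5 billion)

modal_combinations = 16                          # possibilities in the multimodal graph
print(f"{pairs * modal_combinations:,}")         # 40,000,000,000 (~40 billion)

years = 150                                      # assumed number of yearly snapshots
print(f"{pairs * modal_combinations * years:,}") # 6,000,000,000,000 (~6 trillion)
```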
So this is the kind of framework of data science applied to the digital humanities that we would like to reach by the end of the project. On the next slide, we talk about the FAIR approach that we have developed for the dissemination of our data, according to our data management plan. Our data will be findable, with metadata available through the Huma-Num infrastructure and its platforms. Our data will be accessible: the primary sources will be accessible through the PANDOR data warehouse developed at the MSH in Dijon, and the spatial database will probably be available through the geOrchestra platform. We are also working with the University of Heidelberg to develop a historical trip planner based on openrouteservice. Our data will be interoperable: we designed a conceptual model based on UML, and Alexis will illustrate the UK case in the next slides. And our data will be reusable: this is already the case, as we are sharing our datasets with economists, for example Christophe Lebeck in Bordeaux, with historians and geographers in Paris, and also with physicists working on the morphogenesis of networks. Alexis.

So you can see that what we have now developed is something that tells us about each spatial unit over more than two centuries and gives us the relationships between each of those units in space and in time. There are two elements that are perhaps still missing for that data to be useful, and this was a question Leigh had when he presented the English project at the start, and I'm sure anyone here who's not French, or not a French historian, would have it too: OK, that's all very good, but how does it fit into international comparisons? So here are a few remarks on the nature of the data and how it can be made interoperable in an international context, and I'll share some elements of something that Leigh already touched on when he was talking about occupational structure.

As Leigh said, the code that we use for occupational structure is called PST, or in its international version, PSTI. Recently, we've been working on how to move away from a strictly Anglo-centric approach to occupational coding, to make the data comparable in space, and this is the scheme in which all the occupational data we are collecting in our Communes project will be coded, so it will also be the backbone of our comparable socio-economic data. I'm not going to detail the current PST coding scheme, but the idea is that there is a six-point coding scheme which allows us to distinguish between different occupations based on a series of markers; I'm not going to explain any of this. What we want to do is add new markers, another eight elements to that code, which would make it, and I think it was Lionel who was making that point, a coding system that is context- and time-aware. So we'd be able to distinguish between two text strings that are exactly the same but mean something different because they come from northern or southern France, or from a different period, and so on. That obviously increases interoperability. But the problem is that the coding scheme in itself remains almost like a language that is only useful when you speak that language.
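To illustrate what a context- and time-aware marker could do, here is a purely illustrative sketch; it does not reproduce the real PST/PSTI markers, and the codes, regions and periods below are invented. It only shows how the same occupational string can map to different codes depending on where and when it occurs.

```python
# Illustrative only: an invented codebook in which extra context markers (region,
# period) disambiguate identical occupational strings. The codes are not PST/PSTI.
from dataclasses import dataclass

@dataclass(frozen=True)
class OccupationKey:
    text: str        # the raw occupational title as transcribed
    region: str      # contextual marker (assumed)
    period: str      # temporal marker (assumed)

codebook = {
    OccupationKey("ménager", "south", "19c"): "1.1.0",   # small owner-farmer (invented code)
    OccupationKey("ménager", "north", "19c"): "9.9.0",   # different meaning elsewhere (invented)
}

def code_occupation(text: str, region: str, period: str) -> str:
    return codebook.get(OccupationKey(text, region, period), "unclassified")

print(code_occupation("ménager", "south", "19c"))   # 1.1.0
print(code_occupation("ménager", "north", "19c"))   # 9.9.0
```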
What's more useful is having some kind of reciprocal relationship between languages that would allow any language to be interoperable with another. Obviously, economists do not use PSTI, or some might, but it's not the most common language, and Leigh and I and others have been thinking about how we could build on this to develop a more fluent relationship between those different languages. So we started with PST, developed by Tony Wrigley here in Cambridge, in its different versions, and with the project that Leigh and Osamu Saito were leading on making that code, PSTI, available for the occupational coding of more than 20 countries around the world. It's partly linked to HISCO, which is a code that historians have used a lot, but there is no perfect equivalence anyway because of the nature of the codes. The problem is that if we want to have a new code, and I'm going to skip ahead here, what we want is to have coding schemes like ISCO, like ESCO, like all the national coding schemes and all the industrial coding schemes also available, so that all of these could communicate. This is something we're thinking about, and we will probably put in a grant application soon to develop it further, so that we have an integrated code which would also allow those elements to be available in different languages or coding schemes. That also shows how, in the production of big data in a national context, we need to think about the relationship of national data to other existing data, but also to data being produced in different countries in the same way, so that they become interoperable and comparable.

I hope we've not been too long; I can't say, I stopped looking at the time. But if you want to get in touch, you can do so at those links, and if you want to use the data or have any questions about it, please just drop us an email; we'll be very happy to chat about it. Thank you so much, Alexis. So, any questions that you might have?