So, my name is Geoffrey. Thanks for inviting us to CSVConf, which is a great conference, and we are very glad to be here. Today we will give a presentation around the theme of solving some of the mysteries of open data. But first of all, we will talk about ourselves and where we come from. Antonin, Pavel and I are members of a department named Etalab, which is a mission under the authority of the French prime minister. There are many teams inside this department, like an artificial intelligence laboratory, but also the team in charge of data.gouv.fr, which is the French open data platform. The French open data platform is addressed to every French administration that wants to publish its data as open data. As of today, we have more than 36,000 datasets and more than 200,000 resources, and among these resources we have many CSV files. One of the specificities of the French open data platform is that anybody can deposit data on the platform: administrations, of course, but also any citizen who wants to share data. So, as you can imagine, some of the data published on the platform is structured, which is very much appreciated. The administrations that publish structured data often have a big IT department that follows a specific data format or data standard. But we also have a lot of published data that is not structured. Why do we have that? Because we don't want to prevent data from being published: even if data is not structured, we still allow it to be published. But as a result, we have many data quality issues. So I will talk about these data quality issues in this presentation, and about how we can explore unstructured data. This is important for the different users of an open data platform who want to know what is inside a specific file, a specific resource.
There are many existing tools in the literature to retrieve simple data types, like whether a column is an integer, a date, or a string. For instance, we recently tested pandas-profiling, and there is a session tomorrow that will talk a little bit about our experiments with these tools. It is a great tool for finding simple data types. But we would also like to inform users about complex data types that carry a business meaning, for the different reusers and for further reuses. What is a complex data type? A complex data type could be information related to a concrete thing. For instance, the identifier of a company is a complex data type. We would like to be able to retrieve this data and to be sure that a specific document contains information relating to a specific company. There are many complex data types, like information regarding a locality: latitude, longitude, zip code, region name, region code, department code, et cetera, which relate to French localities, for instance. More generally, we would like to be able to retrieve some common codes used in the French administration. For instance, we would like to be able to retrieve the identifiers of French health care establishments, which is useful for analyzing data around COVID-19, for instance. There are also other kinds of existing tools. And that was the beginning of our thinking and of the development of a specific tool that we called CSV Detective. So we will talk next about this tool: how we can use it, how it works, and how we can evaluate it. For that, I will let Antonin speak and take the next slide.

Thank you, Geoffrey. Hi, everyone. So yes, I will talk to you a bit more about what CSV Detective is and how it actually works. First, CSV Detective is a Python package made by Etalab. Next. And what does it do exactly?
First, it manages to get some classical information that we are used to having, like parsing information: encoding, headers, separators, and so on, which are specific to the CSV file. Second, and this is what is interesting for us here, for each column it computes a kind of likelihood score for each complex data type that CSV Detective can detect. So for instance, for a given column in a given CSV file, we will get something like: there is a 70% likelihood that this column is a country name. On the right, you can see some examples of the types that CSV Detective can detect. We have very common types that could exist in other countries, like latitude, longitude, JSON format, et cetera. But we also have very specific French types that must respect a very specific format. Here, for example, we have SIREN and SIRET, which are the French company identifiers. So we have very specific rules to be able to detect this kind of complex type. And in the end, all of this is returned in a JSON-like format. Next slide.

So I've been talking about a likelihood score, but in fact it's not just one, but three different scores that we have, because basically we try to leverage every piece of information in the file to identify which complex type is in each column. The first one is the field score. It leverages the content of the column; I won't detail it here because I will do so on the next slide. The second one is what we call the label score, which is based on the header. We know that, when the data is reasonably well done, the data type can more or less be detected from clues in the header, that is, in the column title. So for instance, we have a specific list of column title examples, and if there is a perfect match with that list, we give a score of 100%, so one.
And if the header of the column only contains one of the words in our list, but is not an exact match, we only give a 50% score. The last score is a machine learning score: we designed a machine learning model, trained on annotated data, to automatically detect which complex type is in the columns of a CSV file. But at the moment we mostly rely on the first two, because the model is still a work in progress and we need to improve it before actually using it. And just one word about the scores: sometimes we also combine them to get better results in the end.

Now, a very concrete example of how CSV Detective computes the field score. It's a score based on the content of the column, the actual values that are in it. For instance, here we have a column, and we try to see how well it matches the complex data type called code commune INSEE, which is the official INSEE code identifying French municipalities. You can see that it analyzes the first rows of the column, and for each element it tries to match some specific rules. In this case, a first step checks, thanks to a regex, that the content corresponds to a given format. But we also have a second step that matches the element against a comprehensive list of all possible INSEE codes in France, because not every number that matches the regex is an actual INSEE code. That way we have the most specific rule possible to detect the code commune INSEE. In this example, on the left, you can see that three elements don't respect the specific rules that we set. So we have only eight elements that respect the actual format, 8 out of 11, and the final score will be 73% in this case. In real life, of course, we analyze more than 11 rows. So this is how it works.
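As an illustration, the two rule-based scores just described can be sketched in a few lines of Python. This is a minimal sketch, not the actual csv_detective implementation: the reference set, the known labels, and the function names are all illustrative.

```python
import re

# Hypothetical reference set: the real package matches against the
# comprehensive list of all valid INSEE commune codes in France.
VALID_INSEE_CODES = {"75056", "13055", "69123", "33063",
                     "31555", "44109", "59350", "06088"}

# INSEE commune codes are five characters; Corsican codes start with 2A/2B.
INSEE_REGEX = re.compile(r"^(\d{5}|2[AB]\d{3})$")

# Illustrative list of known column titles for this type.
KNOWN_LABELS = {"code_commune_insee", "code_insee"}

def field_score(values):
    """Share of cell values passing both the regex check and the
    reference-list check (matches / total), as in the 8-out-of-11 example."""
    matches = sum(1 for v in values
                  if INSEE_REGEX.match(v) and v in VALID_INSEE_CODES)
    return matches / len(values)

def label_score(header):
    """1.0 for an exact match with a known column title, 0.5 if the header
    merely contains one of the known words, 0.0 otherwise."""
    h = header.strip().lower()
    if h in KNOWN_LABELS:
        return 1.0
    words = set(re.split(r"[\s_-]+", h))
    if any(word in words for label in KNOWN_LABELS
           for word in label.split("_")):
        return 0.5
    return 0.0
```

With eight valid codes and three bad values, `field_score` returns 8/11, about 73%, matching the example on the slide; `label_score("commune")` gives the partial 0.5 score.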
But the question that comes after, of course, is: does it actually work? What we do to evaluate it, next slide, yeah, thanks, is the following. We apply CSV Detective to the CSV files and CSV columns that we have on data.gouv.fr, and we compare its predictions with data that we manually annotated ourselves. From this we calculate some performance metrics, which I will show you afterwards, to identify flaws. Thanks to these scores, we can correct the detection methods, the rules that we use, to try to improve CSV Detective. And then we do it again, to check whether we actually improved the Python package.

We use two kinds of methods. The first one is a numerical method, very close to how machine learning classification models are evaluated: we use precision and recall. Precision answers the question: when I predict a given complex type, what is the likelihood that my prediction is actually correct? So it evaluates prediction quality. Recall answers the question: what percentage of the columns annotated with a given type were correctly detected by CSV Detective? It evaluates prediction comprehensiveness. We don't use accuracy here because it's not really relevant; the main reason is that we have unbalanced data in our case. You can see on the right that we have mostly good results, except for specific complex types like address, department code, city name, and so on, just to give you some examples. We have five minutes, OK. The second method we use is the confusion matrix. It allows us to detect which complex types CSV Detective confuses with other types. So for example, the region code is often confused with the department code, and we have trouble differentiating latitude and longitude.
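To make the evaluation concrete, here is a minimal sketch of how precision, recall, and the confusion counts can be computed over column-level annotations. The type names and function names are illustrative, not taken from the actual evaluation code.

```python
from collections import Counter

def precision_recall(annotated, predicted, target):
    """Precision: among columns predicted as `target`, the share that truly
    are of that type. Recall: among columns annotated as `target`, the
    share that was correctly detected."""
    tp = sum(1 for a, p in zip(annotated, predicted)
             if a == target and p == target)
    n_predicted = sum(1 for p in predicted if p == target)
    n_annotated = sum(1 for a in annotated if a == target)
    precision = tp / n_predicted if n_predicted else 0.0
    recall = tp / n_annotated if n_annotated else 0.0
    return precision, recall

def confusion_counts(annotated, predicted):
    """Counts of (annotated type, predicted type) pairs: off-diagonal
    entries reveal confusions such as region code vs department code."""
    return Counter(zip(annotated, predicted))
```

For instance, if a column annotated as a region code is predicted as a department code, that mistake lowers the region-code recall and shows up as an off-diagonal entry in the confusion counts.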
JSON types are often not really well detected, et cetera. So now you know how it works, and I will let Pavel talk about the future of CSV Detective.

Yes, thank you very much, Antonin. So as you heard already, this is ongoing work, and we have plans to improve it in the short term and in the medium term. We have designed these improvements along three axes. The first axis is methodology improvement. We would like to improve the system: how can we get more information from the columns of our CSV files in order to improve performance? How can we mix the different scores that we use, in order to show to the final user, or to ourselves, why a decision was taken, and how can we weight the different scores into an optimal combined score? We would of course like to go further with machine learning, but for now we still have to annotate more data in order to make it more robust. Regarding new features, first we would like to detect multi-type columns, that is, columns that contain different types of data. We would also like to generalize to more file types, going further than CSV files to Excel files, for example, or proprietary formats. And regarding the open data ecosystem, we would mainly like to allow people to create and deploy their own rules, to generalize the use of CSV Detective. Right now we work with French data, because we are Etalab and we are working with French government data, but of course this tool could be used for several other use cases, several other open data platforms. Next slide.

Finally, to sum it up very quickly, if you have to keep one thing in your head before going to sleep: we created a tool that takes a simple CSV and gives you hints about what types of data are within the CSV. We believe this is an important task for further downstream data cleaning tasks.
We use rules, and we use the headers of these columns, to determine, right now, the type of data that we are treating. And it's not easy; we have some challenges. We have a lot of rules, and this is becoming hard to maintain, so we have to find a solution for how to manage all these rules, how to address this problem. And of course, data is dirty: this is the lesson of our system, having to treat dirty data that lies to us, data that doesn't contain what it says it contains. So these are our main challenges. Next slide, Geoffrey.

So thank you very much, this is all for our part. If you have any questions, don't hesitate. This is our source code, it's open of course, and these are our emails. Thank you again, we are happy to be here, and we can answer your questions afterwards.

Fantastic. Thank you so much, Pavel, Antonin, and Geoffrey. We have one question in the Ask a Question section. The question reads: can you give an example of how you will leverage CSV Detective for data cleaning pipelines?

Yes. We can use CSV Detective to automatically analyze every resource that is published on our French open data platform. With this, we can inform people and users, on the dataset sheet, of what is contained inside the CSV file. That is a pipeline we can imagine: when someone publishes a CSV file, it automatically triggers a CSV Detective analysis, and the output file will be able to inform the users, through a specific interface, about what there is inside the dataset sheet.
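The publish-triggered pipeline described in the answer could be sketched as follows. This is a hypothetical illustration, not the actual data.gouv.fr integration: the hook name, the detector interface, and the report shape are all assumptions.

```python
import csv
import json

def analyse_csv(path, detectors):
    """Run each (type name, per-value check) detector over the columns of
    a CSV file and return per-column likelihood scores, ready to be
    serialised and displayed on a dataset sheet."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    report = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        report[column] = {
            name: sum(1 for v in values if check(v)) / len(values)
            for name, check in detectors.items()
        }
    return report

def on_resource_published(path, detectors):
    """Hypothetical publication hook: triggered when a CSV is uploaded,
    it produces the report that a dataset page could then display."""
    return json.dumps(analyse_csv(path, detectors), indent=2)
```

A detector here is just a predicate over a single cell value, e.g. `{"integer": str.isdigit}`; the resulting report maps each column to the share of values matching each type.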