So hello everyone, thank you all for being here today, and a special thank you to the organizers for putting together this great conference. I'm Julia, I work as a data scientist at Etalab, and today I will be talking about machine learning on open data sets. So what is Etalab, in a word? Etalab is a French public administration that works on many subjects relating to open data, public administrations, and public policy. Some of its main missions are to promote open data through data.gouv.fr, the French open data platform; to support data-driven public policy; and finally, through its AI lab, to exploit open data with data science and AI. This last point is what I will focus on today: how to exploit open data with machine learning.

First we will quickly see that there is an actual lack of open data in machine learning. We will try to understand why this is the case and why we should in fact be using it more in this area. To tackle this, we created DGML, a data repository for machine learning built on open data from data.gouv.fr, and I will show you the methodology we used to create this platform.

So how did this journey begin? I started working at Etalab while I was finishing my master's. I was following a machine learning course and naturally wanted to try some applications myself, so I went on data.gouv.fr, which as I said is managed by Etalab, to look for data sets on which to perform machine learning tasks. I was quite surprised by two things. First, there are thousands of data sets on the platform, but only a few were actually used for machine learning applications, such as the ones you can see on the screen. Second, I realized that with such a large amount of data you can easily get lost, and it can be hard to identify the data sets that could really be used for machine learning tasks.

We discussed this with my colleagues and realized that this was actually a much larger issue. We found that in machine learning research, for instance when comparing the performance of different algorithms, it is very often the same small set of data sets that gets used. You can find some of these very famous data sets in the table I put here, and you can see that they have pretty good meta features: they don't have too many missing values, and most of them have a reasonable size. This led us to wonder whether these data sets actually represent the reality and the challenges of open data.

Because, you know, sometimes open data looks like this. This table is an example from a data set on data.gouv.fr. You can see there are a lot of missing values and some unusual data types, such as codes. In the second column you have a categorical variable that actually has a lot of categories; for simplicity I only show a few of them, but you can sometimes have data sets with hundreds or even thousands of categories, and this can be pretty hard to handle.

This leads us to the first reason why we believe open data should be used more in machine learning: the challenges I just showed you could be used to stress-test machine learning algorithms, and to help evaluate and compare their performance on a larger number of data sets.
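To make the issues in that example table concrete, here is a minimal sketch of how one might quantify them with pandas. The small DataFrame below is made up for illustration (the column names and values are not the real data set):

```python
import pandas as pd
import numpy as np

# Illustrative table mimicking the example: codes stored as strings,
# a categorical column that would have many distinct values in practice,
# and plenty of missing values.
df = pd.DataFrame({
    "commune_code": ["01001", "01002", None, "01004", None],
    "category": ["A", "B", "C", "D", "E"],   # imagine hundreds of distinct values
    "value": [12.5, np.nan, 7.0, np.nan, np.nan],
})

# Share of missing values per column
print(df.isna().mean())

# Cardinality of the string/categorical columns
print(df.select_dtypes(include="object").nunique())
```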
And since open data covers such a large number of topics, we can imagine quite insightful and interesting applications for education, for research in various areas, for business, and eventually to support public policy. But if open data is so great and we should definitely use it more in machine learning, then why isn't this the case? We identified three main reasons; maybe there are more that you can share with us. The first, and an important one, is a lack of data quality. On open data platforms, most of the time anyone can upload a data set, so a data set can come in various formats that can be challenging to handle: we would all like to have nice CSV files, but sometimes PDF data gets uploaded, among many other formats that make the task pretty challenging. You can also have issues with the data content itself: data sets that are too small to train a machine learning algorithm, or that contain too many missing values. These two problems lead to a need for a lot of pre-processing, and this can discourage people from using open data in machine learning applications. Then, we also believe there might be a lack of communication about these platforms; people are not always aware of their existence. And finally, and this is what we decided to work on, there is a lack of catalogs, of data repositories, that are specifically specialized in machine learning.

So what we did was to look at the most famous existing data repositories, mainly the UCI Machine Learning Repository and OpenML, and we decided to create our own data repository for machine learning using open data from data.gouv.fr. Here is the challenge: there are thousands of data sets on data.gouv.fr, and among these thousands we wanted to identify a set of data sets adequate for machine learning. Naturally, the question was: how do we select them? How do we identify a data set that is suitable for machine learning?

At first we had a rather naive approach, I'd say, which you can see on the left of the diagram. We manually selected from data.gouv.fr data sets that were either already popular for machine learning or that we knew, content-wise, could be used in a machine learning algorithm. For each of these data sets we generated a statistical profile with pandas-profiling, and then we had to look at each profile by hand. We only kept the data sets that would actually make sense when trained and tested with a machine learning algorithm: for instance, data sets of an adequate size, or with a variable that could be used as a target variable. We then automatically trained and tested models on them using the MLJAR library. But of course this is a very time-consuming approach, and we wanted to come up with a faster, more efficient way to select these data sets.

So we also built an automatic approach, which you can see on the right of the diagram. We took a sample of the data sets available on data.gouv.fr and filtered them according to the four conditions you can see on the screen: mainly conditions on the size of the data sets, plus checks that there were both numerical and categorical variables and that the number of missing values was low. Then, again, we ran pandas-profiling, and thanks to the profiles we identified columns that were difficult to treat: categorical columns with a lot of categories, highly correlated columns, and unsupported columns such as constant columns.
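As a rough illustration of that automatic step, here is a minimal sketch of how such a filter and profile could be written with pandas and pandas-profiling (the package has since been renamed ydata-profiling). The talk does not give the exact thresholds behind the four conditions, so the numbers below are placeholders, not the real ones:

```python
import pandas as pd
from pandas_profiling import ProfileReport  # now published as ydata-profiling

def passes_basic_filters(df: pd.DataFrame,
                         min_rows: int = 500,            # placeholder threshold
                         max_missing_ratio: float = 0.3  # placeholder threshold
                         ) -> bool:
    """Rough version of the size / variable-types / missing-values conditions."""
    has_numerical = not df.select_dtypes(include="number").empty
    has_categorical = not df.select_dtypes(include=["object", "category"]).empty
    missing_ratio = df.isna().mean().mean()
    return (len(df) >= min_rows
            and has_numerical
            and has_categorical
            and missing_ratio <= max_missing_ratio)

df = pd.read_csv("some_dataset_from_data_gouv_fr.csv")  # hypothetical file
if passes_basic_filters(df):
    # Generate the statistical profile used to spot difficult columns
    # (high-cardinality categoricals, highly correlated or constant columns).
    ProfileReport(df, title="Dataset profile").to_file("profile.html")
```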
Then, once again, we automatically trained a set of machine learning algorithms on these data sets. Why did we choose to train these algorithms? Because we actually wanted to provide users with all the information they could need to perform a machine learning task on their data set.

So let me just show you what this platform looks like; I'll just change the screen I'm sharing. Here is what the website looks like. On the right you have all the data sets that are available, and on the left you can select the features you are interested in. Let's imagine, for example, that you are passionate about data and want to do some machine learning on open data sets, for instance a regression. This data set seems interesting: it is a data set about car pollution in France. By clicking here you will find the data set on data.gouv.fr. But you don't really want to download it and inspect every column by hand, so by looking here you have the data dictionary: you can check all the variables in the data set, their descriptions, and the type of each variable. This can save you a lot of time and help you better understand what you could do with this data set.

Then, one of the things you might be interested in when doing a machine learning task, or more generally a data science task, is to check some statistics of your data set. pandas-profiling automatically does this for you, so you don't have to compute it on your own: you can check some general statistics here, the distribution of each of your variables, the correlations (you can see there are some interesting ones in this data set, for instance), and also the distribution of missing values.

So now you are convinced that this data set looks interesting, and you would like, for instance, to do a regression using the CO2 variable as the target variable. You might wonder: should I use a decision tree? Should I just train a linear regression? By clicking on the MLJAR profile you get an overview of what the performance of each of these algorithms would be if you trained and tested them on your data set. For instance, you would learn that the XGBoost model would be the best-performing one. Here you can check the hyperparameters that were used, the metric values, the learning curves, and some feature importances. And that is it: you have a lot of information about your data set that you can leverage to build an interesting machine learning model. Finally, you can use some examples we put here for you: this one is an application that someone built on data.gouv.fr, and this one is a very simple piece of code we wrote just to give you an idea of what you could do with this data set.
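For readers who want to reproduce that kind of MLJAR profile themselves, here is a minimal sketch using the mljar-supervised AutoML library, assuming the car-pollution table has been loaded into a DataFrame with a co2 column. The file name and column name are assumptions for illustration, not the platform's actual code:

```python
import pandas as pd
from supervised.automl import AutoML  # mljar-supervised

df = pd.read_csv("car_pollution.csv")   # hypothetical export from data.gouv.fr
X = df.drop(columns=["co2"])            # assumed name of the target column
y = df["co2"]

# "Explain" mode trains a handful of models (baseline, linear model,
# decision tree, random forest, XGBoost, ...) and produces reports
# similar in spirit to the MLJAR profiles shown on the website.
automl = AutoML(mode="Explain", ml_task="regression", total_time_limit=600)
automl.fit(X, y)

print(automl.get_leaderboard())         # compare the trained models
```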
Now let me come back to the slides. We had a look at the website, and as you can imagine we would like to improve the platform. Namely, we would like to have more and more quality data sets, which means improving the methods we use to select them, for instance by improving our filters. To do that, we need to understand the meta features: here you can see some of the meta features of each of our data sets, and we would like to understand which of them most influence algorithm performance. In other words, the question we want to answer is: what makes a data set a good data set for machine learning?

There are many ways to try to understand that. The first experiment we did was a simple linear regression on the meta features of the data sets we have so far, using the metric value of the algorithms as the target variable. What we found was that the size of the data set and the number of categorical variables were the ones that influenced the algorithms' performance the most. Maybe in the future we can use this information to improve our filters and to better understand when a data set is a good data set for machine learning.

This leads us to the ideas we have for the future. We would like to keep investigating what makes a data set good for machine learning; this will help us increase the number of data sets on the platform. Also, as I told you, we are very interested in testing machine learning applications with open data, and we plan to do this, for instance, in scikit-learn examples. Finally, we believe it is very important to build a stronger link with the data.gouv.fr community and to encourage people to use these data sets, and ideally we would also like to generalize our methodology to other data platforms.

So here are some key takeaways for you. We saw that there is a lack of open data in machine learning applications and research. To tackle this, we created DGML, data.gouv.fr Machine Learning; you can see the link here, and maybe you can try some applications yourself. And I showed you the methodology, which we would also like to improve, that we used to identify data sets adequate for machine learning. We hope this is a first step towards using more open data in machine learning. I thank you very much for your attention, and I welcome any questions you would like to ask me.

Thank you so much, Julia, this is great. We have one question currently, and everyone is welcome to add more questions if you have them. The question is about one of your slides, I think slide eight: someone says they'd love to learn more about your choice of filters and conditions as presented on that slide.

Okay. So we tried to choose filters that allowed us to select data sets that could be used to train and test a meaningful algorithm. This is why, for instance, we removed highly correlated columns and categorical columns with high cardinality: it wouldn't really make sense to keep columns that are too correlated, because that would bias our model. As for the first filter, it was not actually that easy to define which features make a data set good for these purposes, so we took some results from the literature, and the other conditions were mostly, I'd say, common-sense conditions that we came up with.
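As a rough sketch of the meta-feature experiment mentioned above (a linear regression of algorithm performance on data set meta features), here is how it could look with scikit-learn. The meta-feature names and the CSV file are illustrative assumptions, not the actual data behind the slide:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical table: one row per (data set, algorithm) pair, with meta
# features of the data set and the metric value of the trained model.
meta = pd.read_csv("dgml_meta_features.csv")   # illustrative file name
features = ["n_rows", "n_columns", "n_categorical", "missing_ratio"]  # assumed names
X = StandardScaler().fit_transform(meta[features])
y = meta["metric_value"]                        # assumed target column

reg = LinearRegression().fit(X, y)

# With standardized features, coefficient magnitudes give a crude idea of
# which meta features influence performance the most.
for name, coef in zip(features, reg.coef_):
    print(f"{name}: {coef:.3f}")
```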