Thank you very much, Zep, for the nice introduction. So today I will not talk about brain imaging; this is a new area of research that we've started in the team, on dirty data. The reason we've started this is that, as we all know, data science is 80% of the time spent preparing the data and 20% of the time spent complaining about the need to prepare the data. So let's address those 20% of the time. Really, the thing is, with modern machine learning tools such as scikit-learn, machine learning is easy and fun, and we like to do it, but the problem is really getting the data into the learner. Industry surveys show this: this was a survey by Kaggle a few years ago that asked, what is the most blocking aspect of running a data science project in your organization? Dirty data came out on top, above things like hiring the right talent. So, seeing this, we thought, well, let's tackle the problem. And when we thought "let's tackle the problem", we didn't know what that meant; I'm not sure we know these days. I guess everybody has their own dirty data problem, but at least we've understood a few things.

One thing that we've understood is that every machine learning research paper starts with "let X be a numerical matrix that lives in a matrix space", and if we're going to implement this it's going to be, well, "give me your data as a NumPy array". And we've always said, sure, you're going to have to transform your data from whatever kind of data you have to the NumPy array, but that's your job, not ours. In real life the data, best case, comes as a pandas data frame, so it's not exactly a numerical matrix. The first thing is that we will need to transform the different columns in different ways to cast this to a numerical array, and I want to talk a bit about how to do this with scikit-learn, because scikit-learn has gotten much more pleasant for this in the last few years. But then we're going to hit a set of hard problems. One of them is the fact that one of these columns is not a well-formatted categorical column, and for machine learning it falls a bit between the cracks. Another one is that we might have missing values, and that also raises problems.

So the outline of my talk is going to be: I'll talk a bit about transforming columns with scikit-learn, and here I just want to emphasize things that are feasible with modern scikit-learn and that can make your life easier — this is just vanilla scikit-learn. Then I'll talk about the problem of dirty categories; this is more of a research talk, even though we do have software that you can use. And then I'll talk about the problem of learning with missing values; this is more of a statistical talk, but there will be take-home messages.

So, column transformation: the goal is to start with pandas data frames and come out with a well-formatted NumPy array that can easily be plugged into statistics such as scikit-learn — it's a pre-processing problem. Often the way we get our data is that we read it from a CSV file, so we do this with pandas and we get a data frame that has different types in different columns, and our goal is to convert all those values to numerical values. Let's look at gender. Gender is a categorical column, and we're going to transform it into a numerical column; the standard way to do this is to use one-hot encoding.
In scikit-learn we'll use sklearn.preprocessing's OneHotEncoder and call its fit_transform method on the gender column; it outputs indicator columns of zeros and ones that indicate the different genders. Now, for dates we can use the pandas datetime support: pandas deals quite well natively with this kind of string, it knows how to convert them to datetime objects, and once we have the datetime objects we can take their value as a float — the time since the epoch. So it's a numerical value that is reasonably well ordered, and hopefully we can learn from it. Hopefully.

Something I'd like to stress is that in scikit-learn we like to work with things that we call transformers. If we look at the OneHotEncoder, we can actually split the fitting of the one-hot encoder and the transforming: the idea is that during the fit we store which categories are present in the data, and during the transform we encode the data accordingly. This separation between fit and transform is quite important because it avoids data leakage between the train and the test set when we're evaluating a pipeline, and we can also store the fitted transformer and apply it to new data at predict time — for production, for instance. And it can be used with a bunch of tools in scikit-learn such as the Pipeline or cross_val_score, which is used to do cross-validation.

So for dates it might be useful to shoehorn our pandas code into such a pattern, and for this we can use the FunctionTransformer. We define a small function that takes as input the pandas data frame, or the pandas column we're interested in, and returns as output a 2D array of numerical values; it's just taking the code that we were writing with pandas, putting it in a proper function, and making sure that we return a 2D output. Once we have this, we can use sklearn.preprocessing's FunctionTransformer, give it this function, and tell it that we don't want validation — because if we ask for validation it's going to try to check that the data at the input is well formatted, and it's not, so it will complain. FunctionTransformer can be a bit more clever — you can tell it how to inverse-transform, it's a more sophisticated tool — but I won't go into details. What I just want to stress here is that it can be useful to look at the modern preprocessing documentation of scikit-learn, because it has many useful tools for this purpose. And once again, pipelines are good.

Now, how do we put a pandas data frame in a pipeline and apply different transformers to the different columns? For this we can use the ColumnTransformer object. The ColumnTransformer takes a list of pairs of transformers and selectors of columns, and the selectors of columns can be, for instance, column names. So here, with this code, I'm saying that I want to apply a one-hot encoder to the gender and employee position title columns, and my date transformer to the date-first-hired column. Now I can call the column transformer on a data frame.
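To make this concrete, here is a minimal sketch of what it can look like, jumping slightly ahead to the full pipeline with cross-validation described next. The file name, the column names (taken from the employee salaries example) and the HistGradientBoostingRegressor at the end are assumptions on my part, not code from the slides:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
# On scikit-learn 0.21/0.22 this first needs:
# from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

df = pd.read_csv("employee_salaries.csv")        # hypothetical file name
y = df["current_annual_salary"]                  # assumed target column
X = df.drop(columns=["current_annual_salary"])

def to_timestamp(dates):
    # Parse the date strings and return a 2D numerical array
    # (nanoseconds since the epoch).
    return pd.to_datetime(dates.iloc[:, 0]).astype("int64").to_numpy().reshape(-1, 1)

preprocess = ColumnTransformer(
    [
        ("one_hot", OneHotEncoder(handle_unknown="ignore"),
         ["gender", "employee_position_title"]),
        ("date", FunctionTransformer(to_timestamp, validate=False),
         ["date_first_hired"]),
    ],
    sparse_threshold=0,  # force a dense output for the gradient boosting step
)

model = make_pipeline(preprocess, HistGradientBoostingRegressor())
scores = cross_val_score(model, X, y)            # cross-validation on the raw data frame
```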
It does all the magic, and out comes a NumPy array. So I can build complicated pipelines using this kind of pattern, get my raw data — at least my raw data frame — in, and then use scikit-learn on it. This is useful for cross-validation, for instance. The benefit, really, is that we can use all the tools in scikit-learn for model selection: for instance, we could pipeline this column transformer with the fast gradient boosting that's new in 0.21, and then just apply cross-validation on the raw data. So if you're not using it, you should probably be using it, and if you think it can be improved, file an issue.

Now, if we do this on the example that I'm using, we're going to hit a problem, and the problem is with the employee position title. The reason is that there are many, many different entries in this title: for 10,000 rows there are 400 unique entries. That leads to a bunch of different problems. Some of them are computational: it's just going to take a lot of time to run. But some of them are statistical. The reason they're statistical is that we might have some rare categories — there's only one instance of "Architect III" in the data set. We might have some overlapping categories: we have different instances of police officers, and the link between those instances is not obvious if we don't look at the string content and just consider these things as discrete categories. And finally — it's a detail, but it's a real problem in practice — we might have new categories in the test set. So basically, one-hot encoding doesn't work well at all with this kind of data, and sometimes we have this kind of data.

The standard practice here is to resort to data curation — cleaning your data — mostly techniques from database normalization. One thing that we could do is feature engineering: we could try to separate the position from the rank, and maybe separate the position, the rank and the department. This would require building rules, which we might apply with pandas on strings, to separate those things out. The problem is that it's going to take a little while to build those rules, and they usually have to be handcrafted.

Another related problem: in a different database, here we have company names, and we have the same company expressed under different names. That's a canonical problem of database curation, known as deduplication or record linkage, the goal being to output a clean database — basically to merge those different entries, those different entities, and represent them as the same entity. Now, this is quite difficult to do in general without supervision: you usually need an expert who shows a set of merges so that an algorithm can learn how to do them. And one problem is that it can be suboptimal, because here the challenge is to detect fraud in payments to doctors, and it's a real question whether we should merge the Pfizer Hong Kong branch with the Pfizer Korea branch. Maybe they should be considered the same entity, and maybe not; that really depends on the question at hand. So the problem with this view is that the goal is to output a clean database, which may be a question-specific point of view — what is a clean database? — and in general it's something super hard. So really, all these ideas are hard to make automatic and turnkey,
and I'd like to claim that they are much harder than supervised learning. Supervised machine learning is a toolbox that works quite well as long as you have a supervision signal. Database cleaning is a hard problem, and you will need a supervision signal — but that supervision signal is basically a clean database. So usually, for database cleaning, you first have somebody clean part of the database, then you learn rules from that, and then you clean the rest of the database. Our goal here is not database cleaning: it's working directly on the dirty data, and doing good machine learning on the dirty data. The point being that the statistical question — the supervised learning problem — should inform the curation, and ideally we shouldn't even curate.

So, a first piece of work we did with Patricio Cerda — and I should stress that this part is really the work of Patricio Cerda, who is doing a PhD in my group. The first thing we did is that we took one-hot encoding and relaxed it: basically, instead of having zeros and ones, we used string similarities between the representations of the categories and encoded with those instead of zeros and ones. That really tackles the problem of new categories in the test set, because if there's a new category in the test set that's not represented in the train set, I can just look at its string similarities to the categories in the train set. And it also allows us to link categories: if, for instance, I have typos in my columns — which is something that does happen — the typos are going to give very high similarities, and those entries are going to look very similar.

There are different string similarities that we could be using. Maybe the most well-known one is the Levenshtein distance: it is basically the number of edits that we need to apply to one string to match the other — that's really a classic one. There's the Jaro-Winkler distance, which is the number of matching characters renormalized by the number of character transpositions; it's widely used in the database community. And there's what I call the n-gram, or Jaccard, similarity. We define an n-gram as a group of n consecutive characters: for instance, if I have "London", the first 3-gram will be "lon", the second "ond", the third "ndo", and so on — these are 3-grams. We take all the 3-grams, and then, to compute the similarity between two strings, we look at the number of n-grams they have in common divided by the total number of n-grams. If the two strings are the same, they have all their n-grams in common, so this is one; if they're completely different, they have no n-gram in common, so this is zero. So these are three classic string similarities.

Because this is a Python conference, we're giving you a Python implementation. We have this software that we call dirty_cat, for dirty categories — and it allows me to put pictures of cats on my slides, which is crucial. It's available online, BSD license and everything. It's somewhere between research-quality and production-quality software; I think it's reasonably good quality — not as high quality as scikit-learn, but it comes with documentation, examples and everything, so you can look at it.
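To make the 3-gram similarity concrete, here is a minimal pure-Python sketch of the idea — just an illustration, not dirty_cat's actual (vectorized) implementation:

```python
def ngrams(string, n=3):
    """Set of all groups of n consecutive characters in the string."""
    return {string[i:i + n] for i in range(len(string) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Fraction of n-grams the two strings have in common (Jaccard index)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:          # strings shorter than n characters
        return float(a == b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)

print(ngram_similarity("Police Officer II", "Police Officer III"))  # close to 1
print(ngram_similarity("Police Officer II", "Crossing Guard"))      # close to 0
```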
dirty_cat also comes with example data sets, and it provides similarity encoding. The SimilarityEncoder is just an encoder: it works like scikit-learn, you instantiate it, saying which similarity you want to use, and then you can transform the column of the data frame — or the data frame — you're interested in transforming. So it's a drop-in replacement for one-hot encoding in scikit-learn.

Now, I'll show you how it performs on real data, but before that, let me present another approach that has been around for quite a while and is not well enough known: the target encoder. The idea is that we're going to represent each category by the average target. For instance, we're going to represent "Police Officer III" by the average salary of the Police Officers III in our database, if we're trying to predict the salary. So this gives us a one-dimensional representation of all the categories I've shown here: all our categories are embedded in one dimension, which is the average salary. And you can see that, in this database, the person who makes the least amount of money is the crossing guard, and the person who makes the most is the Manager III — Manager II, actually. By the way, this is maybe a bit surprising: the order of the managers doesn't make sense — Manager III makes less money than Manager I, who makes less money than Manager II. Why is that? Because those are average salaries, and we might have people with different levels of experience, or I don't know what. It's also telling us that this embedding is not a perfect signal; it's a noisy signal. But it's useful, because it embeds the categories close to each other when they have the same link to y, and that helps us build a simple decision function to do prediction from this representation.

Now, this comes with drawbacks. The first one is that it doesn't know how to deal with a new category: if you give me a category that I've never seen, I don't know its average salary, so I can't represent it — I can represent it by the average salary of everybody, but that's losing a bit of information. And the other thing is that it's absolutely not using the string structure of the categories: with typos, for instance, it will not find the links between them unless it sees enough of those typos to notice that they link to the target in the same way. So I'd like to say that it's really a complementary approach to ours: it takes a different point of view, and it's very interesting too. It's also available in dirty_cat, because our goal with dirty_cat is not to sell the methods that we develop but to help solve a problem, which is dirty categories. So there's a TargetEncoder — oops, I was editing this too late yesterday evening: TargetEncoder does not take a similarity argument.
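Roughly, using the two encoders looks like this — a sketch assuming the dirty_cat API at the time of the talk and the column names of the employee salaries example; check the dirty_cat documentation for the exact signatures and options:

```python
from dirty_cat import SimilarityEncoder, TargetEncoder

# Drop-in replacement for OneHotEncoder; the default similarity is the n-gram one.
sim_enc = SimilarityEncoder()
X_sim = sim_enc.fit_transform(df[["employee_position_title"]])

# Encodes each category by the average target; it needs y at fit time
# (and has options depending on classification vs. regression, see the docs).
tgt_enc = TargetEncoder()
X_tgt = tgt_enc.fit_transform(df[["employee_position_title"]], y)
```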
So Patricio Cerda ran numerical benchmarks on real-life data — here using seven real-life data sets — to compare the different approaches. We benchmarked linear models and gradient boosted trees, and what I'm showing you here is the average rank of the different methods across the data sets; a rank of one would mean that the method was always the best predictor across all the data sets. There's more in the paper — we benchmarked many other methods, but I'm really giving the executive summary, because many of the methods that we benchmarked were not helpful. What you can see is that the target encoder helps: with gradient boosted trees it helps compared to one-hot encoding. One thing that is not visible in those numbers is that gradient boosted trees do much better than linear models, so I would advise you to focus on gradient boosted trees in practice; they're much more useful for this kind of data. So target encoding helps a lot. And then, for similarity encoding, what we found is that the 3-gram similarity was really the most helpful one, and the others are not as helpful. So our take-home message is really: you can focus on similarity encoding with the 3-gram similarity. Though it might be useful, for instance, to build a pipeline that stacks both a target encoding and a similarity encoding, because these two encoders capture different information in the data — and that's easy to do, by the way, with a ColumnTransformer: you just select the column twice and send it to the two different encoders.

OK, now, in practice we're going to hit a problem, which is that in many — though not all — databases, the number of different categories grows with the amount of data. That's the second piece of work we did with Patricio; we've now gathered more data sets and moved to 17 data sets. It's actually hard to find data sets that are not curated and that have an open license — people do not like to share their non-curated data sets. So please, please do share your non-curated data sets: that's the only way we can develop better methods. What you're seeing here is that, across many data sets, as we increase the number of rows, the number of different entries that we see in a given column keeps increasing — sometimes very fast, sometimes more slowly. And that gives us a problem, because it means that if we're going to use the similarity encoder, we're going to blow up the dimension, and we're going to end up running gradient boosting on things that have a hundred thousand features, which is not only bad statistically but will also take a lot of time. This is related to problems in, for instance, natural language processing, where, as the corpus of text gets bigger, the number of different words that we see keeps increasing — so it's quite related to classical natural language processing problems. So we need to tackle this, otherwise we can't give this to you as a turnkey method that you can apply to larger databases.

Both similarity encoding and one-hot encoding are prototype methods. What I mean by "prototype method" is that they compare the data to a set of prototypes, and by default the prototypes are all the categories in the training set. The challenge now is to choose a small number of prototypes, to be able to scale. We can take all of the training set — that's what we take by default — but it blows up. We can take the most frequent categories,
We can take the most frequent But it's a strategy that's easy to game you can easily have a Dataset that breaks the strategy and one of the problems is that the most natural prototypes may not be in the training set for instance if my training set is made of Big cat fat cat big dog fat dog I probably want to break this in big and fat and cat and dog and none of these original entries actually have the right terms Okay, so I need basically to break down my my categories. So now I'll tell you how we estimate those prototypes and The thing that is going to save us is that why those those different strings grow They have common information Here I'm showing you the growth in the number of three grams as I increase the number of strings and what you can see is that It's a smaller growth than the number of different strings and this makes sense because for instance if this Dirtiness this diversity of the string is made from typos Then typos actually modify a small fraction of the string So yes, I will I will have new three grams, but most of the three grams will be in common in Practice if I look at my data sets you can really clearly see this that the substrings are in common For instance in this drug name Dataset I can see that I have many different versions of alcohol, but they're all versions of alcohol So there's alcohol in common everywhere in my employee salary problem I have substrings that are really meaningful police is in common Officer is in common technician senior So the challenge is going to grab this information and capture those meaningful substrings and for this we're going to use techniques from topic modeling and in natural Language processing and we're going to apply topic modeling on substrings So what we're going to do is that we're going to represent all the all the strings as Their substrings using an n-gram representation and here I'm shown a three gram representation, but in practice We're doing a bit something slightly more sophisticated than this We're taking the two grams the three grams the four grams and also the worst that we've split with a set of separator separating characters that We have default values, but you can change them So then we build a big matrix that represents each entry by its substrings and then we apply matrix factorization on this really matrix factorization what it's doing here is To say I will I will separate this matrix in two matrices One matrix that is what I call the descriptions of the latent categories and it tells me what substrings are present in a latent category And another matrix which is what latent categories are present in a given entry, okay? 
So I'm really factorizing into descriptions of latent categories — categories, or prototypes, that I'm inferring from the data — and into how those categories are expressed in the data. To give you an example of the result: we're using the activation matrix — the one that expresses which latent categories are in an entry — to represent the data, and this is what I'm showing here. These are the employee position titles from the employee salary data; I've run the model with a dimensionality of 8, and these are the loadings that are shown. What you can see, if you squint your eyes, is that it has detected something like "technician", "legal", "police" — it has detected those substrings. One thing I'm not showing here that I should be is that we use a heuristic to give a name to those columns, and the name is really: what are the three words that are most represented in each column? This is useful because it gives you feature names — we're encoding this with feature names — so compared to a similarity encoder it's much more interpretable, and then we can do interpretable data science. For instance, we can look at permutation importances of gradient boosted trees with the categories that were inferred from the data, and this is what I get here. So what I'm showing you is that, from this messy data, I've inferred latent categories that make sense, on which I can do an analysis and present it to you. And, by the way, it also predicts well: in the paper we show that it gives you good prediction. So you don't have to clean your data anymore.

Now I want to talk about one last thing, which is learning with missing values. We've dealt with this non-formatted categorical data, and now we need to deal with the fact that some of our values are missing. So why doesn't the bloody machine learning toolkit work on this? There is a fundamental reason: machine learning models, in general, tend to need entries in a vector space, or at least a metric space, or at least an ordered space — it's just easier for machine learning to draw analogies if it knows the links between data points — and a missing value lives nowhere in such a space. So it's slightly more than an implementation problem; there is a fundamental problem there.

There is a very advanced and thorough literature on missing values in statistics; let me summarize it really quickly for you. The canonical model is that we have (a) a generating process for the complete data and (b) a random process that occludes entries — this is really the conceptual model on which the classic results stand. Then there's a really classic situation, known as missing at random (MAR), which says, hand-waving, that the probability of missingness does not depend on the non-observed values. This might seem a bit mind-blowing; if you look at the actual definition, it's even more mind-blowing, and people simplify it because it doesn't really make sense — and it's true, it doesn't really make sense on its own.
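For reference — this formalization is not on the slides, I'm adding it here in standard notation — writing $M$ for the missingness mask and splitting the data $X$ into its observed part $X_{\mathrm{obs}}$ and its unobserved part $X_{\mathrm{mis}}$, missing at random states that

$$P(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}) = P(M \mid X_{\mathrm{obs}}),$$

i.e. the missingness pattern may depend on the observed values but not on the values that were not observed.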
The reason this definition exists is that, in a likelihood framework, it allows you to prove — and this was proven by Rubin 40 years ago — that maximizing the likelihood of the observed data while ignoring (marginalizing, in technical terms) the unobserved values gives the maximum-likelihood model of the complete-data generative process. So it means that if you are doing classic statistics — modeling your data with likelihood models that you believe, and you believe you have an occluding process — you can still solve the problem when you're in a missing-at-random situation. Missing completely at random (MCAR) is a special case of this situation, where the missingness is independent of the data; it's easier to understand, and the theorem still applies. Conversely, if you're in a missing-not-at-random situation — if you're not in this situation — then the missingness is not ignorable: if you try to maximize the likelihood while ignoring the missing data, you will have problems.

In practice, what does it look like? I've shown you complete data. I'm showing you missing completely at random, which is basically sub-sampling — here I'm deleting the rows with missing values, so they're not in the data set. And I'm showing you missing not at random, and what you're seeing there is that we have some form of censoring process, and part of the data distribution is not well represented. So this will give problems.

Now, I would like to say that this classic statistical point of view is not of interest to us here — or at least not completely of interest — and we shouldn't take those results as fundamental results for machine learning, for two reasons. One is that there is not always an underlying non-observed value: for instance, what is the age of the spouse when the person is single? So even this assumption is broken in many, many data sets. The second one is that we're not trying to maximize likelihoods; we're trying to predict.

Now, based on this we can just do machine learning — but the bloody machine learning toolkit still doesn't work; I've given you theory, not practice. OK, practice — I'll come back to this theory later. We can impute, and this goes back to the theory from before. Imputing means we're going to fill in the information, we're going to guess the values we haven't seen. Once again, there's a large statistical literature on this, but it's focused on the in-sample setting: it doesn't tell you how to complete the test set, and it doesn't tell you what to do for prediction. So let me cover a bit the tools we have in scikit-learn. There's mean imputation, which is a special case of univariate imputation: we can, for instance, replace the missing values with the mean of the feature — this is done with the SimpleImputer. And there's conditional imputation: the idea is that you model one feature as a function of the others, so you can learn predictive models across features and then predict the missing values.
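As a minimal sketch of these two tools — X here is a small made-up numerical array, with NaN marking the missing values:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (still experimental)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan], [7.0, 8.0]])

# Univariate imputation: replace each missing value by the mean of its column.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Conditional imputation: model each feature as a function of the others
# (a Bayesian ridge regression by default) and predict the missing values.
X_iter = IterativeImputer().fit_transform(X)
```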
There are classic implementations of this, and we now have an implementation in scikit-learn, the IterativeImputer, that can do it with linear models, random forests, or other estimators. The classic point of view tells you that mean imputation is a very, very bad thing, because it distorts the distribution: as you can see here, I've imputed the missing data with the mean, and we're collapsing the variance of the data along one direction — so, classic point of view, we shouldn't be doing that. And there are conditions on an imputation, known as congeniality conditions, which tell you that a good imputation must preserve the data properties used by the later analysis step.

Now, we've looked at supervised learning in this setting, and we've proven that if the learner is powerful enough — like a random forest or a good gradient boosted tree — imputing both the test and the train set with the mean of the train set is consistent, in the sense that it converges to the best possible prediction. The reason is: (a) we're not trying to maximize likelihoods, and (b) the learner will learn to recognize the imputed entries and will compensate for them — the learner basically learns those biases in the distribution and fixes them. So we don't have to worry about the classical results. In practice you can see it here: I'm comparing mean imputation and iterative imputation, and what we can see is that if I have enough data they perform equally well; if I don't have enough data, then the iterative imputer does better. The notebooks are online, and the slides are online. So the conclusion is: when we have enough data, the iterative imputer is not necessary — mean imputation is enough — but when we don't have enough data, the iterative imputer helps.

Now, imputation may not be enough. Here's a pathological example, where what I'm trying to predict depends only on whether the data is missing or not. Suppose I'm trying to predict fraud, and the only signal for fraud is that people have not filled in some information. This falls into the missing-not-at-random situation, and in such a situation imputing makes prediction impossible: if I impute, I'm losing this information, and I can't predict anymore. So what's the solution? The solution is to add a missingness indicator: an extra column that tells me whether or not the data was present — so I can impute, but also expose to the learner whether or not the data was present. And if I do this — this is another simulation, where we have a specific censoring in the data — what you can see is that both the mean and the iterative imputer are consistent, they converge to the best prediction, if there is the indicator. But the iterative imputer doesn't work well at all if the indicator is not there. So adding this indicator, this mask, is absolutely crucial, and the other thing we can see is that iterative imputation in this situation is actually detrimental, because it makes it harder for the learner to see this missingness pattern. So basically we have two situations: one where the missingness is not informative, in which case the iterative imputer is better; and one where the missingness is informative, in which case the iterative imputer can actually harm, because it makes it harder for the learner to learn this informative missingness.
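A minimal sketch of that take-home recipe — impute with the mean but also expose the missingness mask to the learner; the choice of gradient boosting as the downstream learner follows the talk's earlier advice and is otherwise my assumption:

```python
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# add_indicator=True appends one extra 0/1 column per feature that had
# missing values at fit time, telling the learner whether the value was observed.
model = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    HistGradientBoostingRegressor(),
)
```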
Now, to wrap up: learning on dirty data. First take-home message: prepare the data with a ColumnTransformer — that's easy. Second take-home message: use gradient boosting; in my experience it really works well on this kind of data, it's robust to all kinds of weird entries in the data — it's probably the first thing you should try. Dirty categories: we're interested in statistical modeling on non-curated categorical data. Please help us and give us your dirty data with a prediction task — it helps us benchmark what we do, and it's very important. We have similarity encoding, and more work that's coming up really soon. Supervised learning with missing data: mean imputation with a missing indicator is actually a pretty good choice; there are many more results in the paper. And in general, if you're interested in this area of research, we have this research project that we call "dirty data"; there is ongoing research, and there will be more. Thank you.

[Chair] Thank you, Gael. We have five minutes for questions; please come to the microphones in the aisles.

[Audience] Thanks, I liked the talk a lot. Maybe not the best question for EuroPython: is there a version of dirty_cat also for R, and if not, do you think it would be easy to port it?

[Gael] dirty_cat should be fairly easy to port. Well, dirty_cat has several things and it will grow. For target encoding — one of our colleagues, Joris Van den Bossche, who is also a pandas developer, found that there is a better way to do target encoding, so we're going to fix this, we're going to improve target encoding — but both target encoding and similarity encoding are fairly easy to code. Code the n-gram version of similarity encoding; don't bother with the other ones. But yeah, please do it, go ahead. There's one in Spark.

[Audience] Hi, thank you for the talk, it was very interesting. I have two questions. The first one is: why three? Why an n-gram size of three — did you test other numbers, or is three the gold standard that everyone should use?

[Gael] No, three was more for didactic reasons. In practice, what we're using these days is the 2-grams, the 3-grams, the 4-grams, and the substrings separated by specific characters such as spaces. We did benchmark this, but we only have 17 data sets, so our benchmarks are not fully trustworthy; we need more data sets to do more benchmarks.

[Audience] And the second question is: what if you have missing data in the dirty category? What if you do not know if it's a police officer or a janitor or something?

[Gael] Good question — yeah, I forgot, I should have mentioned this. Missing data is more of a problem for continuous values. For a categorical value I would advise, in general, to just add an indicator: represent the missing value as a specific value in your encoding — which could be all zeros, by the way. Thank you.

[Audience] Very interesting talk, practical as well, thanks. Do you have a plan to look into active learning at some point as well? I think in practice, on real-world problems, that might be interesting.

[Gael] That's not our research agenda — our research agenda is to take the human out of the loop — but it is true that active learning for database curation is extremely useful, and it probably complements what we're doing. Thanks.

[Chair] Please give a round of applause to Gael. Thanks so much.