So my name is Theridun, I'm a master's student in AI and data science, and I'm currently writing my master's thesis at CWI in Amsterdam, a research institute for computer science and mathematics in the Netherlands. My thesis is part of a project on optimizing large-scale data integration, which involves two big steps: first the extraction of tabular data out of tabular data sources, and then the integration of those tables and loading them into a database to make them explorable. The whole purpose of this is to reduce the data janitorial work, the data pre-processing that takes up a lot of effort in the life of a data scientist, an estimated 50 to 80 percent, some say even more, so there is a lot of potential gain to be had there.

The part I focus on is parsing very heterogeneous tabular data sources, and I only look at CSV files. One would expect CSV files to be quite standardized, but as we have already heard in the other talks, there are a lot of issues with reading clean tabular data from CSV files. What we want to get out of those files in the end is what we call an analysis-ready data frame: a data frame or table in canonical form, as we know it from Python or R, with one header row and specific data types, and which in the best case contains tidy data, meaning one observation per row and one variable per column.

We looked at datasets from data.gov.uk and encountered a lot of different issues, ranging from non-UTF-8 encodings, over different CSV dialects which have to be detected and syntax errors within those dialects, to metadata in the CSV files such as titles, footers, and so on. Let me show some examples. This file has an encoding issue and a sum in the last row. If we want to read and process it automatically we don't need the sum, because later on we want to analyze the data anyway; if we want the sum, we can just sum over that column. We also have visual elements in tables: they seem to be made for human readability, but when we want to read them automatically, that is of course not good. We can also have multiple header rows, and in this case a variable, headcount, which is spread across several columns; to get it into one column we would first have to collapse the three header rows into one and then reshape the table. This next example is maybe a bit clearer: we have variable names in the header row, so what we would like to have in the end is a transposed version of it.
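Just to illustrate what "tidy" means here: once the structure of such a table is known, turning a wide table into one observation per row is routine with standard data-frame tooling. A minimal pandas sketch with made-up headcount data (not the actual file from the slide):

```python
import pandas as pd

# Hypothetical wide table: the variable "headcount" is spread across year columns.
wide = pd.DataFrame({
    "department": ["A", "B"],
    "2015": [120, 80],
    "2016": [130, 85],
    "2017": [125, 90],
})

# Tidy form: one observation per row, one variable per column.
tidy = wide.melt(id_vars="department", var_name="year", value_name="headcount")
print(tidy)
```

The hard part, and the subject of this talk, is getting to the point where the structure is known reliably enough to apply such a reshaping.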
I will show that later. And so on: multiple tables per file, a lot of things which are just frustrating.

If we look at the whole corpus of data.gov.uk, as of the end of August last year there were around 20,000 CSV files, and 17,000 of them were actually parsable with the pandas CSV parser, which is quite robust and has some detection heuristics. But if you look deeper into it, there are still issues which are not covered by that parser. From that corpus we took a random sample of 100 tables and looked at their issues in detail, and if we extrapolate those numbers, we end up with only about one third of the whole corpus not having issues; the rest has some kind of problem which has to be fixed.

If you look at what state-of-the-art CSV parsers are doing, we see they are already helping us. We have encoding guessing, we have the dialect sniffer, we have robust error handling (when a row has one value too many, the parser might just throw the row out or drop the value), we have header guessers, and we have data type detection. But the middle part, such as having multiple tables in one file, is of course not covered by those parsers, because that is not what is supposed to be in a CSV file, so we also can't blame the producers of those parsers for doing something wrong. But if we want to do data type detection in the end, we can't do that across multiple tables in one file.

So we thought, okay, we just have to add those modules to those parsers. On second thought, however, if we add those elements, if we do things like wide/narrow table detection and actually start reshaping the table, we change a lot, and if we do that based on a wrong assumption somewhere, say the header detection failed and we start reshaping the table, then the results are completely messed up and we actually change the data. That is definitely not what we want. So current parsers cannot really handle complex table structures, and one wrong assumption in the process breaks the whole parsing result. They are also mostly optimized for speed, for reading huge CSV files very quickly, but if you look at the size distribution of the tables on data.gov.uk, they are mostly small; they probably often come from Excel spreadsheet exports and so on. So we are not dealing with huge files; it is about reading small files, but reading them correctly.

For example, take this small CSV file: what would you expect to happen if we parse it with dialect guessing switched on? You might expect the obvious result, the one you see when you simply look at the file, but we tried it with different parsers, also with dialect guessing, and I couldn't find one which actually parses it correctly. So why is that not working?
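The effect is easy to reproduce. Here is a minimal sketch with Python's built-in csv.Sniffer on made-up data in the same spirit as the slide (semicolon-delimited values that contain decimal commas); the exact outcome depends on the sniffer's internals, but in this case it picks the comma:

```python
import csv

# Made-up snippet: semicolon-delimited, but with decimal commas inside values,
# so both ',' and ';' occur consistently in every row.
sample = (
    "item;unit price,incl. VAT;qty\n"
    "widget;1,99;5\n"
    "gadget;2,49;3\n"
)

dialect = csv.Sniffer().sniff(sample)
print(repr(dialect.delimiter))            # ',' -- the sniffer prefers the comma

# Parsing with the guessed dialect then splits the rows in the wrong places.
for row in csv.reader(sample.splitlines(), dialect):
    print(row)
```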
We have this chain of different guessers, and the dialect sniffer looks at the semicolons and commas in the different rows. In this example there are three commas and three semicolons in each row, so it is fifty-fifty; the parser just has to choose one, and it goes for the comma because that is the most probable assumption, the one closer to the standard.

There are also dependencies: if we had known that the data types are more consistent under the second parsing option, we might have reconsidered the choice we made during dialect sniffing. So there are loops in there, and we want to untangle them somehow, and we want to avoid that the whole parsing process breaks because of one wrong assumption in the chain.

We have a bit of a motto at CWI, or at least in our group: computation time is cheaper than human time. So we thought, let's throw computation power at the problem. The idea is to look at parsing from a more holistic point of view: instead of generating one parsing result, we generate a lot of them and decide in the end which of those results looks best. We call that multi-hypothesis parsing. A tree gets built up from the different assumptions made in every step of the parsing process: in every step we have heuristics, guessing the encoding, the dialect, and so on, and each assumption holds with a certain probability. A regular parser would always go for the most probable path; it is greedy in that sense. But if we look at the results, we may actually want another path in that parsing tree.

Now the question is, how can we identify that the table in the middle is the good one? That is why we need some way to measure the quality of the tables we get out of this parsing process. The idea is to look at different properties of the table. First of all, as I said before, we don't want to generate tables that are far away from the original input, so the result should stay close to what we read in the beginning. On the other hand, the table shape is important: if we parse the table and get five different results, and some of them are nicely rectangular while one has only one row or one column, then we can be pretty sure that this parsing result is probably not as good as the other ones. We can also look at the consistency of values inside columns; that would have solved the earlier case, because we would have seen that in one parsing the currencies line up in one column, and in the other one it is less consistent.

We built a quality metric out of these properties and evaluated it on 100 manually cleaned tables against their messy versions. We could definitely measure a difference between a good and a bad table, but it is of course not significant yet.
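To make the idea concrete, here is a toy sketch in Python; it is not the actual framework (which is being developed in R), and the quality function is deliberately simplistic and made up. We generate one parsing hypothesis per candidate delimiter and keep the one that scores best on table shape and column consistency:

```python
import csv
import io

def parse(text, delimiter):
    """One parsing hypothesis: read the text with a fixed delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

def is_number(cell):
    try:
        float(cell.replace(",", "."))            # tolerate decimal commas
        return True
    except ValueError:
        return False

def quality(rows):
    """Toy quality score: prefer rectangular, multi-column tables in which
    many columns hold consistently numeric data."""
    widths = [len(r) for r in rows]
    if not widths or max(widths) < 2:
        return 0.0                               # empty or single-column result
    shape = min(widths) / max(widths)            # 1.0 if perfectly rectangular
    if shape < 1.0:
        return 0.5 * shape                       # ragged: columns hard to judge
    columns = list(zip(*rows))
    numeric_cols = sum(all(is_number(c) for c in col[1:]) for col in columns)
    return shape * numeric_cols / len(columns)

text = "item;unit price,incl. VAT;qty\nwidget;1,99;5\ngadget;2,49;3\n"
hypotheses = {d: parse(text, d) for d in [",", ";", "\t"]}
best = max(hypotheses, key=lambda d: quality(hypotheses[d]))
print("best delimiter:", repr(best))             # ';' wins on column consistency
```

In the real setting the branching happens at every detection step (encoding, dialect, header, table structure), not just at the delimiter, and the metric also takes closeness to the original input into account.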
So somehow we want to get a bigger training set for our quality metric. That is also why I would like to ask you: if you have ideas where we could get messy tables right next to their cleaned versions, I would be very happy to hear about them. A lot of people are cleaning data on a daily basis, so these datasets have to be somewhere, but I couldn't find any yet.

To wrap it up: the goal is to end up with this multi-hypothesis framework. It is modular, so we can define different parsing steps, with well-defined and untangled detection and parsing steps, and it always works stepwise: first we detect the encoding and read the text, then we go to the next level, detect the dialect, parse accordingly, and so on. We can even include stepwise error handling: if parsing with the detected dialect fails, we can choose to solve the problem in different ways. We don't have to behave like a regular parser, which has to spit out exactly one result, so if something fails along the way it may fix it somehow, but it has to settle on one option; we can fix it in five different ways and see in the end which fix turns out to be the best one. And if we can't find the optimal solution at the end of the tree, we can still ask the user and propose: here are three different tables, which one do you think is closest to what is actually in the messy file?

The goal is of course to have tidy tables in the end. We are currently working on a framework in R to enable that. It is more or less in a prototype phase, so we don't have any public source code yet, but we are planning to make it public and to make it possible to develop modules for it, for example more extensive or different table shape detection modules. That's it, thank you. Any questions?

[Audience question]

Yeah, I think if we are using it on the data.gov.uk corpus we are only dealing with very small files, so in that case it wouldn't change so much, at least performance-wise. But I think the problem also changes if we go to bigger files of a gigabyte or ten or whatever: then the person who produces the file is probably also not producing it for human readability, so we get rid of a lot of the issues which we have in the smaller files. In that case it would of course be better to look only at a fraction of the file.

[Audience question]

Yeah, that could be the data type detection module at the end which does that. We are not yet implementing the heuristics themselves, but rather the structure to allow that, so if we have phone numbers or zip codes or currencies or whatever, the framework would allow implementing detectors for them.
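To give a rough idea of the stepwise, modular structure described above, here is a purely hypothetical Python illustration (the actual framework is in R and not yet public, and every name below is invented): each detection step returns several candidates instead of one, every combination of assumptions becomes one leaf of the hypothesis tree, and a quality metric like the one sketched earlier would then rank the leaves. A type-detection module for things like zip codes or currencies could plug in as one more step in the same way.

```python
def detect_encoding(raw_bytes):
    """A detection module returns several candidate assumptions, not just one."""
    return ["utf-8", "latin-1"]

def detect_dialect(text):
    return [",", ";"]

def decode(raw_bytes, encoding):
    try:
        return raw_bytes.decode(encoding)
    except UnicodeDecodeError:
        return None                       # this branch of the tree simply dies

def parse(text, delimiter):
    return [line.split(delimiter) for line in text.splitlines()]

def hypotheses(raw_bytes):
    """Expand the tree: every combination of stepwise assumptions is one leaf."""
    for enc in detect_encoding(raw_bytes):
        text = decode(raw_bytes, enc)
        if text is None:
            continue                      # stepwise error handling, not a hard stop
        for delim in detect_dialect(text):
            yield (enc, delim), parse(text, delim)

raw = "name;city;score\nAnna;Turin;1,5\n".encode("utf-8")
for assumptions, table in hypotheses(raw):
    print(assumptions, table)
```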