Yeah, thank you very much for joining. We're talking about a problem in management reporting. I have to say upfront: it's not a fancy problem, but a very pragmatic and very common problem in companies today.

First, let me introduce what management reporting is. Companies need to be managed: leaders look at predictions and key KPIs, and management reporting helps them make informed decisions and check whether the data they receive is correct. It's very important for navigating the company and for strategic decisions. It's often pretty far from what we do as programmers or data scientists, but it's very number-heavy and crucial for a company's success.

The tasks of controllers are financial reporting, budgeting, forecasting and internal controls: how budgets are spent, whether the numbers they receive actually make sense, and following up when they don't fit the expectations. For example, if you have a shop selling Christmas merchandise and you expect higher sales of it, you would ask: why did we sell less Christmas stuff than expected? A possible answer, and it's not always that people made mistakes, could in the past years have been "sorry, corona". It's about understanding what affects sales and the success of the company. So there's a lot of number crunching involved, and data accuracy is absolutely critical.

On one hand, crunching some numbers doesn't sound that hard. The thing is, there's a lot of financial data from structured systems, but also tons of Excel sheets. In some companies there are thousands of different reports, and these reports are made by systems and by people. People who create these Excel sheets very often also want to make them visually appealing, for example to show them to their superiors, and that's why the structure changes very often. There is no quality process or predefined specification for how these reports have to look. We all know how we could do it better with today's technology, but these are processes that have been operational in companies for quite some time, and they also need to be addressed.

There are also small data sets, for example a file listing all the shops with their square meters and so on. It doesn't really make sense to build a database for that; sometimes it's just a simple Excel sheet. But it's important, for example, for comparing sales to the size of the shop in square meters, or, because it's becoming more and more important, for looking at the annual energy consumption of a shop per square meter. These are things controllers look into as well. So as you see, we have tons of requirements and a pretty open definition of what the data structures are.

The use case when we started: the process before was to go to the network drive, pick the reports, which are usually generated monthly or quarterly, pull them out, find anomalies and follow up on them. Finding anomalies basically means looking at the numbers, just reading them, looking for outliers, maybe doing some things in Excel. And following up basically means:
writing an email and waiting for feedback. On the upper left side we see the person who hands in the report, and down here is the controller, and it basically always goes like this: follow up, hey, there's a question; you get feedback by email; no, that was right, sorry, heavy weather, sales were better or worse for whatever reason. And everything is accessed in what I already flagged as a data silo: network drives with some folder structures and other things. So you see there's a lot of communication going on.

What are the problems here? Communication effort. Misunderstandings, because when people talk there are always misunderstandings. There's manual selection, pulling the reports out of the network drive. There's also a lot of human bias involved: for example, "it's a report from that shop again, they always mess something up", so there can be a strong bias. And of course it makes a big difference whether you read reports in the morning when you're really fresh or in the afternoon. So there's a lot of communication and misunderstanding going on here, and you see a cycle that can potentially never end until things are cleared up.

Our use case was: okay, we know this is not the optimal setup for collecting and processing data for reporting, not how we would imagine it if we built it new, but for the meantime we have to work with these reports until the company can establish a modern structure here as well.

Before, we had a manual review: identifying outliers and anomalies and following up. Where you see the robot arm, we can help by automating the identification of outliers and anomalies. We still have a human review process, so AI is just an assistant here as well, killing the boring stuff and taking over a lot of the heavy lifting, and the follow-up is automated too. As you see on the next slide, we just changed the middle part, and if you look at the bottom right, the controlling department already has way fewer process steps and things to look into. If there's an outlier from the expectation, the system automatically creates a message to the person who is responsible for the report, they can comment on it, and it's stored in a system, so when the final review comes we have everything collected. This improves the process a lot, because we look at the data with higher quality, we have the adaptations and comments saved in a central system and not just in email silos, and there's more focus on the controlling part.

So this is the use case we're talking about. What are the challenges? The challenges are that we are actually dealing with small data sets here, and there are many, many reports.
It would be too much work to pre-configure the expectations for individual columns of individual reports up front, so we had to find a way to automate this as well. There is dirty data: some columns have a lot of missing values, and that can be okay or not. A missing value could be perfectly fine, or it could be an error, and we had to find ways to navigate that. How can we build a system that gives the best advice to the controller and also enables them to consecutively build up a pre-configured system that knows more and more, for example what the expectation is for the data type or the numeric range? So we can have a system that constantly improves while being in production. And deliveries usually have short time frames. So our job was building a smart solution that can be optimized and configured all the time. I think we already covered the different formats and so on. It's also important that historic data is relevant for future predictions, and in this use case only about three years of it; it doesn't make sense to look much further back.

Now I would like to hand over to my colleague Lucas on how we handled it algorithmically.

Thank you, Alex. The structure of the following part is that I first elaborate a little on PDFs, which seem to be a common format, and we had them from our customers: is there a way to get this into our pipeline, or is it a completely useless format for machines? Then we will talk about methods for outlier detection and how we integrated them into our pipelines.

As an anecdotal story: PDFs are the most common exchange format between people when you want to make sure that what you send is what the other person receives, right? In that sense it's a really good format; it's common, and it's something stakeholders expect. The trouble is that it's a really bad format for exchanging information between machines, because it's not well structured. It's not a well-defined data structure in which you have your data types and then your columns of data; it's something that can look quite different for very similar data inputs.

What we have observed as a common scenario is that you have some raw data, some person does the reporting, does the processing and compiles the report, and one year later nobody knows where the raw data was or what the script looked like that eventually produced this PDF document.

What I've illustrated here is something typical: take a simple Excel table, pay 20 bucks for Adobe Creative Cloud, click on "Edit PDF", and you get a sense of how the PDF is structured. What you see is that there are boxes floating around on the page, and those boxes are not unified in any way. Here, for example, you have one box for the object code that spans all the data, and here in the middle you have a box that originally belongs to two columns; I think the reason is simply that the letters were close to each other, so it was exported as one box.

Now, one question is: we don't live in a perfect world and we don't want to wait for the perfect world. We have valuable data in our PDF documents, so can we make something out of them?
One Python tool that I would like to introduce, and that you possibly heard about earlier this morning, is Camelot. With Camelot it's relatively easy to read a PDF document when it's text-based. If it's scanned and needs OCR, then you're kind of lost; there's not much hope at this moment. With relatively few lines of code you can try to parse your PDF document and get a data frame out of it, for example (a minimal code sketch appears at the end of this passage). For the simple table I introduced, you can see that it works reasonably well. However, there are some cells where Camelot was not able to extract the information, and you then get a confidence report in addition to your actual data, should you be interested in it.

If you go to more elaborate PDF documents, like this one with merged cells, or where one cell is visually longer than the others just because there's no information in the middle and Excel lets the cell extend beyond its original margin, then things get a bit more troublesome. You see that here, too, there are minor points where we would have to do manual rework, and again two columns got concatenated into one.

So what's the summary? I would say it's an approach that doesn't scale, but if you have really interesting data that you want to incorporate into your pipeline, it might be worth the effort. And of course, with the recent advent of large language models, maybe there will be a change in how we do this. For example, I could imagine that if you combine a strong general-purpose visual model with a large language model, you could actually do all those things even on ill-structured formats like PDFs. If you need OCR it's really even more difficult, so I don't think it's feasible at the moment.

What I would like to do now is give you some idea about outliers and anomalies and the methods that are commonly used to detect them, to get insights into whether the data has errors, whether there's some fraud, or whether there's maybe some novelty or anomaly in your data.

First of all, an outlier is commonly defined as something that differs significantly from other observations. That's a broad definition, but it fits well. The sources of outliers and anomalies are usually chance, so it could just be a statistical effect; a measurement that is incorrect, for example a broken sensor if I think about physics; or some novelty that has a cause, so it's not an error but something you've never observed before.

For outliers we need to know that context is important. For example, look at this 2D Gaussian distribution and some samples drawn from it. Normally you would say: this point can happen, but it's really, really unlikely to observe it. So this is an outlier that is defined just by looking at its value compared to the rest of the values from the same distribution. But if you look at this violin plot of a distribution, normally you wouldn't call this point an outlier; yet if you put it into the context of the rest of the data, which in this case is a time series, you see that given that context it should indeed be classified as an outlier.
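Coming back to Camelot for a moment, here is a minimal sketch of the extraction step just described. It assumes a text-based PDF; the file name and page number are placeholders, not from the talk.

```python
# pip install "camelot-py[cv]"  -- works on text-based PDFs only, not on scans that need OCR
import camelot

# "report.pdf" is a hypothetical management report
tables = camelot.read_pdf("report.pdf", pages="1")
print(f"{tables.n} table(s) found")

table = tables[0]
df = table.df                 # the extracted table as a pandas DataFrame
print(table.parsing_report)   # accuracy/whitespace metrics, i.e. the "confidence report"
print(df.head())
```

As described above, merged cells and concatenated columns typically still need manual rework after extraction.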
To come back to the point about context: if you have a solar power plant and you observe super high power output in the night, the power value alone doesn't tell you whether it's an outlier or not, but the context gives you the idea that something might be off.

Okay, with this definition established, we want to go over the methods, and start with one that is particularly useful if you have a lot of knowledge about your business data, an expert-system kind of setting: rule-based methods. A few years ago this was a common thing even in computer vision, where people really designed handcrafted features, and of course if you work with tabular data it's a legitimate tool. If you have business knowledge and you know that certain things need to be in a certain range, or that certain events can or cannot happen, then rule-based methods are a legitimate tool. The trouble is that if you have many of them it gets quite complex and unmaintainable, and the execution order can matter for whether you flag something as an outlier or not. To illustrate this with an example: if you say "I run my sprinklers every day at nine" and then it's raining for two weeks, a hard-coded if-else statement might not be a good way to classify something as a novelty. To summarize, rule-based methods have the strong benefit that they're really easy to explain, and explainability is something that other methods, for example the learning methods, usually lack; it's really important when you communicate with your colleagues. They're reliable in their environment and they're simple. On the other hand, they're hard to maintain, they're quite rigid, and their scope is limited.

If you then go to more recent methods, one common example would be an isolation forest, quite similar to a random forest. The assumption here is that outliers end up high in the tree, because they are isolated after only a few random splits, whereas inliers, the normal data points, end up lower in the tree, since points in dense regions need many more splits to be separated from their neighbours. (A minimal sketch follows at the end of this passage.) There are also support vector machines that you can train in a one-class fashion: you have all your normal data points on one side of the decision margin and your potential outliers on the other side. For support vector machines it's crucial to think about which kernel you want to use. Moreover, there are autoencoders, where the idea is: I have an input, an output and a bottleneck in between, and it's harder to map an outlier through the bottleneck, because the assumption of an autoencoder is that your real data can be mapped onto a lower-dimensional representation. If you have something that cannot be explained that way, you need more information than fits into the bottleneck, so in that sense an autoencoder is a tool you can use to detect outliers.

With the recent advent of deep learning we now also have quite powerful models whose training procedure is getting more and more standardized.
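As a minimal sketch of the isolation-forest idea mentioned above, here is scikit-learn's IsolationForest on made-up revenue figures; the numbers and the contamination rate are illustrative assumptions, not values from the talk.

```python
# Toy data: 200 "normal" monthly revenue values plus one suspicious entry.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
revenue = rng.normal(loc=100_000, scale=5_000, size=(200, 1))
revenue = np.vstack([revenue, [[250_000.0]]])   # the anomalous month

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(revenue)             # +1 = inlier, -1 = outlier
print("flagged rows:", np.where(labels == -1)[0])
```

A one-class SVM (sklearn.svm.OneClassSVM) or an autoencoder could be slotted into the same place; the kernel choice or the bottleneck size then becomes the main tuning knob.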
Coming back to deep learning, there are all those libraries like PyTorch, PyTorch Lightning, Keras and so on that allow you to train really powerful models in an ever simpler fashion. But here, of course, the trouble is that you often need to do manual annotation, deep learning models in particular are notoriously overconfident, and with all models domain robustness is an issue: just because a model worked in one environment, namely the training environment, doesn't tell you much about an unseen environment.

Then, of course, there are statistical methods. You can simply look at the difference of a data point to the mean, or, if you want to compare distributions, you can use for example the Kolmogorov-Smirnov test, where you compare probability distributions by looking at how far apart the cumulative distribution functions are and get an estimate of whether the distributions are similar or not. Then there are clustering algorithms, but since I need to speed up a little, I want to talk about what you do if you don't have any information about your data.

What we found useful in our case, for the controlling reports, are simple things like this: your data frame says a column is a string, but if you look at the data inside, you notice that pandas only converted it to a string because one data entry had a white space somewhere. Those things sound really simple, but they are exactly what you should look at when you want to consolidate such a reporting pipeline.

I also indicated fuzzy strings at the bottom, and this kind of metric I found really useful. There are many ways to put a number on how similar strings and texts are. What you can do with it, for example, is clean up categories: if you see that most of your data is in, I don't know, four categories and then there are a couple of others, you might look at the string similarity, notice it's just a typo that occurred, and clean up your categories like this. In our context we found string similarities useful, for example, when we noticed that the customer had very similar invoice numbers, so the string similarity was really striking, and the amount of money transferred by the bank was the same. In this case we could nail down that it was actually a duplicated payment: people didn't know that the invoice had already been processed, but we could show it through the really similar invoice number and the same total, and save some money there.

Just to conclude this: one such heuristic is, for example, the Levenshtein string distance, which simply measures how many single-character edits you need to apply to get from "appel" to "apple"; in this case it would be two, because you have to change the e and you have to change the l. This is a common metric, there are a few others, and there are also metrics that go more into the semantic context of a longer text. (A small sketch follows at the end of this part.)

What I also want to stress is that data splits are really important, particularly for powerful models. It's really important to know how I split my data and whether I introduce data leaks. If these are all my data samples and the yellow ones are my test set, then possibly the yellow one and the blue one next to each other are related, and what I'm doing is leaking test data into my training set, and then I don't know how well my model really performs.
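Picking up the string-similarity heuristic from above, here is a minimal sketch with the rapidfuzz library; the talk does not name a specific library, and the invoice numbers and amounts are invented for illustration.

```python
# pip install rapidfuzz  -- one of several libraries for string distances
from rapidfuzz import fuzz
from rapidfuzz.distance import Levenshtein

# Two single-character edits turn "appel" into "apple"
print(Levenshtein.distance("appel", "apple"))   # -> 2

# Hypothetical invoice numbers: a near-identical ID plus an identical amount
# is a strong hint at a duplicated payment
inv_a, inv_b = "RE-2023-04711", "RE-2023-04717"
amount_a, amount_b = 1499.00, 1499.00
similarity = fuzz.ratio(inv_a, inv_b)           # similarity score from 0 to 100
if similarity > 90 and amount_a == amount_b:
    print(f"possible duplicate payment: {inv_a} vs {inv_b} ({similarity:.0f}%)")
```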
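And to make the data-split point concrete, a minimal sketch with scikit-learn's GroupShuffleSplit; grouping rows by shop is a made-up example of related samples that must not be split across train and test.

```python
# Rows that belong together (e.g. monthly figures from the same shop) must stay
# on one side of the split, otherwise test information leaks into training.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(12).reshape(-1, 1)                        # dummy features
y = np.arange(12)                                       # dummy targets
groups = np.repeat(["shop_a", "shop_b", "shop_c"], 4)   # which shop each row belongs to

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No shop ends up in both sets
print(set(groups[train_idx]) & set(groups[test_idx]))   # -> set()
```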
So it's really important to be aware of this.

Okay, in summary, I've introduced what outliers are; they are really important for detecting errors, fraud and significant business events. Now I would like to show you a bit of how we integrated this into code and how we got our fingers dirty.

We mainly use pandera as the outermost layer. Pandera is a tool that lets you validate data, similar to pydantic for example, but on data frames. You can do type checking, you can check against custom rules, and you can also do statistical checks. You can specify this in a dictionary-like specification, or, if you're already familiar with pydantic and data classes, you can use their schema models, which is really easy. What you do is: you have, for example, a data frame that you want to validate, which should have year, month, day and revenue, and then in this schema-model fashion you just specify what your data frame should look like. This is the input schema: the year should be greater than 2000, and so on, and you can coerce it to the data type, for example if it should be interpreted as a string first, in which case pandera will try to convert it from a string to an int. What you can then do is take some function, decorate it, add your type information, and pandera will check and validate the data frame.

What happens if it's not correct? It throws an error. For example, here it would say: at index two in my data frame I have a failure case. Or you can do lazy validation, in which case pandera collects all the errors it found for one data frame and summarizes them in a kind of error data frame. We integrated it into the controlling case by having global checks, meaning checks that we found useful regardless of the data and the concrete report we observed, and then also checks on report level and on column level, to really have a modular system in which we can specify all the checks.

One small outlook that we want to give is skrub; maybe some of you have heard of dirty_cat. It's a tool for when you have dirty categories, as the old name suggests, and it allows you, for example, to get a fast vectorized transformation of your data frame to use in machine learning models. skrub is not yet released, but it's a work in progress, there's already some information available, and I think you can try it out.

Good. In summary, our solution gives the customer something like this; it's just an output format in our case. The controlling department gets, for a given report, some information such as how many failures there are, and the failures in the data frame are listed and also indicated visually. For example, here we would observe that the opening date is in the future but there is already revenue, which is an anomaly in our case.

Okay, and in conclusion: we have shown you that outlier detection and solving those small-data problems in controlling is a very important task, that it's important not to assume a perfect world when you get into your project, and we have shown you a couple of methods for detecting outliers and how they can be integrated. Thank you.

Thank you, Alexander and Lucas, for such a great session. We are out of time, so we can just take one question if anyone has one; otherwise the speakers will be around for any further questions. No one seems to have any questions. Okay.