Thanks for having us here, and for feeding us, that's very nice. This is actually my first talk in Singapore. I'm Alex, a data scientist at a company called Dataiku.

Just so I understand your interests a bit: could you raise your hand if you're a data scientist? Okay. Could you raise your hand if you're more a manager interested in data? Okay, wow, quite a fair number. And for those who didn't raise a hand: are you more business analysts, or data engineers? Okay, I guess we have a majority here. Just a disclaimer: it's a quite technical talk, but it has some business value, so the business people should be happy. It has some technical details on computer vision; "computer vision" is a very old-fashioned term, sorry about that, but it's actually still a thing today. And don't worry, we'll also talk about the new thing, the brave new world of AI.

How many of you have heard about the company Dataiku? Okay, a few hands, so I'll take three minutes to introduce us and explain why we're speaking to you tonight. We are a software company, and we've been trying to solve four issues for the past six years. The first one is that there aren't that many data scientists (I could see that in the room tonight), so we're trying to make data scientists more productive. The second is that the most time-consuming activity is data preparation, which, as a data scientist, I find rather tedious; you cannot do without it, but you can be faster at it. The third is that if you work with data, you know there are a lot of cool technologies, but they're all made by open-source researchers who all have different goals, and combining them can be a challenge; so our platform is built on different open-source technologies (to name-drop a few: scikit-learn, Hadoop, things like that), and we want to make it simpler for all of them to work together. The last one, which nowadays, six years down the road, is our main focus, is getting all these models, these nice Jupyter notebooks, beyond a nice Kaggle leaderboard and into the production systems of companies; and we're talking about companies with quite established, read: ancient, production systems. So how do you integrate the brave new world of AI with the world of production?

You may wonder what "Dataiku" means. It's a French company, by the way, and for some reason French people are really big fans of Japanese culture. Here are a few metrics about us: when I joined we were around 50 people, we're almost 200 now, and we're still hiring; if you're interested, you can speak to me after this.

Tonight I'd like to talk to you about a practical use case we worked on using our software. And I'm going to stop the advertising right there: there's nothing specific to Dataiku from this slide onwards. We're going to talk about a universal problem: mail processing. Have any of you worked with mail processing or OCR before? One of you, okay. Some of you might say it's actually a solved problem: there are all sorts of OCR systems to scan letters, and there have been research and production systems for this since the '50s. But it started with pretty specific, simple use cases, like "I want to recognize the postcode" or "I want to recognize the number on a bank check, in a specific square that is identified within the check". And maybe you're going to tell me, "I have an iPad Pro", or "I had a Palm Pilot": the Palm Pilot had optical character recognition, you could write on it with the pen and it would recognize the letters. (The Palm Pilot is about 25 years old; I didn't have one, my dad did.) But these are pretty well-defined, narrow cases of character recognition, because you're making assumptions: that the fonts look a certain way, or that only digits are written in this square box, or that the input is online, so you know where the separation between the words is.
If you want to do general mail processing, so letters, anything at all, it's still an ongoing research problem, so we tried to have a shot at it. This is the kind of before-and-after analogy: this is very real, and this is very much today, even though it's a black-and-white movie. This is what it looks like: a company receives hundreds or thousands of letters every month and has to triage them to get them to the correct service, because when you write to a company, you don't really think "I'm going to send this to the customer service of this business unit"; you just write to the company, and the person in charge of the mail triage has to deal with that.

We built this pipeline for an insurance company, a pretty recent one, ten years old, 200 employees, which was receiving a fair number of letters: between 800 and 2,000 letters a day. Business value: this cost them a hundred thousand euros per year, because they were outsourcing it to another company whose sole job was to open the letters and determine which business unit of the company to send them to. Was it accounting, was it marketing, was it customer service? And there were humans behind it. So our task was to automate this repetitive task, and as a data scientist it makes me quite positive about the future of AI, because at the end of the day it's really about taking this tedious work away from people so they can focus on more important things. I hope your phone's okay. How are we on time? Ten minutes, okay.

So I will dive, but not too far, into the different parts of the pipeline. Essentially we had to deal with four challenges: first, how do you differentiate between handwritten and typed letters; second, how do you deal specifically with the typed ones; and then how do you deal with the handwritten ones, where there is even the issue of detecting the words, and then getting from the scans, so images, to the actual words in our alphabet, so symbols.

What did the raw data look like? We were promised a rather clean mix of identified typed letters and handwritten letters, but that was us before we received the data; what we actually received was not such a clean mix. There were a lot of things in between the handwritten and the typed ones, because the outsourcing company had no stake in the process; they were just paid to do it, so they would receive a letter and scan literally everything. If they received a hundred-page leaflet, they would scan all of it; they would even scan the envelope. So we needed to separate all these documents first. We built a web app (you can build web apps in our product) for the purpose of labeling, so we actually labeled the data ourselves, and then we used a pretty old deep learning technique called an autoencoder.

Who knows what an autoencoder is? Okay. Very simply, if you know a bit of the theory about neural networks, the trick is that the training target of the network is the input image itself, so RGB here; the layers narrow down, and then you try to reconstruct the image. Essentially, an autoencoder is a compression mechanism; it's exactly that. If you take the middle layer, you get what we call a latent-space representation: you go from RGB, so three channels times the size of the image, around a million dimensions, to a much smaller latent space, which is a good representation of the image. That's what we mean by compression. Here are examples of what it looks like: this is the real image, and this is the decoded version (there is an encoder and a decoder). You can see the decoded one is quite blurry, because it carries less information, but that's fine: thanks to this reduction you go from a million dimensions to a few hundred, and you can use these few hundred dimensions to train a traditional machine learning model.
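To make the idea concrete, here is a minimal, hedged autoencoder sketch in Keras; the layer sizes, the 128-dimensional latent space, and the 64x64 downscaled grayscale input are illustrative assumptions, not the real pipeline:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical sketch: compress a downscaled scan (64x64 = 4096 dims)
# into a 128-dim latent space, then reconstruct it. Sizes are illustrative.
input_dim, latent_dim = 64 * 64, 128

inp = keras.Input(shape=(input_dim,))
h = layers.Dense(1024, activation="relu")(inp)
latent = layers.Dense(latent_dim, activation="relu", name="latent")(h)  # bottleneck
h = layers.Dense(1024, activation="relu")(latent)
out = layers.Dense(input_dim, activation="sigmoid")(h)  # reconstructed image

autoencoder = keras.Model(inp, out)
encoder = keras.Model(inp, latent)  # use this to extract the compressed features
autoencoder.compile(optimizer="adam", loss="mse")

# Train unsupervised (input == target), then feed encoder outputs to a
# classic classifier, e.g. scikit-learn's RandomForestClassifier:
# autoencoder.fit(x_scans, x_scans, epochs=20, batch_size=64)
```

The point of the bottleneck is exactly the compression described above: `encoder.predict` turns each million-pixel scan into a few hundred numbers that a traditional model can learn from.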
Here you have the confusion matrix, which is how you assess the quality of your algorithm, for three classes: "typed" (something typewritten), "manuscript" (handwritten), and "other", which is an envelope or anything we don't want to deal with.

Then, how did we deal with the typed case? Thanks to all the research that has been going on since the '50s, this one is pretty much a solved case, or at least 90% solved. There's something cool, used by superheroes and by data scientists, called Tesseract, and Tesseract is literally three lines of code. The library originally came from a team at HP and is now maintained by Google, and you can use it from Python. Apologies, because the example is very French, but if you could read French you would see it's an almost perfect extraction; the only thing that didn't work too well is the signature, because the signature is handwritten. Very quickly, the quality of the classification for typed letters became very, very good, just based on those three lines of Tesseract.

But now let's talk about a harder problem. Because it was harder, and because they are very proud of it, the research community doesn't call this optical character recognition; they call it intelligent character recognition, ICR. Again, this exists in the research; we didn't invent an approach, we just adapted one to our problem. First, we extract boxes with traditional computer vision methods, which are still very much relevant and important today; then, from these boxes, we apply newer deep learning methods. Who has heard about computer vision? Okay, I hope this doesn't lose you too much; I'll be happy to take questions, and I'll stay around afterwards. A couple of techniques we used: something called a cross-shaped dilation kernel to extract the paragraphs, and then, to detect the words, the same kind of convolution to look at the vertical density.
If you look at the pixel density (we went from RGB to black and white, because at this stage it's mostly about contrast), you can see that, boom, boom, boom, the density drops between words, so we used that to separate them. We had a lot of tweaking to do, but we managed to get it working.

Now, I promised some AI, so what about the real deep learning here? We used a combination of the two layer types most commonly used in deep learning: convolutions and long short-term memory cells, so recurrent neural networks combined with convolutional ones. CNNs are traditionally used for image processing and LSTMs for natural language, so it's pretty natural that, to get from an image to a word, you combine the two: from the input image you extract features with a CNN, you reshape, and then you have a stack of LSTMs whose output layer is a word. We used Keras and TensorFlow, which are integrated into our platform, to deploy that architecture, which is pretty deep; I'm just dropping it here.

There is one very hard issue with this problem, and it's pretty simple: some people write an "a" like this, and some people write an "a" like that, but the CNN handles the image in blocks of fixed size. CTC (connectionist temporal classification) is the way you get these blocks of ten pixels down to one word: it collapses something like "a-a-p-p-l-e" to the single word "apple", and it's really smart. If you want to read one paper on this topic, this one is a really great read.

We didn't have enough training data, so we combined a lot of different sources, which was a bit of a challenge, and we applied an interesting trick called curriculum learning: you first teach the network words of four letters, then five letters, then six, and so on. If you throw ten-letter words directly at your network, it won't really work. That's why you get a learning curve that looks like "I'm learning the four-letter words... now five, oh, five is hard..." and then getting better.
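As a rough illustration of the CNN + LSTM + CTC combination, here is a hedged Keras sketch; the layer sizes, the 32x128 input, and the 80-symbol alphabet are my own assumptions, not the actual architecture from the talk:

```python
from tensorflow import keras
from tensorflow.keras import layers

num_chars = 80  # alphabet size + 1 CTC blank symbol (assumption)

inp = keras.Input(shape=(32, 128, 1))                    # grayscale word image
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
x = layers.MaxPooling2D((2, 2))(x)                       # -> (16, 64, 32)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 1))(x)                       # -> (8, 64, 64): keep width
x = layers.Permute((2, 1, 3))(x)                         # width becomes the time axis
x = layers.Reshape((64, 8 * 64))(x)                      # 64 time steps, 512 features
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
out = layers.Dense(num_chars, activation="softmax")(x)   # char probs per time step

model = keras.Model(inp, out)

# Training would minimize the CTC loss, e.g. with
# keras.backend.ctc_batch_cost(labels, out, input_length, label_length),
# which sums over all per-step alignments that collapse to the target word.
# The curriculum-learning trick then becomes a staged fit (hypothetical helper):
# for max_len in (4, 5, 6, 7, 8, 9, 10):
#     subset = words_with_label_length_up_to(max_len)
#     model.fit(subset.images, subset.labels, epochs=2)
```

At inference time, `keras.backend.ctc_decode` performs the collapsing step: it merges repeated characters and drops the blanks to turn the per-step probabilities into one word.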
So you see this staircase shape in the training curve, and you then converge to a network that more or less understands words.

In the end it was a lot of work, but to give you an idea of where we got: out of a hundred letters (we put the envelopes and so on out of scope), across both handwritten and typed, we were confident enough to say "we know" for about 78% of them, and out of that 78%, we were able to correctly classify 90% of the cases. So in 90% of the cases we could say: this letter should be sent to customer service, or to marketing.

Some concluding remarks. It was a tough but really fun project that we put in production; actually I should say, on my side, we deployed pretty recently at our client, so they will now be able to compare it with their human process. There are a lot of gritty technical details we didn't talk about, and I hope this project will generate some natural byproducts which are even more interesting, like natural language processing. And a quick word: if you have an interesting project like this, drop us a line; we have a team of data scientists who are interested in looking at deep, compelling, and risky use cases like this one. Thanks, and I think we have a few minutes for questions.

[Q&A]

It was actually lower, but because there is a 20% share where we are not sure, we still ask them to have a look at those. We'll be able to really compare once they have had it in production side by side for a few months, but for the ones where we're sure, 90% is a few percent higher compared to what their team had.

That's a good question. In my opinion, as often as possible, but the real issue is how you get the labels to provide that feedback. At least on a weekly basis, or ideally daily, if you can build software, like a web app, to get that feedback back to the model.

Okay, I think I have it here. The accuracy on the manuscript side is actually lower than for typed letters,
if you do the math here. And the real issue is that there are a few cases where our models don't have high confidence, so we use ranges: when we have a strong score, we say you can bypass the human; otherwise, we say we still need a human in the loop.

Also, these numbers are for another problem, the triage of letters, not for recognizing the characters correctly. Recognizing characters correctly is actually pretty simple; nowadays systems are at 99% accuracy or something like that. But here there is room for interpretation, because even if you extract the words correctly, sometimes it's not very clear whether a letter should go to accounting or to marketing; there can be some blurry lines.

It depends on which layers. Normally, the first time, you train all the layers, but then there are some layers you can freeze, and you only retrain the last ones. If I remember correctly, the full pipeline was taking maybe a day to train, something like that. Today I think we're trying to do that on a monthly basis. The project is in a testing phase now: we have tried it once, we want to see whether it works in the side-by-side comparison, and after that we'll have a go/no-go on the frequency of retraining. Ideally, daily would be the best thing, because the triage process is the way we get the labels anyway. There are also cases where people make mistakes: the outsourcing company sends a letter to the wrong department, which then has to correct it, and that's when you get the true label. So there is a time component, due to the human process, in when you get the true label.

Yes, we did use many data augmentation techniques; I dropped the names here. We didn't have enough training data to start with, so we had to do a lot.
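A tiny hypothetical sketch of this confidence-range routing; the 0.9 threshold and the department names are invented for illustration:

```python
# Route a letter automatically only when the model's best score clears a
# confidence threshold; otherwise keep a human in the loop.
def route(scores, threshold=0.9):
    """scores: dict mapping department name -> model probability."""
    best_dept = max(scores, key=scores.get)
    if scores[best_dept] >= threshold:
        return best_dept          # confident: bypass the human
    return "human_review"         # uncertain: send to manual triage

print(route({"accounting": 0.95, "marketing": 0.03, "customer_service": 0.02}))
# -> accounting
print(route({"accounting": 0.40, "marketing": 0.35, "customer_service": 0.25}))
# -> human_review
```

The manual-triage bucket is also where the true labels come from, which is what makes the retraining feedback loop possible.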
Thank you.