Hello, I'm Jacob. Thanks for coming, first of all. I want to talk to you today about quality assurance in deep learning datasets. I'm a senior AI researcher at Collabora, so let me say a few words about them, because they sponsored this work. We are a consultancy: we provide fully integrated open source solutions for whatever products you may be building. We cover multiple domains, from multimedia to bootloaders, the kernel, and augmented reality, where we lead the Monado project. We lead the development of several important technologies, and we can also help you get your changes merged upstream, so you don't have to keep spending resources on your vendor branches forever. So that's what we do.

We also have an AI team. It's small, but it's good, and we encountered this problem: labeling is a very difficult task, and it absolutely requires a quality assurance process. You cannot just label data and assume it's going to be okay. Contrary to a lot of research you can find, not all errors average out and can be ignored. If the errors in your data are systematic, those biases will transfer to your model, and it will make the same mistakes that were in your dataset. There is a lot of research on random errors in deep learning networks: if you train for long enough, random errors will average out and the model will be fine. But if your errors are systematic, this is not the case.

We think the problem is really that people are doing QA, but the tooling for it is not amazing. We looked into a couple of solutions for labeling: we checked open source tools ourselves and called a few companies that sell labeling tools, and most of them have only minimal support for review. What this means is that the QA process is actually more difficult and expensive
than the labeling process itself. Here's an example, a quote from Keymakr, a leading annotation services provider. They were recommended to me by friends who were very happy with them. They say that annotations are reviewed four times in order to confirm accuracy: two annotators label a given object, and then their supervisor checks both annotations and decides whether they are good or not. That's a lot of work.

So here's an example from a high-quality dataset. Can you spot the mistake? Not really. Okay, let's make it a little bit bigger. Is it easier to spot the mistake now? Imagine you have to do this for tens of thousands of photos. That's not an easy way to do it. The question is: can we do better?

There are a couple of ways to approach this problem. A lot of methods I've seen use really clever statistical techniques to reveal the errors: you train some model, look at the outputs, and try to figure out which of them are unlikely to be correct. This works, but we found there is a surprisingly easy way, completely complementary to those methods, that makes the quality assurance process a lot easier. Technically this image has all the information you need to find the error, but it's still pretty hard. Our key insight is to transform the problem by bringing similar things together, which makes it easier to spot the error. Now it's pretty easy to see that among these "60" speed limit signs there's one that says 30. This error is much easier to spot now. If we go back, you can see that the sign showing 30 was actually labeled "limit 60". But how would you spot that before? You would have to spend an extraordinary amount of time on it.
So the key insight is that if you show similar images together, whatever you want to do with the data becomes so much easier. It reduces the cognitive load, because you can focus on one thing at a time, on one class, instead of juggling all 400 categories on every photo. It also highlights the real-world variety of the samples: you can see a big sample of a particular class from your dataset and see what different sub-clusters it contains, what you're actually looking for. In the case of traffic signs that may sound trivial, but it's not, and there are much more difficult cases where it's genuinely hard to figure out what a label really means, especially if you're not deeply familiar with the data yet.

One interesting thing is that you can do this if you have labels: if you have a labeled dataset with bounding boxes, for example, you can crop them out and sort the crops. You can also do it with AI for unlabeled data, which I will get back to in a moment. On the right you can see an example from the DeepFashion2 dataset, sorted by visual similarity by our algorithm.

Another example: these are random images of ambulances from ImageNet, the ambulance class. You can see we got all different kinds of images: the inside of an ambulance, people inside an ambulance, ambulances from the outside, a military ambulance or whatever that is. If we sort them, it's much easier to focus, because similar poses and similar things are shown together, and it's much easier to review them and see if they are correct or not.
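That crop-the-labeled-boxes step can be as simple as array slicing. Here's a minimal numpy sketch, assuming boxes come as (x, y, w, h) in pixels; real annotation schemas vary, so treat the format as a placeholder:

```python
# Cut the labeled regions out of a frame so the crops, not the full
# photos, can be sorted by visual similarity for review.
import numpy as np

def crop_boxes(image, boxes):
    """image: H x W x C array; boxes: iterable of (x, y, w, h) in pixels."""
    crops = []
    for x, y, w, h in boxes:
        crops.append(image[y:y + h, x:x + w])
    return crops

frame = np.zeros((100, 200, 3), dtype=np.uint8)
crops = crop_boxes(frame, [(10, 20, 30, 40), (50, 60, 20, 20)])
print([c.shape for c in crops])  # -> [(40, 30, 3), (20, 20, 3)]
```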
I believe there are no errors among the ambulances in ImageNet, but I may be wrong. I spent quite a bit of time trying to find errors there and found none. There was, however, an error in the red wolves: a lion cub somehow got into the red wolf class.

But let's go back to traffic signs, because that's also an interesting case study. We looked at the Mapillary Traffic Sign Dataset, a high-quality dataset released by Mapillary, which is currently a subsidiary of Meta. So it's a well-resourced, high-quality dataset, and if you read the paper, they really did a lot of work to make it good. We estimated the error rate on a few easy classes, like the speed limits, and it's at least 1%. I haven't reviewed everything yet, but it's at least 1% in the three classes I checked very thoroughly. Whether that's a lot or not, we'll get back to in a moment.

Here you can see the 80 speed limit signs, and one stands out: it's actually a 90. Same here, in another part of the 80 speed limits: there's a "4.0" speed limit, and I don't know what that means. Here's another interesting one; there are actually two errors in it. So these things happen in the real world, even with a lot of careful work to avoid mistakes. It just happens.

We also did another thing. If you look at the dataset paper, they have a class called "other sign", which is mostly information signs, like the ones on a highway telling you which way to go. Those don't have a dedicated label; there's no label that says "this is the way to Austin", because there would be too many labels. So they have this catch-all class called "other sign", and it's actually the most common class in the dataset.
Something like 90% of all the traffic signs are "other signs". We took all the ground-truth crops from the Mapillary dataset and trained a simple classifier. Then we selected 700 crops that were labeled "other sign" but that our model believed were regulatory no-entry signs. Among those 700 images, 170 were no-entry signs that were actually mislabeled. So that 1% error estimate is actually quite low. And this is going to confuse the model: with so many examples in such a numerous class, the model doesn't know whether it should predict "other sign" or "regulatory no entry", and it will make mistakes statistically, in proportion to those errors. I think this confirms that it's a serious issue: even with a lot of resources put into a high-quality dataset, it may still turn out that you have errors in there.

We did another case study, and this one is a little funny. We created a dataset of men and women in t-shirts of various colors using image search. We went to Bing image search and typed things like "a woman in a black t-shirt". That gave us 20 classes (10 colors, men and women), 20,000 images, and an 80% error rate. Okay, what does that actually mean? There are some genuinely difficult cases, like whether something really counts as a t-shirt, and if you drill down into the samples you can learn a lot more about your data than you would imagine. I didn't expect there to be so much variation in t-shirts. But there's also stuff like a photo of a t-shirt with no person in it at all.
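To go back to the Mapillary cross-check for a second: the selection step there, keeping samples whose ground-truth label disagrees with a confident model prediction, can be sketched like this. The class ids, names, and confidence threshold below are made up for illustration:

```python
# Disagreement mining: surface samples labeled as one class that a
# trained classifier confidently assigns to another, for manual review.
OTHER_SIGN, NO_ENTRY = 0, 1  # hypothetical class ids

def suspects(labels, preds, confidences, min_conf=0.9):
    """Indices labeled 'other sign' that the model confidently calls
    'no entry': these are the candidates worth reviewing by hand."""
    return [i for i, (y, p, c) in enumerate(zip(labels, preds, confidences))
            if y == OTHER_SIGN and p == NO_ENTRY and c >= min_conf]

labels = [0, 0, 1, 0]
preds  = [1, 0, 1, 1]
confs  = [0.95, 0.99, 0.80, 0.50]
print(suspects(labels, preds, confs))  # -> [0]
```

Reviewing only this filtered set is what turned 700 candidates into 170 confirmed mislabels in the talk's experiment.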
And I specifically asked for a person in a t-shirt, not a t-shirt on a hanger. You also get other pieces of clothing, you get dresses, whatever. So we get a lot of errors, and we did a lot of splits; 80% is a pretty accurate error estimate.

Why is this important? It may seem like a trivial thing to do, going to image search. But a lot of people are excited about DALL·E, and DALL·E, this image generation model, is essentially trained on image search data: they scrape the internet, take text and the images around it, and build an enormous dataset from that. It's not that they're not trying, but you get an 80% error rate, so the model has to somehow figure it out, and I doubt that it is not affecting the quality of the results. I'm pretty sure we're not going to fix billions of images with our tool, but these are important problems.

So let me do a quick demo and show you what it actually looks like to use the tool we came up with. Here's an example: this is the DeepFashion2 dataset. (Yes, sorry, I hadn't mentioned that. Thanks.) I don't have a product to sell you; there's no enterprise subscription. It's just an open source project that we thought would be useful. So here are some examples. There's a lot of variety here: this is the full DeepFashion2 dataset without any sorting, so you can see how the interface works. Now we can switch to our sorting method, and you can see that we get a lot of very similar images grouped together. If you wanted to, for example, exclude images that have a piece of paper in the background, you can just drag and select them quite quickly. I have difficulty seeing the screen from here, but you get the idea, right?
And here's another one, with a lot of visually similar photos. I've done a lot of experiments, and I find it much easier to make a decision when I have a lot of similar photos in front of me. If everything is completely random and I get a completely different photo every time, I have trouble keeping in mind the five or six rules I want to apply to decide whether an image is good or not. It's actually a genuinely difficult cognitive problem, I think. Or maybe I'm just stupid.

Right now you can download a JSON file with all the markings you made; we'll come back to that. So that's the idea: that's how you can quickly scan through a lot of images and remove the ones that have errors in them. I know this looks pretty simple, but we really searched and didn't find any other tools that do this. Maybe you'll tell me they exist, but I'm pretty sure there are no open source ones.

Okay, so let me spend a few minutes telling you how this works. You can imagine how the interface works: it's a browser interface, and browsers are really good at showing people images, so that's a good platform for such a tool. On the back end we do a little bit of deep learning to do the sorting, and I think it's kind of interesting.

We start by pre-training a ResNet-18 model with Barlow Twins. There's actually a paper, from Google I believe, saying that the simpler the model, the better it is for visual similarity. They recommend an even smaller architecture, but I would have had to implement that from scratch, so I stuck with ResNet-18. It trains very quickly, and we don't want to train it for long anyway, because it starts to get worse. As for Barlow Twins, you may not be familiar with it.
It's a contrastive-style learning method. We take a batch of images, say 16 or 32, distort each one in two different ways, and then train the model to output correlated features for the two distortions of the same image, in contrast to all the other images in the batch, which are not of the same kind. So it's basically one against everybody else, and it works quite well. There are multiple methods that do this; Barlow Twins is nice because it's very simple, but there are others based on the same idea and you could use any of them. The nice thing is that this is completely unsupervised: you don't need any labels, you can apply it to any photos you have.

Then we take the feature vectors, but not the final ones. Normally a ResNet outputs a single vector of, say, 256 or 512 floating point numbers. That doesn't work that well; we tried it, and it was not amazing. So we go back a little and take the second-to-last layer, which for 224 by 224 images gives you a 14 by 14 grid of vectors, each containing 256 floating point numbers. Then we do clustering: we take k-means, a standard unsupervised clustering algorithm, and ask it for just 1024 clusters. So each of these long feature vectors gets compressed to a single number that says "this is cluster number 500", for example. On the right you can see that it works quite well. The t-shirt, for example, is a single cluster; each color represents a different cluster. The hands are clustered together, both the left and the right one, and so are the arms.
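Going back to the pre-training step for a moment, the Barlow Twins objective itself is only a few lines: push the cross-correlation matrix of the embeddings of two distorted views towards the identity. This is a numpy stand-in (the real training loop, the augmentations, and the ResNet-18 backbone are omitted; `lam` is the usual off-diagonal weight):

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3):
    """z1, z2: (batch, features) embeddings of two views of the same batch."""
    # standardize each feature over the batch
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-9)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-9)
    n = z1.shape[0]
    c = z1.T @ z2 / n                              # cross-correlation matrix
    on_diag = ((np.diag(c) - 1) ** 2).sum()        # diagonal pulled to 1
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # rest pulled to 0
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 8))
# identical "views" give a near-zero loss; an unrelated batch does not
print(barlow_twins_loss(z, z) < barlow_twins_loss(z, rng.normal(size=(32, 8))))  # -> True
```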
The jeans too, and the face is apparently also something similar for the model. So this works quite well. You do get some strange artifacts here; I actually dug into it a little, and it looks like ResNet has edge artifacts: the features you get at the edges of the image are quite different from what you get in the middle. I have a notebook that lets you explore that, but that's not the topic here.

Then we take these cluster indices. We call them visual words, because that's what the literature calls them; it's a technique borrowed from the pre-deep-learning days. If an image contains at least one occurrence of a visual word, we set a one at that position in a 1024-bit vector. This is basically what's called a bag of visual words. Ours is a little different because we only set the values to zero or one. With these deep learning features you get a fixed number of features for every image, exactly 196. So if there's a lot of background, I don't want a twenty in a single position just because there was a lot of background; that actually makes things worse. So we came up with the idea of storing just a binary flag for whether the feature appeared in the image at all, not how many activations it got.

Then we do a very simple thing: we start with some random image and find the next one that has the highest count of common visual words. We do an element-wise multiplication and a sum: if there's a one in the bag-of-words vector for the first image and a one at the same position for the second image, we get a one in the result, and we sum those up. As you can see in this example from the t-shirt dataset, we get really good results: this lady in the green shirt, for example, we grouped all of her images together.
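Putting the whole sorting backend together, here's a minimal sketch of the pipeline just described: quantize each patch feature to its nearest centroid, build the binary bag-of-visual-words vector, and compare images by the count of shared words. The features and centroids below are random stand-ins for the real ResNet patch features and k-means output:

```python
import numpy as np

def visual_words(features, centroids):
    """Assign each patch vector to its nearest centroid ('visual word')."""
    d = ((features ** 2).sum(1)[:, None]
         - 2 * features @ centroids.T
         + (centroids ** 2).sum(1)[None, :])       # squared distances
    return d.argmin(axis=1)

def bag_of_words(word_ids, n_words):
    """Binary presence vector: 1 if the word occurs at least once."""
    bag = np.zeros(n_words, dtype=np.uint8)
    bag[word_ids] = 1
    return bag

def shared_words(bag_a, bag_b):
    """Element-wise multiply and sum = number of visual words in common."""
    return int((bag_a * bag_b).sum())

rng = np.random.default_rng(1)
centroids = rng.normal(size=(1024, 256))   # stand-in for k-means centroids
feats = rng.normal(size=(196, 256))        # stand-in for a 14x14 patch grid
bag = bag_of_words(visual_words(feats, centroids), 1024)
print(shared_words(bag, bag))  # an image shares all its words with itself
```

Sorting then just means repeatedly picking the image with the highest `shared_words` count against the current one.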
Also, the last two t-shirts are pretty similar, and the same on the bottom row: you can see the visual similarity between these examples. So it works quite well.

It does have limitations. The biggest one is that the model is completely unsupervised: we didn't provide any annotations or labels, so it has no idea what we care about. And it happens, not sometimes but all the time, that it doesn't know what is the foreground object and what is the background of the scene. There are techniques for that, but they mostly work on video, because there you can guess what the camera is following; there was a very interesting paper from Facebook about it. With random still images it's difficult to do without supervision. We could train a segmentation network, but then it would be specific to a particular dataset, and we wanted to keep this unsupervised. So that's a limitation. On the right, the gray boxes represent the patches that are matched between the top and the bottom image. The two images are indeed very similar, but as you can see, a lot of the similarity, for the model, comes from the background, not the foreground. I don't have an idea for fixing that automatically yet. It still works pretty well, and background similarity often still helps, but you have to be aware of it. So this is an investigation and an open source project, and as I said, we don't have an enterprise plan to sell.
It's not a product, but we really would like to make it easier to use, so I'd welcome any feedback you may have, and any problems with data quality that we could try to help with to make the project more useful.

One thing we want to do: right now it's a separate tool, you have to prepare JSON files and so on, and we want to integrate it so you can run it straight from your Jupyter notebooks. It would be amazing if it worked back and forth: after you attach labels to the images saying which ones are okay, or apply some other classification of your own, you could easily get the results back and slice them further in Jupyter or in Python.

Another thing: as I said, we are not the only people interested in this problem, and there are some sorting techniques, you could call them, invented by others. One is Cleanlab. That started as a research project, and I think they recently founded a company around it. They use trained models to try to predict whether there's an error in the data, and it looks really interesting, but it didn't catch the lion among the red wolves. I checked. There's also a technique I've seen popularized by the fast.ai community: you train a classification model and check which examples have the highest loss during training. If the model has a hard time learning an example, maybe it's actually a mistake in the labeling, right?
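A toy version of that high-loss check, with hand-written probabilities standing in for a real model's outputs:

```python
import math

def per_sample_loss(probs, labels):
    """Cross-entropy of each predicted distribution against its label."""
    return [-math.log(p[y] + 1e-12) for p, y in zip(probs, labels)]

def most_suspicious(probs, labels, k=2):
    """Indices of the k highest-loss samples: candidates for label errors."""
    losses = per_sample_loss(probs, labels)
    return sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)[:k]

# three samples; the model strongly disagrees with the last label
probs  = [[0.9, 0.1], [0.8, 0.2], [0.05, 0.95]]
labels = [0, 0, 0]
print(most_suspicious(probs, labels, k=1))  # -> [2]
```

In practice you'd collect the per-sample losses during (or after) training and review the top of the ranking by hand.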
This works; I've seen it work. On the other hand, I tried it on the image-search datasets, and if there are too many errors it's no longer a good signal. But it's still useful. One last technique is what I did for the traffic signs: you can check your confusion matrix, see which classes get confused with each other, and those are great places to look for errors. So there are similar methods, each a bit different, and it would be great if you could use them without too much trouble, without having to write them from scratch.

There's also an interesting thing I have to share. There are deep learning clustering methods, and one of them is actually used in the DALL·E technique I mentioned before: vector-quantized variational autoencoders (VQ-VAE). That's a long name, but the idea is that you can apparently teach a model to learn the features and cluster them at the same time, in a single end-to-end process. It basically means that you learn a codebook.
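The codebook lookup itself, stripped of the training machinery, is just a nearest-neighbor assignment. A small numpy illustration, not the actual VQ-VAE implementation (which adds straight-through gradients so the whole thing trains end to end):

```python
import numpy as np

def quantize(latents, codebook):
    """Snap each latent vector to its nearest codebook entry,
    exactly like assigning a k-means cluster at inference time."""
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d.argmin(axis=1)            # one code index per latent
    return codebook[ids], ids

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
latents  = np.array([[0.1, -0.1], [0.9, 1.2]])
quantized, ids = quantize(latents, codebook)
print(ids)  # -> [0 1]
```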
Conceptually the codebook is the same thing as in k-means: you just find the nearest neighbor, exactly as k-means does. But the whole thing is implemented cleverly enough that you can train it end to end using gradient descent, and they show pretty good results. You've seen DALL·E's image generation; it looks like this really can capture the semantic information in a photo. We would love to try that, and I think we should be able to implement it in our method.

So, in summary: the QA task, I think, is very different from labeling and requires a new user interface, different from the labeling interface. One thing I noticed is that it's good to improve both the UIs and the AIs of the system you're building. If you focus only on the AI, you will not get optimal results, because every time you have a new AI system, you have a new frontier where a better user interface can make use of it. And one last thing: you can use deep learning, in this case unsupervised deep learning, to augment what people are already doing. You don't need to focus on replacing people; you can focus on helping them do a better job, and as you've seen, it's actually possible to improve the lives of the people using your tools quite substantially.

The image on the right was actually created by an AI model called Centipede Diffusion. I asked it for a robot cleaning the streets of New York, which are overflowing with papers. That's what came to my mind, and it's now the mascot of the project. I did a little bit of tweaking here and there, but that's a topic for another talk, maybe someday.

You can download the code I used for these examples from our GitHub repository. As I said, this is not a product.
This is more of an invitation for discussion. I'd love to learn what kinds of problems you have with quality in your deep learning datasets, and maybe we can figure out how to make this more useful for your particular problems. We created a GitHub community where you can chat with us; we'll be there, and if you have any questions, we'll really try to help. We have resources dedicated to that.

As I said, I'm from Collabora. I encountered this problem on a few projects, both before joining Collabora and afterwards, and they were amazing enough to let me just work on it: they said, okay, that's interesting, please work on it for a couple of months. So I highly recommend them. Thank you for your attention.

Okay, so the question is whether this was built on TensorFlow. It's actually built on PyTorch, with the fast.ai library; I just found that the easiest to use. The techniques are of course applicable outside that stack; you can use any framework for this.

The next question is how many hidden layers the model has, or what the optimal model is for this kind of problem. I tried a couple of things, and actually the simpler the model, the better. This was confirmed independently: I came to this conclusion myself, and then found it confirmed in a paper from, I think, Google Brain. There is a paper about perceptual similarity, and they found that the smaller the model and the shorter the training time, the better it is at judging perceptually similar things; they did proper research on it, created a dataset and everything. That's good news, because it means you can train these models very quickly.
You don't need a lot of resources. So it's a very simple model, but it works on all kinds of data. It doesn't work equally well on everything, but you can train it on data completely different from ImageNet and it still works, because it's fully unsupervised: you can train it on any image data you have. I haven't thought about how you could apply this to NLP, but in principle it should still work. Okay, if there are no more questions, thank you very much, and I hope to hear from you soon about your problems with data and how we could try to build an open source solution to help. Thank you very much.