So thanks everyone, so my name is Piotr Rybak and I will talk to you about a dream I have had for a few years now, because I decided that I want to build a mobile application to recognize LEGO bricks. So, okay, why would I do that? Why would I want to take a photo of a LEGO brick and get the part ID? To be honest, because it's fun. But it also has some practical application, maybe practical. You see, there are a lot of LEGO fans, adult LEGO fans, that buy used LEGO collections in mixes. So someone sells their old LEGO collection and all the parts are mixed together in one box, and people buy these for two reasons, mainly. One, because they are businesses and they sell those parts individually. So say, okay, you want to buy this particular brick, I can sell it to you. But if they buy used bricks, they have to sort them out. Or they are LEGO fans who take those mixes and recreate the original vintage sets, and sometimes they also sell them, but sometimes they keep them in their own private collections. So to do that, you actually have to go through all of those bricks and recognize all of them.

And it sounds like a trivial task, like how many LEGO bricks can there be, a few hundred or something like that. But actually no, there are over 70,000 individual LEGO pieces. So it's kind of surprising, at least it was for me, but actually LEGO has been on the market for like seventy years, eighty even more. And there are a lot of vintage parts, like one particular element that was only available in one set on the US market in the 1950s. And it's difficult to recognize them, no one knows all of them. And there are a lot of patterns, like for example minifigure torsos. There are like 10,000 different minifigures, so it's also difficult to recognize them all.

Fortunately, there are databases of LEGO bricks, like IMDb but not for movies, for LEGO. There is something called BrickLink, and BrickLink is the biggest marketplace for selling individual parts. So if you want to buy individual parts or sell them, you go there, so they obviously have a database of bricks. There is also Rebrickable, the second one. They are also a marketplace, but they sell instructions for LEGO sets. So if you are a LEGO fan and you build something really cool, you can prepare the instructions and sell them to other fans to build. But the problem is that they only have text-based search. So if you know the name of a part, you can easily search for it. But if you have this part here, any idea what this is? It's a tractor chassis, obviously, right? So I hope this gives a little understanding of why we would want to do this.

But how? Okay, obviously use some machine learning, some image recognition. And whenever I start a new machine learning project, I like to start with a validation set, to represent what I'm actually going to do and to have a way to assess my progress, basically to see if I'm going in the right direction. So the validation set should represent our task as closely as possible. Our task is to make a mobile application that takes photos of bricks and recognizes them. So obviously, if I had this application, I would just take the images from my users, but I don't have this application, I'm building it, so I have to bootstrap it somehow. Fortunately, the problem I am tackling is not imaginary, and there are a lot of people searching the web for the answer to: what is this piece?
And they ask online, there is even a Bricks StackExchange for this purpose, where people ask, sorry, what is this brick, and actually some people answer them that this is brick number 33286, brick round one by one with flower edge. So obviously we could scrape this site and get a really nice validation set. And I would advise not to do this. Why is that? Because obviously real life data is really messy, and if you scrape it automatically, you will get a lot of corner cases that will not work well enough. For example, people will ask for a part and put up a photo of multiple parts, or multiple instances of the same part, and you don't want this. You want clean pictures with one part per image. They will put up renders and ask, which one is this little part? And again, you don't want this in a validation set. Or they just put up some drawings or renders. And I mean, okay, they are looking for a part with this picture, but maybe this is not the real life use of our mobile application, because we expect that people will just take a photo with a smartphone and not submit a picture like this. So there is really no other way, you just have to do it by hand, and it's really beneficial, not only because you get clean data, but because you also understand what this data is about. And my trick for doing this data annotation is procrastination driven annotation. So every time I want to go to Twitter and just spend my time doing nothing, I annotate a few examples instead. And it sounds stupid, but I can do like 50 or 100 examples a day, and after a few months you get a pretty decent data set.

So okay, I spent a few months and I have a validation set. How does it look? There are basically three types of bricks there. There are some really vintage parts, like from the 1950s. This is the base for a garage, because LEGO actually used to produce small metallic cars, like Matchbox cars, and you could build the cities and garages with LEGO bricks. There are some minifigure torsos or something like this, basically parts with decorated elements. And these are notoriously difficult to search for, because how would you search for this? I don't even know a keyword except for yellow torso, and that's not really distinguishing. And there are some photos of common parts, but actually very few, because people know those common parts. This is slope one by two, 45 degrees, obviously.

Okay, so we have a validation set and now we want to train a model, some model. So we need a training set. We can Google for training sets, and actually there are a lot of them. If you just search, there are like, I don't know, 50 at least. But the problem is they are mostly renders. Some are pretty poor quality. Some are better. Some are pretty weird, like what is that really? What are those colors? They're not LEGO colors, but okay, maybe they would be enough. But the problem is that even the biggest one, this one, only has images for 600 parts. And we know there are 72,000 LEGO parts available, so it's not really enough to train a model on. So even if we go to LDraw, which is a repository of 3D models of LEGO bricks, and make our own renders, it will not be enough. Why? Because many people have tried that before, many people have trained models on renders, and it's difficult to go from these 3D models and renders to real life photos and still get a model that performs on real life photos.
So the only two data sources I came up with while googling were from guys building LEGO sorting machines. These are two examples. And obviously I mailed them to see if they could share their data sets, but sadly they couldn't. One of them had deleted the data set. I mean, who deletes a data set? Come on, of LEGO. But it was still beneficial to contact them, because they had a really nice idea for preparing those data sets quickly. So basically this is how a sorting machine works: you have some LEGO bricks, you put them onto a conveyor belt, one brick at a time, then you take a photo of the brick on the belt, make a prediction, classify the brick and put it into a separate box. I obviously don't have this mechanism, but I don't need it, because if I put a lot of bricks of the same kind onto this belt, I can take a lot of photos of the same brick and just annotate the whole bunch once. So it quickly gives me a lot of pictures of the same brick. I obviously have to already have sorted LEGO bricks, but I'm a LEGO fan and I have those. And each drawer here holds one individual part, so I can just build this sorting machine without the sorting part, actually. Just pour the LEGO bricks from one drawer, pass them through and get a lot of pictures.

So this is how it looks. You have a Ferris wheel here, it rotates indefinitely, on each platform there is one LEGO brick, and you will see it goes down here onto the platform with the white background, I take a photo, the brick goes back in again, and it goes on and on and on. It was able to generate around 40 images per five minutes, so I would just load the bricks, run it, and after like 10 or 15 minutes collect the photographs and change the bricks. So this is how the data set looks. I started with a small batch of 500 parts. This is like one third of my collection, but I decided, okay, it will be enough. I have 15,000 images, let's start with that. I also thought, okay, but we have those renders, those 3D models, I will make renders from them, because why not? Let's see if people are actually right that renders don't work. So again, for those 500 parts, I created renders, but then I realized that it's actually pointless, because Rebrickable lets you download the renders for free, just like that, as a zip file. So I used them instead, and we have some data.

Okay, but it's not enough. We only have photos for 500 parts. They are real photos, not renders, so that's cool, but how do we actually predict those 70,000 different pieces? The approach is not classification, because obviously if we have examples for 500 classes we cannot extend it to 70,000 classes. We'll use something different, something called metric learning or similarity learning, and it works like this. We start with a photo of a brick. We put it through a ResNet. ResNet is an image classification network, but instead of getting the classification we will get some representation, some vector representation of this part. We take the render for this part, put it through the ResNet again and get a representation again. Then we take a photo of some other part, some random negative part, and we do it again. So we have three representations, and then we train the model to pull the render and the photo of the same part together, and the render and the negative example apart.
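As a rough illustration of this triplet setup, not the exact code from the talk, here is a minimal sketch assuming a torchvision ResNet-50 backbone and PyTorch's standard triplet margin loss (the actual loss, embedding size and hyperparameters are not specified in the talk):

```python
import torch
import torch.nn as nn
from torchvision import models

class BrickEncoder(nn.Module):
    """ResNet backbone that outputs an embedding vector instead of class logits."""
    def __init__(self, embedding_dim=256):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        backbone.fc = nn.Linear(backbone.fc.in_features, embedding_dim)
        self.backbone = backbone

    def forward(self, x):
        # L2-normalize so distances between embeddings are well behaved
        return nn.functional.normalize(self.backbone(x), dim=-1)

encoder = BrickEncoder()
criterion = nn.TripletMarginLoss(margin=0.2)

def training_step(render, photo, negative):
    """render/photo: batches of images of the same part; negative: a different part."""
    anchor = encoder(render)      # render of the part
    positive = encoder(photo)     # real photo of the same part
    neg = encoder(negative)       # image of some other (random negative) part
    # Pull anchor and positive together, push anchor and negative apart
    return criterion(anchor, positive, neg)
```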
And if we train the model this way, we train those representations to basically be similar for the same part and different for different parts, obviously. So this is the training, and how does the inference work? We start with a photo, we only have one, like we are taking a picture, put it through the ResNet, then calculate the similarity to all the pictures in our Rebrickable renders and return the most similar one as the prediction. The benefit of doing this is that we can actually predict all 10,000 classes covered by the Rebrickable renders, not only the 500 we had in our training set.

So how does it work? With renders, terribly, okay? We confirmed that people were right, renders don't work, but if I use my photos it works similarly terribly. So I basically spent a few months, sorry, a few weeks, building this machine and taking the photos, and it doesn't really work. I was kind of depressed then. If I combine the two approaches it's better, but still not enough. So obviously it doesn't work. The idea of creating photos this way is wrong, and it's wrong because it has really low diversity. It would work for a sorting machine, because there you always have the same lighting conditions and the same camera angle, but if I want a general application for anyone to use, it's not enough. And there are also still a lot of missing parts: there are 72,000 parts, but only 10,000 Rebrickable renders. And even if I used all the 3D models from the LDraw repository, there are only 17,000 of them.

So if renders are not enough, we can do something different: we can crawl one of those websites and get all the product photos for each part, basically. They will often be terrible, they will often be renders, and there is one per color, so we have to aggregate them. A useful trick here is a perceptual hash. This is kind of like an MD5 hash, but it preserves similarity: if the images are similar, the hashes will also be the same or similar. We will also need to aggregate similar parts, because believe it or not, these are actually six different parts with six different IDs, because they differ in tiny details. So we aggregate them together to make it easier for the model to learn. And we run some OCR on the images, because some of them, since the parts are similar but different, are comparison photos showing two different parts side by side with their part IDs written on them. So we run OCR, and if I find a part ID in the text, I throw that picture out of the training set. And I have to make this check, this if, specifically for a part ID in the OCR output, because obviously there are some bricks with text on them and that's okay. So at the end I have around 90,000 images for 66,000 parts. The bad news is that only 12,000 of them have more than one photo, and we can only use those for training. How does it work? Pretty good, like almost 50%.

So what can we do next? We can focus on those negatives. Obviously, if we compare a simple brick with a door, it's easy for the model to distinguish them. But if we swap this random negative for something more difficult, the model will learn to distinguish the two even better. So how do we actually sample those hard negatives? Well, for each part we have a name, and we can search for parts with similar names. For example, for brick two by four, a similar name is Duplo brick two by four, and we get a hard negative. These are some examples, and it looks pretty okay.
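The talk does not say exactly how part names were matched, but one plausible way to sketch this name-based hard negative mining is character n-gram TF-IDF over part names (the part IDs and names below are just an illustrative toy catalogue):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue of part names; IDs are illustrative only.
part_names = {
    "3001": "Brick 2 x 4",
    "3011": "Duplo Brick 2 x 4",
    "3020": "Plate 2 x 4",
    "3832": "Plate 2 x 10",
    "60596": "Door Frame 1 x 4 x 6",
}
ids = list(part_names)

# Character n-grams make near-identical names like "Brick 2 x 4" and
# "Duplo Brick 2 x 4" score as highly similar.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
name_vectors = vectorizer.fit_transform([part_names[i] for i in ids])
similarity = cosine_similarity(name_vectors)

def hard_negatives(part_id, k=2):
    """Return the k parts whose names are most similar to the given part's name."""
    i = ids.index(part_id)
    ranked = similarity[i].argsort()[::-1]
    return [ids[j] for j in ranked if j != i][:k]

print(hard_negatives("3001"))  # e.g. ['3011', '3020']
```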
They are really similar parts, and it actually increases our performance by almost 3%. So what next? This is actually the point when I was confident that the model works and I could do some hyperparameter tuning, and I did: a lot of data augmentation, loss functions, hyperparameters. All in all, I trained 81 models and increased the model performance by 20%. But the problem is that I still only use those 12,000 parts for training, and I don't have anything for the rest; I use them in the index for prediction, but not for training. But I do have the names of the bricks, right? So we can actually use this in our metric learning setup to train similarity not only between images, but also between texts. So we put a text encoder together with the image encoder and pull together the text and the image for the same part, and we can do this for every part. When I was tuning this approach I was actually getting pretty nice results, but in normal training it was only about one percentage point better. And sadly, when I actually used all those images, all those parts, it hurt the performance. So these results are still, again, for those 12,000 parts.

Okay, so the idea of this text-image similarity isn't new. OpenAI published something called CLIP a few years ago, and it is exactly this: similarity between images and their captions. So I decided maybe I will use this model and fine-tune it on LEGO bricks. My baseline approach was a ResNet trained on ImageNet and then fine-tuned on my dataset; it was at 70%. If I use a vision transformer, which is what they used in CLIP, pre-trained with CLIP, it's actually better. So the approach worked and I was very happy, but this is obviously bad science: I changed two things, the model architecture and the data it was pre-trained on. So if I use only a vision transformer pre-trained on ImageNet, it goes up to 79%.

So it's great, the model is starting to work better, but a single metric doesn't tell us what's wrong. It only tells us, okay, I get some result. And the best way to know what's wrong with the model is to look at the predictions, at the errors, but that is time consuming, right? You have to spend a lot of time, and when we are training hundreds of models, you cannot do this. So the idea is to divide the dataset into slices and calculate the accuracy for each slice. I will skip quickly through these example slices. For example, if we have an image that is nicely cropped around the brick, the results are pretty high. This is accuracy at one, compared to the accuracy at ten previously, so the results are not comparable. But if I have a non-cropped image, the model doesn't work. So what can we do about that? Well, we can spend one day annotating data with bounding boxes. You can actually get over 1,000 bounding boxes in one day, take any off-the-shelf detection model, train it, and increase the performance on this data slice significantly, from 14 to 58%. Another idea: let's look at the angle the photograph of the part is taken from. This might be a typical angle or some atypical angle, like from the bottom. Obviously the product photos won't have those atypical, rare angles, so the model doesn't learn them, but we can use renders here, because we can render parts from any direction. And if we add those renders to the index, it helps the model, and if we also put them into training, it still helps significantly. So overall, we started with a baseline of almost 80%.
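To make the slice-based evaluation concrete, here is a minimal sketch assuming the predictions are collected in a table with hypothetical slice columns such as cropped and angle (stand-ins for the slicing criteria mentioned in the talk); the part IDs and values are made up:

```python
import pandas as pd

# Hypothetical evaluation table: one row per validation image,
# with the model's top-1 prediction and some slice attributes.
results = pd.DataFrame({
    "true_id": ["3001", "3020", "973pb1", "3001", "4085"],
    "pred_id": ["3001", "3832", "973pb1", "3011", "4085"],
    "cropped": [True, False, True, False, True],
    "angle":   ["typical", "typical", "typical", "bottom", "typical"],
})
results["correct"] = results["true_id"] == results["pred_id"]

# Overall accuracy hides where the model fails...
print("overall accuracy@1:", results["correct"].mean())

# ...so compute it per slice instead.
for column in ["cropped", "angle"]:
    print(results.groupby(column)["correct"].mean())
```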
We got 5 percentage points of improvement by adding detection, and another almost two with the rare-angle renders. And these are some cherry-picked examples from the final model. This is, for example, a handle, we have a photo from below, low quality, and it correctly finds the right part. It even works for pictures of a part that aren't real photographs. I'm not sure why anyone would look for this, because they actually had the part right there, but it works, so okay. If the photo quality is poor, low lighting, it's okay again. And this is my favorite example. This is a torso from Star Wars, and this is the back of the torso, you can see the hood, and still we found the correct torso while only having the front side of the torso. So to summarize, these are all the models, 200 of them, that I trained to get to this point. I started with hard negatives, then I did some hyperparameter tuning. I think somewhere around here I submitted my proposal for this talk, so you can see, it keeps you motivated to have something to talk about. I did some text similarity, vision transformer, detection, renders, and that's all. Thank you very much.

Thank you very much for the presentation. We have a tiny amount of time left if there is a question from the audience. So if there is anybody who wants to ask a question about this, please raise a hand. And we have a raised hand, please rush to the microphone and ask a question.

Okay, so thank you for this talk. It was very engaging and cool. And I want to ask if there's any chance to, let's say, play a little with this app. Is there any perspective of deployment or something like that?

Yeah, so basically it was supposed to be deployed already, but actually it seems like deployment of machine learning models is difficult. So yeah, it should be live in a few weeks. If you write to me at this email address that you cannot see at all, I will send you a link. It will be called brickognize.com.

Cool, thank you. And the next question, please.

Hi, regarding the index, I wanted to ask if maybe some of the accuracy is because of, and here is the question: is the index always scanning through all of the samples, or is it approximate nearest neighbors?

It actually scans all samples, because I only have like 100,000 examples, like photos. So this is really small and you can do this in, I'm not sure, 20 milliseconds I think, or something like this. So it's not a problem to scan them.

Okay, thanks.

Okay, that's all the time we have. Please have another round of applause for Piotr. All right.
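For context on that last answer, the exhaustive (non-approximate) nearest-neighbor lookup over a pre-computed embedding index can be sketched like this, assuming embeddings are L2-normalized so the dot product equals cosine similarity (sizes and part IDs below are placeholders, not the real index):

```python
import numpy as np

# Hypothetical index: one L2-normalized embedding per indexed image
# (renders plus crawled photos), with a parallel list of part IDs.
index_embeddings = np.random.randn(100_000, 256).astype(np.float32)
index_embeddings /= np.linalg.norm(index_embeddings, axis=1, keepdims=True)
index_part_ids = [f"part_{i}" for i in range(100_000)]

def predict(query_embedding, top_k=10):
    """Brute-force scan: cosine similarity against every indexed image."""
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = index_embeddings @ query          # one dot product per indexed image
    best = np.argsort(scores)[::-1][:top_k]    # highest similarity first
    return [(index_part_ids[i], float(scores[i])) for i in best]

# At roughly 100k x 256 floats this is a single small matrix-vector product,
# which is why an approximate index isn't needed at this scale.
query = np.random.randn(256).astype(np.float32)
print(predict(query)[:3])
```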