All right, so my name is Alex Vinyals, software engineer at Skyscanner. I work in the hotels data team, the team that delivers all the data that populates the website and the apps. And I'm here to present the pipeline that we built to make sure that all the hotels have images in production. One of the biggest challenges when you're a meta-search is to unify all the data. This is what the results page for a meta-search looks like, and here you can see all the providers for the W Barcelona hotel. You see the three cheapest ones, but in reality there are many more, so we are getting the data for that hotel around 20 times. Our job is to make sure that it reaches production once, with the best possible data. Every partner that you integrate with gives you access to their catalogs. Those catalogs give you all the data for the venues that the provider has, and they usually have slightly different names, slightly different street addresses, and different coordinates. So our job is to match those venues: do some magic and make sure they reach Skyscanner once, with the best data. If we do that for all the hotels in the catalogs, we end up with a data release, and this is the Skyscanner catalog. So what about the images? Here you can see a search result with no image, and you as a customer are not going to look at that hotel. Images are a critical piece of the product, so let's see how we make sure they reach production. The case is pretty similar. Every partner gives you all the images they have for a hotel, and they usually give you really similar images: you're going to see corrupt images, or slightly shifted ones, or different colors. Our job, again, is to pick the best one and put it in production. Here you see the images in the results page.
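To give a flavour of what that venue matching involves, here is a minimal sketch, not Skyscanner's actual algorithm: it treats two catalog entries as the same venue when their coordinates are close and their names are similar. The thresholds, the field names, and the use of `difflib` for name similarity are all illustrative assumptions.

```python
import math
from difflib import SequenceMatcher

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two coordinates, in kilometres.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def same_venue(a, b, max_km=0.5, min_name_sim=0.75):
    # Venues are dicts like {"name": ..., "lat": ..., "lon": ...};
    # both thresholds are made-up values for illustration.
    close = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_km
    similar = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio() >= min_name_sim
    return close and similar
```

In practice a meta-search matcher has to be far more robust (abbreviations, transliterations, wrong coordinates), but the shape of the problem is this: fuzzy name comparison gated by geographic distance.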
If you tap there, you get an expanded gallery that shows the images in more detail. And if you look closely, you're going to see, down below, duplicated images. This is what we are trying to avoid. As you can see, they are quite similar images, but slightly moved or shifted or cropped, so it's really hard to remove them all. With around 200 partners, around 1 million hotels reach production. This means we have to process around 35 million images. But there's a trick here: we resize. We resize for mobile devices, because we don't want a mobile phone to be resizing the images. There you see the thumbnails, the smaller ones. All that work is done on our side, and we have around 40 different resize configurations, which multiplies the number of images we end up processing. So I'm here to tell you the tale of an image processing pipeline. But before doing that, let me tell you the tech stack. We are happy users of Postgres on RDS, managed. We've glued together all the steps of the pipeline with SQS, the queuing service. And for machines, we just get EC2 instances; we auto-scale them, so when there's nothing to do there are no machines up. As far as libraries are concerned, we use Django, with Django REST framework, which is incredible for making APIs. We avoid the Django ORM; we use SQLAlchemy instead. For messaging, we use Kombu. You may already be using this library, because it's the underlying library under Celery; it's maintained by the same guys, and it's simply lower level. For the Amazon stuff, we use boto. And for image processing, we use Pillow, which is a nice library with C bindings, so it gives you good performance for manipulating images. And we use Python 2.7, for technical reasons. So, the tale of an image processing pipeline. We're going to start with the triggering.
There are two big groups of steps here: the asynchronous steps and the synchronous steps. The asynchronous ones run continuously, making sure all the images are ready to be used by the synchronous steps. So, triggering. The trigger is basically a small worker that keeps running through all the catalogs. It looks for URLs there and computes the diff. The state is stored in the database, and basically we say: all right, this catalog has been updated, and there are URLs that disappeared, so those images should be deleted; and there are new or updated images. This payload is computed for every partner's catalog and sent to an API, the image release API. When the API receives the payload, it stores each image into the database, and we keep things like the URL, an identifier for the image, which provider gave it to us, and which hotel of that provider it belongs to. Then the API queues messages for the next step, which is downloading. The downloader step gets the messages from the queue, hits the partner's CDN, gets the image, and puts it into our system, into an S3 bucket, so that if the CDN fails or the partner removes the image, we still have it and we can roll back at any time. This is what the callback, the worker, looks like. Basically, we get the image URL, hit the CDN, get the contents, register a new key in S3, and set the contents of that key. Once it's in our system, we open the image with Pillow and ask some pretty basic questions, like: should I filter this image? Is it corrupt? Does it have enough resolution? If the image survives all of that, it will go to the fingerprinter, because we push the message to the next queue. Notice here this reliable callback decorator; this is something we do on all the workers.
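The diff the trigger computes can be sketched as plain set arithmetic over the URLs seen in a catalog. The function name and the payload keys here are hypothetical, not the real worker's code:

```python
def catalog_diff(previous_urls, current_urls):
    # previous_urls / current_urls: the image URLs seen for one partner catalog
    # on the previous and current scan. URLs that disappeared are scheduled
    # for deletion; new URLs are scheduled for download.
    prev, curr = set(previous_urls), set(current_urls)
    return {
        "deleted": sorted(prev - curr),
        "added": sorted(curr - prev),
    }
```

A payload like this, computed per partner, is what would then be posted to the image release API.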
We don't want workers to die on unexpected errors, so the main thing this decorator does is make sure the workers don't die. We do one extra thing here: you see that we convert a warning into an error. Pillow warns you about decompression bombs, and we actually got a few of them, and you don't want a worker to die on a decompression bomb; you want it to keep going and move on to the next message. So here we catch everything, log it to your log aggregator or whatever you're using, then run the function, and that's pretty much it. This is what a worker looks like. It instantiates a consumer that connects to the backend and maps all the messages in a queue to the callback we just saw. Here you see the Kombu primitives being used: we create a connection to the backend, then we start consuming all the messages on that connection, mapping each message to the callback. The callback gets the message and the body of that message; we process it, and if everything goes fine, we acknowledge the message and move on. After the image is downloaded, we go to the fingerprinting step, a really simple one: we download the image from S3 and compute some identifiers that allow us to do further processing. The question we will try to answer later is whether two images are the same or not. For a computer, these two images are not the same, not really, because they have different bytes, different sizes. It's not the same image. But if you were a user and you saw both on the website, you would be disappointed, because the second one isn't telling you anything new. It's redundancy, and that's what we want to avoid. So yes, they are the same image, and this is what the fingerprinter does: it computes some kind of hash. We use the ImageHash library, which implements different hashing algorithms.
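A rough sketch of such a reliable-callback decorator, assuming the real one differs in detail: it escalates warnings (like Pillow's decompression-bomb warning) to errors, catches everything, logs the failure, and lets the worker move on to the next message.

```python
import functools
import logging
import warnings

log = logging.getLogger("worker")

def reliable_callback(func):
    # Wrap a worker callback so a bad message never kills the process:
    # warnings become exceptions, and every exception is logged and swallowed.
    @functools.wraps(func)
    def wrapper(message, body):
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("error")  # e.g. a decompression-bomb warning
                return func(message, body)
        except Exception:
            log.exception("callback failed; skipping message")
            return None
    return wrapper
```

The real decorator would also decide whether to acknowledge or requeue the message; that part depends on the Kombu consumer around it and is left out here.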
I believe average hashing, perceptual hashing, and difference hashing. We ended up using difference hashing, with a slight modification: a cropped-hash thing. What that does is create sub-images of the image: it crops the image several times, and then we compute a sub-hash for each sub-image. The reason we do that is to maximize the chances of matching duplicated images. This kind of hash is such that similar images have a small distance between their hashes. So, time to deduplicate. At this point the data release runs and says: okay, I have one million hotels that are going to reach production, and now you need to make sure those hotels have images. This goes to the API, the API starts processing it and queues the payloads as messages, and the deduplicator updates the status in the database and, if needed, moves to the next step, which is prioritizing the images; we're going to see that now. So, what is a group of hotels? A group of hotels, for us, is just pairs of IDs. We identify a hotel by the provider and the ID of that hotel for that provider. So if we see three pairs of IDs here, it's a group of hotels provided by three partners. That's what we call a group. And when I say "if needed": this yellow line is the queue of messages in the deduplication step. You see it rises up to about one million and then starts descending; this means the workers are processing the payloads. The blue line is the queue of the next step, and you can see that not all the messages go forward. We try to be differential: if you do two releases, not all the data changes, so you can reuse what you already computed in the previous release. That's what we're trying to show here.
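As a simplified illustration of difference hashing, here is a stdlib-only sketch that works on a 2D list of grayscale values already scaled down to (hash width + 1) × hash height. The real pipeline uses the ImageHash library on Pillow images, and the cropped sub-hashes are just the same computation applied to cropped regions.

```python
def dhash_bits(gray, hash_w=8):
    # gray: 2D list of grayscale values, each row hash_w + 1 pixels wide.
    # Each bit records whether brightness increases left-to-right, so the
    # hash survives uniform changes in brightness, compression, and resizing.
    bits = []
    for row in gray:
        for x in range(hash_w):
            bits.append(1 if row[x + 1] > row[x] else 0)
    return bits

def hamming(bits_a, bits_b):
    # Number of differing bits; small distance means "probably the same picture".
    return sum(a != b for a, b in zip(bits_a, bits_b))
```

With 8 rows this gives a 64-bit hash, which is the usual size for this family of perceptual hashes.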
So, deduplication. The deduplicator grabs a hotel group with all its providers and fetches all the images we have for them. Ideally, what we want to do here is identify that there are two image groups: the image group of the room and the image group of the pool. That's all we want to do. The conclusion we store is that this hotel group has those two image groups. And this is what it looks like: we have the set of images to be processed, and we have no groups. Then we start moving. We seed a group with the first image to be processed, and we try to expand the group by comparing it against the other pending images. We always ask the same question: is this the same picture? The arguments we pass here are the hashes we computed back in the fingerprinting step. What this "same picture" check does is take all the hashes we computed and compare them with the Hamming distance. If that distance is below a threshold, we consider the images to be actually the same, or quite similar. So how do we tune that step? How do we get a guarantee that tomorrow we don't break everything and end up with tons of duplicated images in production? What we do is build a corpus: you grab a big sample of images and go manually over them, setting up the groups by hand. You set the truth; you say, as a human, these images should be a group. Then you can run the automatic algorithm and get some metrics, and with those metrics you can keep tuning the code and see whether the metrics improve or not. You keep making improvements until you are happy, and that's how you get guarantees that you don't break things. Then we go to the prioritization step, which is choosing the best images. Now we know which groups of images we have for each hotel; now we need to pick the best ones.
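The seed-and-expand grouping described above can be sketched like this. The threshold value and the names are illustrative, not the production ones:

```python
def hamming(a, b):
    # Number of differing bits between two hash bit-sequences.
    return sum(x != y for x, y in zip(a, b))

def group_duplicates(images, threshold=6):
    # images: list of (image_id, hash_bits) pairs.
    # Greedy grouping: take the first pending image as a seed, pull in every
    # pending image within the Hamming threshold, repeat with what's left.
    pending = list(images)
    groups = []
    while pending:
        seed_id, seed_hash = pending.pop(0)
        group = [seed_id]
        rest = []
        for other_id, other_hash in pending:
            if hamming(seed_hash, other_hash) <= threshold:
                group.append(other_id)   # "same picture": joins the seed's group
            else:
                rest.append((other_id, other_hash))
        pending = rest
        groups.append(group)
    return groups
```

With the cropped sub-hashes, the real check would compare each pair of corresponding sub-hashes rather than one hash per image, but the grouping loop has the same shape.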
Prioritizing is quite simple. It updates the status in the database, and what it does is say: okay, I have two image groups; I'm going to get all the images again and pick the best one from each group. So it says: this is the best image, and this is the best image. We base that decision on the pixels: resolution, histogram of colors. Once we know that, we sort them, and if we have something to base the ordering on, we prioritize them accordingly. Sometimes partners tell you "this image should be the first one", and if we have that data, we use it. And then that reaches production. So, what could go wrong when you pick the best image? This could go wrong. Yeah. Picking the best image is not an easy task; it's really hard. This image had tons of colors and tons of pixels, but obviously it's not the best one. We have sort of an MVP to extract features from images, and obviously if we found a word that says "there's a toilet here", we would put that image at the back. Doing that is complex, so in the meanwhile we have tools so you can go there manually, sort those images, and fix them by hand. It doesn't happen all the time; it's really specific cases, so that's why it doesn't have that much priority. And now that you've prioritized the images and you know which ones are going to reach production, it's time to resize them into all the sizes. This worker gets the payload, makes all the sizes, and puts them into a bucket in S3. That special folder is served through a CDN, so you get reduced latency on the website and the apps. And that's pretty much it. This is what the worker does. When you have an image in Pillow and you want to resize it, it's quite easy: you can just call resize, and you're done.
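Before calling Pillow's resize, you need the target dimensions for each configuration. A minimal sketch of that computation, with made-up size configurations (the real pipeline has around 40):

```python
# Illustrative (max_w, max_h) boxes; the real 40 configurations are not shown.
SIZE_CONFIGS = [(64, 64), (160, 120), (320, 240)]

def fit_within(src_w, src_h, max_w, max_h):
    # Largest size that fits inside the box, preserving aspect ratio
    # and never upscaling.
    scale = min(max_w / float(src_w), max_h / float(src_h), 1.0)
    return (int(src_w * scale), int(src_h * scale))

def all_sizes(src_w, src_h):
    # One target size per configuration; each result is what you'd pass to
    # Pillow's img.resize() before uploading to the S3 folder behind the CDN.
    return [fit_within(src_w, src_h, w, h) for w, h in SIZE_CONFIGS]
```

Each resulting tuple is then one `img.resize(size)` call per configuration, and the outputs go into the CDN-backed bucket.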
If we make the image smaller, we may want to improve the contrast. You can use the ImageEnhance primitive and change the contrast as you want. And that's it for the pipeline. This is the final result: all the images have reached production, and you can see how they progress through each step. And this is the schematic of how it looks on Amazon. Basically, we have a distributor machine, which is just deployment support: we keep all the dependencies there as wheels, and the other workers retrieve them from there. This saves time when getting heavy packages, because installing wheels is really fast. Then we have the image release API, used for interfacing with the data release. We also have a health check there that checks the queues, and if there's tons of work to be done, it triggers a CloudWatch alarm that spins up more instances. Then we have the auto-scaling group of workers, and everything is connected to the database, which is the central piece. Scaling the database on Amazon is quite easy: you just provision more IOPS if you need more throughput, so we are happy with that. And then the final bucket is served with CloudFront on top, and that's pretty much it. So, thanks for listening. If you have any questions...

Sorry, can you repeat? Did you consider using database events instead of a triggering process that scans the database? No, it hasn't really been a problem. We have maybe 80% usage, and we are not using that big of an instance. Okay. Have you considered using one of these fancy machine learning techniques to classify images? Yeah, we have kind of an MVP, but it needs more work. Basically, you get basic words for what has been detected in the image.
Like, "I see a bed" or "I see a toilet", and that would be enough to get rid of those issues. But it needs more work. What about cases where you have one image that shows all the features of the room but has a bad resolution, and another one with a great resolution that just shows the bed? Do you have a way to... No, usually we try to trust the partners. We know that Leonardo is a good provider of images, so if we have Leonardo, we prioritize Leonardo. If we have the ordering of the images, a partner telling us "hey, this image needs to be on top", we trust that. And as a last resort, we just check resolution and the histogram of colors; if an image has only blue color, we are going to ignore it. It works quite well. We have here the metrics of the corpus. The most important metrics are completeness and purity. For completeness, we basically try not to put a wrong image in a group: if we have a group of beds, we don't want to put a toilet there. Filtering more aggressively reduces completeness, because we filter out more groups of images, but of the 91% of image groups that we keep, 97% are pure. This means that of all the groups we produce, 97% contain only the same picture: all beds, or all bedrooms. Could be better, of course, but... Yes, the MVP for... you mean for recognizing features? Oh, no. That one's pretty easy to answer: we have the workers, which are completely independent of Django, and we knew we were not going to couple them with the Django API. So why implement the models twice? We chose to do it in SQLAlchemy and not pass the Django settings to the workers. Just a technical decision. The question is if... yeah. Well, with Django REST framework you get the browsable API, you get hyperlinks; we're happy using it. The API is maybe the least important of the components here; the workers are much heavier. I guess that's it.