Okay. Hi everyone. Thank you for coming, and for waiting a bit. Today we're going to talk about teaching computers how to see, and how this impacts your business. Let me start with an image. As humans, we understand this image at a glance. We see Obama making a joke, and there are a lot of elements in this picture: different people, people in the mirror, people smiling, and one looking a bit confused, right? And there are four elements that tell us a joke is happening in the image. These ones. Now we want to train machines to understand that same reality, and I think this is a very complex task. So let me start with one question: do you remember how you learned to see? We have to teach machines something that we do at first glance. How can we teach something that we can barely explain? You will agree with me that teaching is hard: you have to understand something in order to translate it for other people. So we all agree that vision is something easy to do but hard to explain, and we want to teach computers this vision. But let me ask you a question: do you know how we learned? Between zero and three years old, we were exposed to hundreds of millions of images. We learned by example. So this is what we will do with machines: we will show them a lot of examples so they can learn how to see. The point of machine learning is that you don't have to program the specific steps for the computer; machines learn by example and extract their own logic. And the most amazing thing is that machines nowadays are solving tasks that are easy for humans to perform but very hard to explain, like seeing or detecting objects. So in this talk we will cover computer vision: why now, how it works, real examples with takeaways, and the future impact. Let me start with why now. There are three elements in computer vision: first, images; second, computing power; and third, machine learning. The evolution of these three elements up to today is what makes this possible. First, image digitalization: our images are digital now, we are not in the 90s with Polaroid cameras; storage prices have kept falling, which matters because images are heavy files; and the internet is a great source of images, with tools to collect and process them at scale. Second, Moore's law: computing power doubles roughly every two years. We also have new processors, GPUs, that make all of this easier and faster, and cloud-based platforms where you can rent hardware instead of spending a lot of money buying expensive equipment. And third, machine learning. Machine learning, and in particular deep learning, is now real. Why are we talking so much about deep learning? Because deep learning outperforms other learning algorithms as the amount of data grows. You need a lot of photos to train these algorithms, but they outperform the rest, and we are now in the big data era. This is why computers are now ready to see. So why is now an opportunity? First of all, the world is more visual than ever: compare a New York Times front page from a few decades ago with one from this month. Second, computers need to understand the real world in order to be part of it; this is what a self-driving car sees. And third, companies are investing heavily: this chart shows total 2017 fundraising for AI startups in the US, around five billion dollars.
And in computer vision, we are now starting to see the same fundraising trend. So please, Kike, could you tell us how it works? Thanks, Carlos. Great. So, 1816: the first type of camera was invented. That was a great moment in the history of inventions; humans were now able to capture a moment on a piece of paper. Two hundred years later, we are able to digitalize what we see. All this information is stored. How is it stored? Well, we can zoom into a part of the picture, zoom in again, and see that the computer is saving a bunch of numbers. It's a huge matrix, and each cell in the matrix represents the intensity of one of the colors: red, green, and blue. All this information was there. In the case of the human eye, perception feels easy: you can recognize that this is a rook, a chess piece, really quickly. But it is not easy; if you think from the computer's perspective, it takes a lot of work. The brain does it in a few milliseconds, which is pretty amazing. The eye works with two million parts; it's the second most complex organ in the body after the brain. It's so powerful that it can differentiate between approximately one billion colors, and it takes only 13 milliseconds of exposure to something for you to recognize it. As Carlos was mentioning, it takes time for us to learn what things are: by around six years of age we can recognize some 30,000 visual categories. That's not a few; it's quite a lot. I'm going to jump from what the eye sees directly to the brain, which is the most interesting part, and I'm going to simplify a lot for those who know how it works internally. Once the electrical impulses arrive at the brain, the first stage takes care of processing the edges, the simplest part of the image. The second stage, a bit more complex, tries to find features. Then, out of those features, it composes shapes. And when you have seen millions of shapes in your life, and your parents have told you "this is a truck, this is a cat", then you can recognize things. That's the fourth stage. We invented nothing. Velcro was invented thanks to a Swiss engineer who noticed how burrs from the burdock plant got attached to his dog's legs when arriving back home. Sonar was improved by looking at how dolphins hunt among a lot of bubbles, a lot of noise. For the bullet train, the bullet seemed like the fastest thing on Earth, but it wasn't the most aerodynamic shape; the kingfisher had a better shape for acceleration, so we got inspired by that bird. And then adhesive tape: we looked at the paws of the gecko, where millions of rows of tiny hairs give it the ability to stay on ceilings and walls, and that's how we came up with gecko-inspired tape. We invented nothing. You've probably seen this diagram many times throughout yesterday and today; the professionals here, maybe many times before. But there are a lot of similarities between what you're seeing here and what you saw before. Actually, the first part is the same: it's about edges. The second part of the convolutional neural network (I haven't said what it is yet, but I suppose you know already) tries to detect the features, then the shapes, and then comes the recognition, the part that tells the user what it is seeing.
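To make that stored matrix concrete before going deeper, here is a minimal Python sketch; it assumes Pillow and NumPy are installed, and rook.jpg is just a placeholder file name:

```python
# A minimal sketch: an image is just a matrix of numbers, one
# red/green/blue intensity triplet per pixel.
from PIL import Image
import numpy as np

img = np.array(Image.open("rook.jpg").convert("RGB"))

print(img.shape)  # e.g. (480, 640, 3): height x width x (red, green, blue)
print(img.dtype)  # uint8: each intensity is a number from 0 to 255

# "Zooming in" is just slicing the matrix: one pixel is three numbers.
r, g, b = img[100, 200]
print(f"pixel at row 100, column 200: red={r}, green={g}, blue={b}")
```

Everything that follows in the talk starts from arrays like this one.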
So let me explain a little bit more about convolutional neural networks, very simplistically. We feed an image in at the beginning; remember those pixels, that huge array of numbers? In the very first part of the neural network, we try to recognize the simple edges and shapes. The next part, after filtering the image, takes care of higher-level features, more abstraction. And at the very end, the neural network says: hey, this is a rook, because I've seen a lot of rooks and they are more or less the same shape. This seems very simplistic, but there is a lot we can do with it; I'm just going to enumerate a few tasks. We can do image classification, a very popular task that has been done for a few years now: detecting the main object in the image we send to the neural network. For example, there is clearly a balloon in this image. We can do object detection, which has gained a lot of popularity in the last ten years: we want to know not only the protagonist of the image but also the other things in it. So in this case, not only "balloon", but "there are seven balloons in this image, at these locations". Then, semantic segmentation: hey, machine, tell me all the pixels where there is a balloon. Or keypoint detection: we might be interested in the orientation of an object, or in recognizing facial expressions, or even recognizing people. And last, though there are many more, we can also do instance segmentation: the machine outputs something like "there are seven balloons at these locations", and we can delineate every single instance separately. Pretty interesting for self-driving cars and so on. So what do we need? First, we have to define the task we want to solve, obviously. Then we need the data, and it has to be labeled data: a historical set of examples that tells the machine what we are trying to get it to figure out for us. For example, for semantic segmentation we would need the x-y coordinates of each polygon that composes each balloon. Once we have all this data, we train a model. That means we send all this information to the convolutional neural network, or something similar, and the model learns from those examples from the past. We evaluate it over and over again, many millions of times, until it learns something. And then we can predict on unseen images: this is also a balloon, and we expect the machine to predict that. Sometimes it seems very easy to know what something is. This is my cat, the canonical example: the cat is there, a pretty obvious shape, the protagonist of the picture, a clear background, a smooth and complete silhouette, a good angle, good lighting. No problem. These are the typical examples we saw a few years ago with cats and dogs; CNNs worked perfectly, 99% accuracy, everything was fine. Not a hot dog: you have probably seen this application, from the show Silicon Valley. Obviously not a hot dog. Obviously a balloon. But these other images are also cats, hot dogs, and balloons, and a model that detected the easy hot dog probably wouldn't detect this cat, nor this balloon. They are pretty interesting examples, because these images are completely obvious to our human vision system.
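For reference, here is that whole pipeline, from pixel matrix to class label, as code: a minimal PyTorch sketch where the layer sizes and the two classes (balloon versus not balloon) are illustrative assumptions, not the actual models behind the examples in this talk.

```python
# A minimal CNN sketch: early convolutions pick up edges, deeper ones
# more abstract features and shapes, and a final linear layer recognizes.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 2):  # e.g. balloon / not balloon
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low level: edges
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid level: features
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # higher level: shapes
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)       # final: recognition

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

model = TinyCNN()
batch = torch.randn(4, 3, 128, 128)  # four fake RGB images, 128x128 pixels
print(model(batch).shape)            # torch.Size([4, 2]): one score per class
```

Training such a network means showing it the labeled examples over and over and adjusting its weights until the scores match the labels.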
But when it turns out we're dealing with a matrix, it's not that simple. Perception can get tricky, like in the examples we saw before. I'm going to show eight problems here; there are many more, but these are the main ones. You can have problems with lighting: remember the image of the dress whose color challenged all of us? What's the color of the dress? Wow, I see blue. Okay, that one takes it to the extreme, but in general it happens more often than we think. Deformation: the balloons from before. Position: you can clearly see, I guess, that this is a keyboard seen from the side, but if the machine has never seen keyboards from the side, it will find it quite complicated. Then backgrounds, movement, camera placement, scale, and occlusion, where some things are in front of others. It's so difficult that Google sends us these pictures to validate that we are humans; and what we are doing there is telling them where the roads are, and that becomes their training set. Incredible, eh? Sometimes it's so hard that it takes ages for computer vision to tell where the muffin is and where the chihuahua is. Quite difficult as well. And sometimes it's impossible. The takeaway from this part is that we should not expect the computer to output something we cannot teach it, okay? We cannot label something here, put a bounding box around it and say "hey, this is something": we have no clue what Bosch was trying to paint there. He just painted something cool. Some examples, Carlos? Can you show me something cool? Yes, in some industries. And yes, we invented nothing; Bosch's paintings are incredible. Let's see some real examples. We will talk about media and software-as-a-service, things we have done ourselves, with our learnings and takeaways; later, surveillance, with a moral reflection about what could happen if this is taken to the extreme; and finally a look at other industries. So, media. Brands are everywhere, and half of your advertising spend goes into the trash, but you don't know which half, right? This is a project to build a real-time brand detector for market research. What was the business problem? The first question is: what are my competitors doing with their marketing spend? With an answer to that, we can build a competitor campaign monitor, with viewability and auditability for ads. The data set this company had is a bit particular, because they started as an online survey company: you join and answer surveys with them. The evolution of that market research business is to track people's browsing. What does that mean? Imagine that while you're browsing, somebody can track and record what you are doing, and see which logos, banners, and other things you are exposed to. They have a worldwide panel of over 50,000 panelists. So we get all this data, and we have to build a solution, an alarm system that detects when a panelist is impacted by one of your competitors' brand logos. And we wanted it fast and accurate. So we built a real-time image algorithm, with some requirements: it has to adapt to different inputs. Somebody may be browsing, which is the clear case, but then a video starts playing and you want to understand what's inside the video too. And we had to create many thousands of artificial logos. This is what our daily work looks like. It's not as dark as it seems; it's a little brighter.
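Creating those thousands of artificial logos is typically done with data augmentation: taking each clean logo and generating randomly rotated, recolored, relit, blurred, and distorted variants. A minimal sketch with torchvision transforms, where logo.png, the output folder, and the parameter values are illustrative assumptions:

```python
# A minimal data-augmentation sketch: one clean logo image becomes
# thousands of artificial training variants.
import os
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),                    # different angles
    transforms.ColorJitter(brightness=0.5, contrast=0.5,
                           saturation=0.5, hue=0.1),          # lighting and color
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)), # blurry versions
    transforms.RandomPerspective(distortion_scale=0.4),       # different shapes
])

logo = Image.open("logo.png").convert("RGB")  # placeholder file name
os.makedirs("augmented", exist_ok=True)
for i in range(1000):  # each pass applies a new random combination
    augment(logo).save(f"augmented/logo_{i:04d}.png")
```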
And this is what we gave to the machine: logos with different shapes, blurry versions, different angles, different lightings, and different colors inside the logo. We understand at a glance what a Hewlett-Packard logo is, but the machine has to see it many, many times. So what did the solution look like? Here on your left you have a corporate video of HP people having fun in a pool. What the algorithm is doing is predicting, in each frame, where a logo could possibly be found. You can think of a video as a set of consecutive images, so you make a prediction at each one; here it picks up the logo on the hat and sketches it. They look like they're having fun, and this is advertising. So in each frame it looks for a logo and draws the bounding box, as Kike was mentioning in the balloon example before. Lessons we learned from this project. First, from a business perspective, APIs are a great way to start, but you have to take the future cost into account: with an API you share your data and you don't own your solution. Still, it's a good way to start. And as Oscar was mentioning on the big stage, if you're using the same technology everyone else is using, you don't have any competitive advantage. Second, one algorithm can achieve great things in the right context; that's what we saw here. From a more technical perspective, we had to take care of real-time prediction: 60 frames per second in HD means making a prediction on a large image 60 times every second, a very different regime from before, so you have to optimize your systems. And real-life images are very complex. Reality is very complex; we have been trained for years to understand it, so we have to have some empathy with the machines. The second example is a software service from the automotive industry. I think this example is a bit beautiful, from our perspective. Why? We had a US second-hand car company that started expanding. As you can imagine, the second-hand car industry in the US is a big one; we see it in the movies, people with pickups in Texas. What happened is that the company got more and more successful. They automate the ads for car dealers, building automatic campaigns on Facebook and other places. As you may know, this is not a very digitalized industry, so for the first clients they tagged things manually, and they had 3,000 tagged images. As the company grew, they had 300,000 car images, and the number kept growing and growing. So they had two choices: one, scale by hiring 100 more people to do repetitive tasks; or two, train machines to do these repetitive, easy tasks. In this context it was very clear that we needed color detection algorithms and brand detection algorithms. There is one big problem in second-hand sales: you have the catalog photo, the classic BMW with a mountain and some trees, and you have the dealer photo taken by a human. If you use the catalog photo in the advertising, nobody will click it. And the procedure here was human and machine in the loop; that's the good part. So that was the output: different algorithms to classify the images they have.
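Going back to the video demo for a moment: since a video is just consecutive images, the detector simply runs once per frame. A minimal sketch with OpenCV, where detect_logos is a hypothetical stand-in for the trained model and the video file name is a placeholder:

```python
# A minimal frame-by-frame sketch: read a video, run the detector on each
# frame, and draw a labeled bounding box for every predicted logo.
import cv2

def detect_logos(frame):
    """Hypothetical detector: returns a list of (x, y, w, h, label) boxes."""
    return []  # stand-in for a real model's predictions

cap = cv2.VideoCapture("hp_corporate_video.mp4")  # placeholder file name
while True:
    ok, frame = cap.read()
    if not ok:  # end of the video
        break
    for x, y, w, h, label in detect_logos(frame):
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, label, (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imshow("brand detector", frame)
    if cv2.waitKey(1) == ord("q"):  # press q to stop
        break
cap.release()
cv2.destroyAllWindows()
```

At 60 frames per second, each detect_logos call has a budget of roughly 16 milliseconds, which is why the optimization mentioned above matters so much.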
Automating this was a huge advantage: you don't have to hire those 100 people as you scale, and you can add great features to your product. This is a photo taken by a human; but imagine that somebody advertises a second-hand car with an image like this one, nobody is going to click it. So what are the business takeaways from this project? First, a CTR increase above 15%, because we clearly chose the right image to show. And second, the usability of the product: no more missing information, because as soon as somebody uploads an image, all the information is filled in on the fly. On the modeling side: you can, and we always encourage you to, use public resources; images are everywhere, starting with a Google search. You can complement the data you have in-house with external data, and you can also buy more data. Also, there are general-purpose architectures out there that you can tune, but sometimes they achieve a lower score on your specific problem; in our case, a custom network scored 10% higher. So we highly recommend researching the MIT and Stanford papers and looking for the right architecture for your own problem. It helps a lot. The third example is not our own work, okay? And I highly recommend reading the article it comes from. What we are seeing here is surveillance. From a computer vision perspective, we see an algorithm detecting humans, a box around each person, and a number at the top right, which is the algorithm's confidence score. This is happening right now. So what is the business problem here? We want to control dangerous activities and people harming others in our daily lives; we want a safety system for good citizens. Who establishes who is good or bad is a philosophical question I'm not going to get into. The data set here is a closed CCTV system deployed in China, covering 1.4 billion people. And the setup, building on the examples we saw with the balloons: a real-time solution; keypoint detection on faces, so you can identify who each person is, because there are a lot of people; bounding boxes around humans; and tracking while they walk. This is no longer science fiction. If you look at the example on the left, a camera is following a woman and tracking her face while she walks. She gets into her car, and the plates are recorded. And here she's doing her groceries: if you look closely, when she buys some Doritos, her score gets updated. This is no longer science fiction. One of the biggest challenges we will face in the coming years, based on this technology, is digital dictatorships. And we are not joking. Sometimes we get a bit worried about GDPR, but if you look at it with perspective, this is something we really have to take into account. Computer vision is a great, great technology with which we can build amazing things, but as we all know, with great power comes great responsibility. And computer vision can impact many more industries. In healthcare, we have predictive diagnosis. As data scientists, we have our own Super Bowl: competitions on platforms, and most of you will know Kaggle. Last year's competition asked: can you predict, from these patients' scans, whether they will have cancer in the future?
So imagine all the things we can do with machines analyzing these stacks of CT scans, and even assisting with cameras in surgery. And the great point is that the computer did it better than the doctors, predicting whether a person would have cancer two years ahead. Why? Because in a small group of pixels it can see patterns that are almost imperceptible to the human eye. Second, automotive. This is the most mainstream application we have right now, because we see it running down the streets, and maybe we feel a bit threatened by self-driving cars; though I think we humans are a bit imperfect while driving. And what data sources do these cars have? The dash cams, the Google Street View information, and the CAPTCHAs that Enrique was mentioning. And there are many more applications coming, because cars will start talking to each other, flagging extreme weather conditions, flooded roads, et cetera. Next industry: marketing and content understanding. Social networks are more visual than ever; people barely read, and the content that reaches us is more visual than ever. Facebook was saying that video is the new king. Yes, but who is going to analyze all those videos? Who is going to watch all of them at scale? Nobody can. One minute on YouTube means hundreds of hours of uploaded video; who is going to check what people are putting up there? Machines, obviously. But we also have other use cases, like retargeting based on web content: web content is something visual that we perceive, and we can use it to do better retargeting of our audience. And last but not least, claims automation in insurance and banking. Imagine you have a crash, you take a photo, and automatically you know where the damage is and what it will cost; there are companies doing this already. It's very important for insurance companies, because it helps them tackle fraud: a picture taken at that very moment is harder to fake. So just think: wherever there is a camera, or a human looking at something, there is an opportunity for computer vision. And Kike, what are the key takeaways, and what does the future look like? Thank you, Carlos. Okay, so now you might be thinking about your own use case in your company: hey, I have some images, what can I do with them? What are the thinking steps, the mental model? First, let's think about the framework. A typical question a lot of clients ask us is: should I use an existing API, or should I use something open source and create my own product? This is a tough one, and it's very subjective, but what we honestly say is this: if you are just going to play around a little and run your tests, go for the existing API. But once you scale, once you want to customize your own thing, and if what you are doing with images is core to your business, then go for your own project. And honestly speaking, the best projects using computer vision are open source. The second question, more common in the past few years, is: should I set up my own servers, or should I use the cloud?
Well, again: if you have regulations, or cases where your information should not be flowing around the internet and you want to avoid a pile of legal work, then go for your own servers; otherwise, rent, avoid the cost and the maintenance, and go for the cloud. The second thing is the data. Rule of thumb: a few hundred images per label. Remember, a label in our case was "balloon" or "cat". So we need, say, 500 images per label, and they cannot all be the same image: we need different contexts and different situations, to avoid the typical visual mistakes. If you have multiple labels, all of them should be well represented. Also, training data should be as similar as possible to what the machine is going to see in reality, because computer vision is not amazingly good yet. Let me give you an example: you have a security camera placed somewhere and you want to do something with its footage. If the camera is not going to move around, then try to use training data taken from that same camera. The third thing is the model. As I said before with the Bosch painting, machines can only do what humans can teach them. This is important to notice; let's not have inflated expectations. And another thing, which Paco Nathan said at Big Data Spain last year: human in the loop, very, very important. We should be thinking about semi-supervised processes. Not everything is automatic all the time, but let's not do everything manually either, as in past centuries. Let's give feedback to the system, let's look at the weird and difficult cases, and let's use that data to retrain the models.
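A minimal sketch of that human-in-the-loop idea: trust the model only above a confidence threshold, route the weird and difficult cases to a person, and keep the corrections as labeled data for the next retraining round. The threshold value and the predict and ask_human functions are illustrative assumptions:

```python
# A minimal human-in-the-loop sketch: confident predictions pass through,
# hard cases go to a reviewer, and the corrections feed the next retraining.
from typing import Callable, List, Tuple

CONFIDENCE_THRESHOLD = 0.9  # illustrative choice; tune it per use case

def triage(images: List[str],
           predict: Callable[[str], Tuple[str, float]],
           ask_human: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Return the (image, corrected_label) pairs collected for retraining."""
    retraining_set = []
    for path in images:
        label, confidence = predict(path)      # the model's guess and score
        if confidence < CONFIDENCE_THRESHOLD:  # a weird or difficult case
            label = ask_human(path)            # a person gives the real label
            retraining_set.append((path, label))
        # ...use `label` downstream either way...
    return retraining_set
```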
And last, very important: expectations. As I was saying before, don't have inflated expectations of what the machine can do. You need to train your teams, you need to train your company, and you need to let everyone know the things we've been saying here. Because when you have people over-trusting the technology, you end up in situations like the AI winter: there were a lot of opportunities and good ideas, but we could not execute them, and society lost faith in AI. And looking at the future: this is Amazon's first webpage. I like to relate this to computer vision because, at that moment in history, people had no idea what was going to happen with Amazon. You see something old, not very fancy, but hey, this is Jeff Bezos sitting in his first office, and today he is the richest man on Earth. So think about computer vision this way. Takeaways. I don't want anyone here leaving without keeping these five things in mind, okay? First, accurate computer vision comes from three things: the data, the models, and the computation. Second, there is no one-size-fits-all model or technique yet; that is what we are trying to achieve in the future. For now, we have different, specialized models for different tasks, so let's not try to do everything at once. Third, images are fundamentally ambiguous and complex. We are used to our human vision system, which is amazing, but let's not assume the computer can do the things we do with the two million working parts in our eye. Fourth, it is a great moment to start working with your images. There are many tools, many ideas, many tutorials out there, and everything, or almost everything, is open source. You've got everything you need; you even have free credits in Google Cloud, so you can go and play, or have someone on your team play, and so on. You should start doing it if you think it's important for your company. And fifth, last but not least: computers are going to be around us every single day in the future, and they're going to try to help us in every single task. For them to understand reality, they need to be able to see, so we need to teach them how to see. Thank you very much. We still have five minutes for questions. We have questions; there is one over there. Just one question, because we're running late. Okay. I was thinking during the presentation: you said that we see many millions of images in the first years of our life. But the thing is, humans don't learn from still images; humans learn from video, right? You see an image in action, with movement, and we relate that to emotions and to things that happen to us. So I would like to hear your opinion on whether we are already training with videos instead of images. I think you get the idea. Thank you. Really good question. This is my opinion, and I'm not a doctor or a neuroscientist, but I would say there is something even more important than video, and that is context. It's true that what we see is a series or sequence of frames, but we also have context: what happened before, what happened after. Even if we had a machine that captured only one frame per second, we would still have context. If I take a picture now, I see people; and if I have a 360-degree panoramic view, then I know I'm at an event, with a big screen and everything else. But machines focus on one single part of the picture and try to make sense of those multiple activities in isolation. So what you're saying is completely true: AI systems, now and in the future, aim to look at sequences of activities, which in the end means having context, knowing more about the frame you're capturing. That's very important. The current state-of-the-art algorithms for video are not doing that yet, at least not the ones we can see in production; they just go frame by frame, looking at the context within the picture, but not at the frames before or after. Okay, thank you. Thank you. Thanks. Thank you.