Well, first of all, thanks for coming. Last year, I was here talking about how to use actor-critic reinforcement learning algorithms to make malware undetectable by antivirus software. And this year, I have been lucky enough to be a speaker again, so thank you. Let me introduce myself. I'm Ruben Martinez, and I've worked as a security pentester, as an ethical hacker, for more than 10 years. But I've been interested in artificial intelligence since I was at the Polytechnic University. I'm currently working at Datahack, specifically in the labs area, developing a research project called DIAFORA. And what are we doing in DIAFORA? Well, we have to give intelligence to a humanoid robot. It's a Pepper robot; you may have seen it in the Datahack booth, or on TV, or maybe in MasterChef. The goal is for it to assist people with Alzheimer's disease. Some of the skills such a robot must have are autonomous navigation, speech-to-text, text-to-speech, emotion detection (combining voice tone, message meaning, and micro-facial expressions), and facial recognition capabilities. And this last feature, the facial recognition capability, is what has motivated this talk. The robot must be able to recognize a known patient, and it must have a quick method to learn the identity of a new patient using only one picture of that patient. This is a very important point.

So let's see the menu for this talk. We'll start by talking about how we have approached the problem of recognizing faces. Next, we are going to see how convolutional neural networks work; these algorithms will be the bricks with which we achieve our goal of identifying faces. At this point, we are going to see a very important thing: some pitfalls of convolutional neural networks. CNNs have become the standard way to develop artificial vision projects, so it's good to know what limitations we are going to find if we work with this kind of algorithm. In the third point, we are going to see how to make convolutional neural networks more robust and better at generalizing; one of the hardening techniques we have used to reach our goal is known as spatial transformer networks. Once we have seen all the basic tools we are going to work with, I'm going to present the neural network architecture we have used to develop this facial recognition model, and the loss function we have tried to minimize to carry out its training. And once we have seen all this theory, I'm going to show you a video of the robot using this facial recognition pipeline. In this video, the robot will have to recognize a person of whom it has only one picture of their face in its knowledge base, and that person won't appear in the training set of the facial recognition model. OK, this is the key. And you could ask, what is this knowledge base of the robot? It's only a folder in which we place one picture of the face of each person the robot will have to recognize, and we label each image with the name of the owner of that face. I would like to finish the talk by showing a new path that we are going to explore: spiking neural networks. Is there anyone who has ever heard about spiking neural networks? No? Yes, one, two, three. OK, so for most of you this will be the first time you hear about them. Good.
Well, this new path is where we believe artificial intelligence will have to advance if we really want to achieve what is known as artificial general intelligence, or strong AI. What we are doing now with deep learning models is known as weak AI: we can only solve very small problems in very limited environments. But spiking neural networks are more inspired by biological models based on neuroscience than deep learning models are.

So let's start. How have we approached the problem of recognizing a person's identity? We have followed this pipeline. The robot captures, using its frontal camera, an ASUS Xtion PRO that also captures depth, an image of the scene in front of it. We use a machine learning model to detect each face present in that image, and we make a crop of that area. We pass this crop to the convolutional facial recognition model, which is in charge of assigning a numerical representation to that face. This numerical representation is known as an embedding. Our model then has to compute the distance between the embedding of that face and the embedding of each face present in the knowledge base of the robot. Once we have done that, the network has to determine which embedding of the knowledge base is the closest to the embedding of the face captured by the robot's camera. If that distance is less than a certain threshold, the model returns the identity associated with that closest embedding; otherwise, the model returns the value "unknown identity". This is the pipeline.

At this point, I would like to mention one thing. Everything we are doing here, the fact that we have to use a lot of methods to make our convolutional neural networks more robust, such as data augmentation techniques, transfer learning (which is a must if you work with convolutional neural networks), spatial transformer networks, or modifications of the convolution operator so it handles more kinds of transformations, all of this is to ensure that the data distribution of the training set of the facial recognition model is as similar as possible to the data distribution the model will find in real environments when it has to make predictions. You might think that these kinds of algorithms have poor generalization capacity, and you would be right. Because of that, we still need big data if we want to work with supervised learning using convolutional neural networks.

To detect each face present in an image, we didn't want to train a deep learning model. Instead, we decided to use a supervised machine learning algorithm that is quite robust at this task: Haar cascades. This model receives an image and, for each face present in that image, returns the coordinates of the bounding box that surrounds that face. I'm not going to explain this algorithm because the time for the talk is limited, and I prefer to delve into the part of the pipeline where we have used deep learning models. So, as I said before, next we make a crop of the area of the bounding box that surrounds each face, and we pass this area to the facial recognition model. This facial recognition model is based on a convolutional neural network, so let's see a brief introduction.
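To make the pipeline concrete, here is a minimal sketch of the detection and decision steps. OpenCV's Haar-cascade face detector is a real API; get_embedding is a hypothetical wrapper around the convolutional facial recognition model, and the threshold value is illustrative, not the one used in the project.

```python
# Minimal sketch of the pipeline: detect faces, embed each crop, and compare
# against the knowledge base. get_embedding() and the threshold are assumptions.
import cv2
import numpy as np

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def identify(frame, knowledge_base, get_embedding, threshold=0.6):
    """knowledge_base: {name: embedding}, one reference embedding per known person."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    results = []
    for (x, y, w, h) in faces:
        crop = frame[y:y + h, x:x + w]                 # bounding box returned by the cascade
        query = get_embedding(crop)                    # embedding of the captured face
        name, dist = min(
            ((n, np.linalg.norm(query - e)) for n, e in knowledge_base.items()),
            key=lambda t: t[1])                        # closest embedding in the knowledge base
        results.append(name if dist < threshold else "unknown identity")
    return results
```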
Well, convolutional neural networks combine different types of layers, and among them we can find, for example, the convolutional layer and the pooling layer. So how does the convolutional layer work? First of all, let's see how to represent an image. We use a matrix of h pixels of height, w pixels of width, and c color channels. For example, a grayscale image will only have one color channel, while a color image will be represented using red, green, and blue channels. In the convolutional layer, we slide this orange matrix over the original image, the green matrix, by one pixel at a time; this step size is controlled by a hyperparameter called the stride. For each position, we compute the element-wise multiplication between the two matrices and add up the products to get a single value, which will be one element of the output matrix. In this terminology, the orange matrix is known as a filter, kernel, or feature detector, and the output matrix is known as the convolved feature, feature map, or activation map. In practice, a convolutional neural network learns the values of these filters during its training process using backpropagation. The point is that we still need to provide some hyperparameters before training, such as the number of filters, the filter size, the stride, and the padding, which consists of filling the edges of the image with zeros. And here, you can see the equation to compute the feature map size using the values of those hyperparameters. Summarizing: the more filters you have, the more image features get extracted, and the better your convolutional neural network becomes at recognizing patterns in unseen images.

In this picture, we can see an important property of convolutional neural networks, which is hierarchical learning. I mean, the filters of the first layers detect simple patterns like lines, borders, and corners, and the filters of the final layers detect complex patterns built on the knowledge of the previous layers. This is the key point of this slide.

Now it's time to talk about the pooling layer. The pooling layer reduces the dimensionality of each feature map but retains the most important information; it's like a generalization technique built into the convolutional neural network. We can find different types of spatial pooling, for example max pooling, average pooling, and sum pooling. In the case of max pooling, we define a spatial neighborhood, for example a 2x2 window, and we take the largest value of the feature map within that window. Instead of taking the largest value, we could have taken the sum or the average of all the elements within that window; in those cases we would have used sum pooling or average pooling. Next, we slide our 2x2 window and take the maximum value for each region. OK?

At this point, I'm going to introduce the concept of the receptive field, which is also present in neuroscience, in our visual cortex. What is the receptive field of a neuron? It's simply the region of the stimulus space that causes the firing of that neuron.
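As a rough illustration of the sliding-window arithmetic and the pooling step described above, here is a minimal NumPy sketch. It assumes a single channel and square filters, and it is not the implementation used in the project.

```python
# Minimal NumPy sketch of the convolution arithmetic described above.
import numpy as np

def feature_map_size(w, f, p, s):
    """Output width for input width w, filter size f, padding p, stride s."""
    return (w - f + 2 * p) // s + 1

def conv2d(image, kernel, stride=1):
    """Naive valid convolution (really cross-correlation, as in most CNN libraries)."""
    h, w = image.shape
    f = kernel.shape[0]
    fmap = np.zeros((feature_map_size(h, f, 0, stride),
                     feature_map_size(w, f, 0, stride)))
    for i in range(fmap.shape[0]):
        for j in range(fmap.shape[1]):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            fmap[i, j] = np.sum(patch * kernel)   # element-wise product, then sum
    return fmap

def max_pool(fmap, k=2):
    """k x k max pooling: keep only the largest activation per window."""
    h, w = fmap.shape[0] // k, fmap.shape[1] // k
    return fmap[:h*k, :w*k].reshape(h, k, w, k).max(axis=(1, 3))
```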
And here, you can see the equation to compute the size of the receptive field in layer k, assuming that the size of the receptive field in the first layer is 1. But coming back to the topic of the pooling layer, it has a few downsides which make it a somewhat undesirable operator, we could say. For example, the pooling layer is destructive: it discards a lot of feature activations, and because of that we lose exact positional information. And positional information is invaluable when we are working with visual recognition tasks. Another limitation of the pooling layer is that, with a small receptive field, the effects of the pooling operator are felt only towards the deeper layers of the network, where the size of the feature maps is usually small. Because of that, intermediate layers may suffer from large input distortions. And we cannot increase the size of the receptive field arbitrarily, because that would downsample our feature maps too aggressively and we would lose a lot of positional information.

So here is where spatial transformer networks come into play, to make convolutional neural networks more robust and better at generalizing. A spatial transformer network is a differentiable module that learns how to apply different types of affine transformation to the feature map where it is inserted, and this helps to remove spatial variance. This is the key point of the technique. We could define a spatial transformer network through three characteristics. First, a spatial transformer network is modular: it can be inserted anywhere in your network with relatively small tweaking. Second, a spatial transformer network is dynamic: it applies a different transformation to the feature map it is injected into for each input example, as opposed to the pooling layer, which acts identically on all input examples. The third characteristic, and a very important point, is that a spatial transformer network is differentiable. Because of that, it can be trained using backpropagation and offers us the ability to perform end-to-end training of the main network it is injected into. In our case, the main network will be the facial recognition network.

A spatial transformer network has three components: a localization network, a grid generator, and a sampler. The objective of the localization network is to return the parameters theta of the affine transformation that will be applied to the feature map where the spatial transformer network is injected. And what is an affine transformation? It's any transformation that preserves collinearity and the ratio of distances; for example, the midpoint of a line segment will still be the midpoint of that segment after the transformation. We can find different types of affine transformations, for example rotations, translations, scalings, and transvections. Rotations, scalings, and transvections are linear transformations, and a linear transformation can be applied to a point P of coordinates x and y using a matrix multiplication. But a translation is not a linear transformation, because it doesn't leave the origin fixed; if we want to apply a translation to a point P of coordinates x and y, we have to use addition.
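Written out explicitly, applying the linear part and then the translation to a point P of coordinates x and y looks like this; the parameter names a, b, d, e, c, and f are the ones used in the next part, and the homogeneous-coordinate form on the right is just the standard way of packing both operations into one matrix:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a & b \\ d & e \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} c \\ f \end{pmatrix} = \begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix}\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$$

So any rotation, scaling, transvection, and translation of the plane fits into these six parameters, which is exactly the vector theta that the localization network regresses.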
In the example on the slide, we can see how we apply a linear transformation, given by these four parameters a, b, d, and e, to a point P of coordinates x and y. This linear transformation could be a rotation, a scaling, or a transvection. After that, we apply a translation given by these two parameters, c and f. Summarizing, we can represent all of these affine transformations using six parameters. This is the most important thing in this part.

Coming back to the localization network: its input will be the input feature map of shape h, w, and c, and its output will be the vector theta of six parameters, the six parameters of the affine transformation. The architecture we use in this place could be a multilayer perceptron or a convolutional neural network, ending with a regression layer of six neurons. Here, we can see how to code this localization network using a multilayer perceptron, in this case with two layers. You can see that the final layer, the second one, has six neurons, and the bias of this final layer is initialized with the identity transformation.

Let's see the next component, the grid generator. The job of the grid generator is to return a parameterized sampling grid, which is a set of points (x^s, y^s): the points where the input feature map should be sampled to produce the desired output feature map. To obtain that parameterized sampling grid, the grid generator first creates a normalized meshgrid with the same shape as the input feature map; the input feature map is called U in the image. This meshgrid has equally spaced values between -1 and 1 in each axis, and it is represented with the notation (x^t, y^t). Then we take the output of the localization network and multiply it by this normalized meshgrid. The point is that the values of the parameterized sampling grid, (x^s, y^s), may be fractional, and because of that we need something more. That something more is known as the sampler. In this slide, we can see how to code this normalized meshgrid with equally spaced values between -1 and 1 in each axis.

And here we can see the sampler. The objective of the sampler is to take the input feature map and the parameterized sampling grid, which holds the coordinates where the input feature map should be sampled, and to produce the output feature map, called V. The sampler does this using bilinear interpolation. The interpolation step is necessary because, as I said before, the output values of the parameterized sampling grid will usually be fractional, so they won't correspond exactly to pixels of the input feature map, and we need to perform this kernel interpolation. In this formula, we can see how to compute the output value V of the pixel at position (x_i^t, y_i^t) in channel c, where U_nm^c is the value of the input feature map at position (n, m) in the same channel c.
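Putting the grid generator and the sampler together, a minimal NumPy sketch could look like this. The shapes, the normalization convention, and the bilinear kernel are illustrative, not the exact code shown on the slide.

```python
# Sketch of the grid generator (normalized meshgrid mapped by theta) and a
# bilinear sampler. theta is the 2x3 affine matrix from the localization network.
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Return source sampling coordinates (x^s, y^s) of shape (out_h, out_w, 2)."""
    xt, yt = np.meshgrid(np.linspace(-1, 1, out_w), np.linspace(-1, 1, out_h))
    grid = np.stack([xt, yt, np.ones_like(xt)], axis=-1)   # homogeneous target coords
    return grid @ theta.T                                   # apply the affine parameters

def bilinear_sample(u, xs, ys):
    """Sample the input feature map u (H, W) at fractional normalized coords."""
    h, w = u.shape
    x = (xs + 1) * (w - 1) / 2          # map [-1, 1] back to pixel coordinates
    y = (ys + 1) * (h - 1) / 2
    x0f, y0f = np.floor(x), np.floor(y)
    wx, wy = x - x0f, y - y0f           # fractional parts drive the interpolation
    x0 = np.clip(x0f.astype(int), 0, w - 1)
    x1 = np.clip(x0f.astype(int) + 1, 0, w - 1)
    y0 = np.clip(y0f.astype(int), 0, h - 1)
    y1 = np.clip(y0f.astype(int) + 1, 0, h - 1)
    top = (1 - wx) * u[y0, x0] + wx * u[y0, x1]
    bot = (1 - wx) * u[y1, x0] + wx * u[y1, x1]
    return (1 - wy) * top + wy * bot

# With the identity transformation, the sampler reproduces the input feature map.
theta_identity = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]])
u = np.arange(16, dtype=float).reshape(4, 4)
grid = affine_grid(theta_identity, 4, 4)
print(np.allclose(u, bilinear_sample(u, grid[..., 0], grid[..., 1])))  # -> True
```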
Next, we apply the interpolation kernel k to the values of the parameterized sampling grid coordinates, where Phi_x and Phi_y are the parameters of the interpolation kernel. As I said before, we usually use bilinear interpolation. Combining these three components, the spatial transformer network learns how to apply a different transformation to the feature map where it is injected, for each input example. This is the most important thing. And the main network, I mean the facial recognition network, learns that some object, for example a face, is the same regardless of whether it's rotated, scaled, and so on. OK?

In this slide, you can see an image of the neural network architecture that we have used to develop the facial recognition model, and the loss function that we have tried to minimize, which is known as the triplet loss. The triplet loss is a way to learn good embeddings for each face. In the embedding space, all the faces of the same person should be close together and form well-separated clusters. The objective of the triplet loss is to ensure that two examples with the same label have their embeddings close together in the embedding space, and that two examples with different labels have their embeddings far away. However, we don't want to push the embeddings of each label to collapse into very small clusters. The only requirement is that, given two positive examples of the same class and one negative example, the negative should be farther away than the positive by some margin. To formalize these requirements, we can say that the triplet loss works with triplets of embeddings.

We have used a ResNet convolutional neural network, and we have cloned this network twice, so we have three copies with shared weights. The input of the first network is the anchor, which is an example of some class. The input of the second network is the positive, which is a different example of the same class as the anchor. And the input of the third network is the negative, which is an example with a different label than the positive and the anchor. You can see that in this picture I have chosen a picture of my face as the anchor and a different picture of my face as the positive, and I have to apologize for choosing this picture of my partner and friend Alejandro as the negative. OK.

Here, we can see how to compute the value of the triplet loss, which is the maximum between 0 and the value of this expression: the distance between the embedding of the anchor and the embedding of the positive, minus the distance between the embedding of the anchor and the embedding of the negative, plus some margin. The most important part is that, as we minimize the triplet loss, the distance between the embedding of the anchor and the embedding of the positive gets closer to 0, and the distance between the embedding of the anchor and the embedding of the negative becomes greater than the distance between the anchor and the positive plus the margin.

Perfect. At this point, we are going to see all this theory working in a robot with real data. As I said before, the robot has a knowledge base in which we place one picture of the face of each person that the robot has to recognize.
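A minimal NumPy sketch of this loss, just to make the formula concrete; the embedding values and the margin are made up, and in training the loss would of course be computed and backpropagated by the deep learning framework rather than by hand.

```python
# Sketch of the triplet loss described above:
#   L = max(0, d(anchor, positive) - d(anchor, negative) + margin)
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """anchor, positive, negative: embedding vectors produced by the shared network."""
    d_ap = np.sum((anchor - positive) ** 2)   # squared distance anchor-positive
    d_an = np.sum((anchor - negative) ** 2)   # squared distance anchor-negative
    return max(0.0, d_ap - d_an + margin)

# When the positive is much closer to the anchor than the negative,
# the loss is zero and this triplet no longer pushes the embeddings.
a = np.array([0.0, 0.0]); p = np.array([0.1, 0.0]); n = np.array([1.0, 1.0])
print(triplet_loss(a, p, n))   # -> 0.0
```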
Additionally, we have trained our facial recognition model with a training set composed of approximately 5,000 pictures of each identity. Well, let's play the video.

This is the recording of the RE-FR-002 test, in which we are going to show the face recognition model recognizing a person who is not present in the training set. First, I'm going to explain the environment. What we have done is to train a neural network that generates an embedding function, so when an image reaches that neural network, the embedding function assigns it a value. That neural network has been trained in a supervised way with three types of images; let's see here what they are. We have three folders: in folder number 0 there are my images, in folder number 1 there are images of my colleague Alejandro, and in folder number 2 there are images of my colleague Javier. Additionally, the model has a database in which we place a single image for each person we want the model to recognize. In our case, we have an image of Alejandro, who is present in the training set, another image of me, who is also present in the training set, and finally an image of our colleague Rosa, who is not in the training set.

So let's describe the test. Our colleague Rosa is going to stand in front of the robot and say verbally the command that activates the face recognition model. The robot will then begin to capture images through its camera; those images are published on a ROS topic and arrive at the model. For each image that arrives at the model, it first verifies whether there is a face in that image, and if there is, it predicts which person is present in it. Finally, a text-to-speech model reproduces the name of the person predicted by the facial recognition model.

Very well, we are going to start the test with our colleague Rosa. Rosa, if you can say the word. "Recognize." Right now the model has accepted the command and is starting to capture images; in a moment, it will say the name of the person in front of it. As you have been able to check, it has already said the name. We are going to stop it and run it again to check. "Rosa." Now let's check the output of the model. Here you can see how the model has saved, for each image it has received, the name of the person predicted in that image, which is Rosa. Additionally, we publish that information on a ROS topic, called in this case pepper_utils_facename, so that other applications can make use of this information. That's the end of this test.

OK. This was a demo of this facial recognition model, with all of its limitations. I would like to finish the talk by arguing that we should move towards models that combine concepts like quantum computing and neuroscience. A first approach based on neuroscience could be spiking neural networks. Spiking neural networks work with spikes, which are discrete events that take place at points in time, rather than with continuous values. Essentially, when a neuron reaches a certain potential it spikes, and after that, the potential of that neuron is reset.
At first glance, this may seem like a step backwards, because we have moved from the continuous outputs of deep learning models to binary outputs in spiking neural network models. But in the end, spike trains offer us the ability to process spatio-temporal data, in other words, real-world sensory data. The spatial aspect refers to the fact that neurons are only connected to neurons local to them, and the temporal aspect refers to the fact that spikes occur over time. What we lose with the binary encoding, we gain with the temporal information of the spikes. OK?

One common model used in spiking neural networks is known as the leaky integrate-and-fire model, which is a simple description of a neuron. Real neurons have a membrane and a family of ion channels that control the flow of current across that membrane, which modulates the membrane potential, including the firing of action potentials. The leaky integrate-and-fire model implements four features of a real neuron. The entire cell has a single voltage, called V_m. It has a membrane with a capacitance called C_m, whose units are farads, or coulombs over volts. It has a leak channel that allows current to flow across the membrane with a resistance R_m, or, inversely, with conductance g equal to 1 over R_m. Without the application of an external current, the charge carriers traveling across the membrane are driven towards an equilibrium voltage called V_equilibrium. Action potentials are simulated when the voltage of the neuron reaches a certain value called V_threshold, and when the neuron spikes, its voltage is artificially reset to a reset value called V_reset. In this picture, you can see a diagram of the structure of such a neuron, together with the equation that describes how the voltage of the neuron changes over time in the presence of an externally applied current I_m, assuming that tau is equal to the capacitance C_m times the resistance R_m: the derivative of the voltage with respect to time is equal to that expression. And assuming a discrete time step, we can evolve that equation to work in discrete time.

After that, we should define some kind of learning rule, for example STDP, which is the acronym for spike-timing-dependent plasticity. This learning rule modifies the strength of the synapses based on the relative timing of the pre- and post-synaptic spikes. This is only a brief introduction to spiking neural networks; I hope that some of you are attracted to this topic. As the time for my talk has finished, I would like to encourage you to keep researching alternative approaches to artificial intelligence, to get more robust models. That's all. Do you have any questions? Thank you. If you want, we can talk later; I'll be in the Datahack booth, OK?
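As a rough illustration of these dynamics, here is a minimal discrete-time simulation of a single leaky integrate-and-fire neuron, using a simple Euler step for the standard form tau * dV/dt = -(V - V_eq) + R_m * I. The parameter values are made up, and this is only a sketch, not the project's code.

```python
# Minimal discrete-time sketch of a leaky integrate-and-fire neuron:
#   tau * dV/dt = -(V - V_eq) + R_m * I, with a threshold-and-reset rule.
tau, r_m = 10.0, 1.0                             # membrane time constant (ms), resistance
v_eq, v_thresh, v_reset = -65.0, -50.0, -70.0    # equilibrium, threshold, reset (mV)
dt, steps = 0.1, 1000                            # time step (ms) and number of steps
current = 20.0                                   # constant externally applied current

v = v_eq
spikes = []
for t in range(steps):
    v += (-(v - v_eq) + r_m * current) * (dt / tau)   # Euler update of the membrane voltage
    if v >= v_thresh:                                 # the neuron fires an action potential...
        spikes.append(t * dt)
        v = v_reset                                   # ...and its voltage is artificially reset

print(f"{len(spikes)} spikes in {steps * dt:.0f} ms")
```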