Welcome back. Proving who you are can be surprisingly tricky. In 2006 a terrible car crash killed one girl, Laura, and left another, Whitney, in a coma. At the hospital their identities were mixed up, and it was only when the girl thought to be Laura awoke from the coma, still covered in bandages, that her true identity was revealed. Today we are embracing electronic IDs, which solve a lot of these problems, but the creation of an electronic ID requires physical presence. Or at least it normally does, because our next speakers have developed an easy onboarding system that can be completed from a smartphone. To tell us more about this we have with us Jesus Alonso and Kais Dai, senior data scientists at Tree Technology. Welcome, guys. How are you? Hi. Hi. Fine, thank you. Great to see you both. You too. So if you're both ready, take it away. We're all ears.
Okay. Thank you very much, and thank you all for attending our talk. My name is Jesus, and together with my colleague Kais we are going to present this talk on document verification. As commented, we are both data scientists at Tree Technology, which, for those of you who don't know us, is an R&D-performing company providing information and communication technology solutions based on big data and artificial intelligence. We have a very strong R&D department in which we mainly work on Horizon 2020 and, in the future, Horizon Europe projects. We have taken part in all these projects, and all these as well. Our commercial brand, TreeLogic, helps organizations in key sectors to improve their information systems with big data and artificial intelligence technology, and on the commercial side we also have projects with all these clients. Having said that, this work has been done in the framework of the Impulse Project, which is one of those H2020 projects that I showed before.
The Impulse Project is a fairly large consortium composed of public administrations, universities, standardization bodies, technological centers and so on. More concretely, we are all these entities. The main objective of the Impulse Project is to perform a holistic and multidisciplinary evaluation of EIDs from different points of view: for example socioeconomic, legal, ethical, operational, etc. And this is not going to remain at just a research or theoretical level. We are also going to put it into practice with six case studies in five different countries, providing a variety of contexts. We'll talk a little bit more about that later. Along with all the points of view mentioned before there is, of course, the technological point of view, and this gives us the opportunity to use disruptive technologies in this project. The two main disruptive technologies that we're going to apply are blockchain and artificial intelligence. Blockchain we all know; it's a really hot topic now. It started as a technological means to enable the operation of cryptocurrencies, but it has a lot of different applications now. In this project we are going to apply blockchain to create a distributed ledger that will transform the personal data in our EIDs from being centralized and owned by governments to being decentralized and owned by citizens themselves. We are also going to apply smart contracts, which is another application of blockchain. Concerning artificial intelligence, which is what this talk is about, we are going to perform biometric authentication and also document verification in the digital onboarding. So let's start from the beginning. What is, in the first place, an electronic identity or EID? It's not as easy to define as it seems.
It can be defined broadly as a digital way of proving identity, analogous to how physical ID documents are an analog way of proving identity. And it can have different degrees of formality. A gym card, for example, is in itself an identity: it proves your identity as a customer of the gym and the services that you can use. That is not the same as the more formal types of identity that are used for official services and official purposes. Of course EIDs, like everything digital that we have now, have emerged under the umbrella of the digital revolution that society has been experiencing during the last decades. EIDs as an official form of proving identity are already present in many countries and online services around the world. We normally access our digital identity by means of something physical, such as a chip integrated into a physical card, or a password, or, in more recent types, biometric features such as fingerprint or face recognition. But how can we create a digital identity that doesn't exist yet, before we can use it? This is what the onboarding process is about. It's responsible for the creation of the digital identity. Currently there are two main onboarding methods into, let's say broadly, digital services. One of them is digital registration, which is normally done using an email address and a password. But this method is too weak for official purposes. It's no exaggeration to say that a person can have many different accounts in some services using different email addresses, and such accounts are not secure enough. On the other hand, we have physical onboarding, which is in itself robust and benefits from the classical security measures of IDs. But it implies going to an office and facing physical identification by, for example, a policeman or another authority.
We asked ourselves the question: what if we could have the benefits of both, without losing security and without being weak? That is, an EID onboarding that is easy to do and doesn't require travelling, but at the same time is robust and secure. And we came up with the idea of a new, fully automatic onboarding process based on taking pictures of physical ID documents. This has several advantages. As commented, it is as easy and comfortable as digital registration, and it can be done with only your smartphone. On the other side, there are also some disadvantages: we lose the security measures that physical ID documents themselves have, the ones that require you to move the card or to hold it up to the light to reveal some of these elements. These elements are, for example, holograms, elements that are visible only under ultraviolet light, elements with variable ink, microtext, and letters that move when you move the card. The loss of these security measures makes this process, in principle, sensitive to forgery and tampering of documents, allowing people with bad intentions to, for example, commit identity theft or spoofing. That's why we need document verification, and we're going to describe how we solve this problem. But first of all, this process has some caveats that we need to discuss. First, there is a really large variety of physical ID document models. There are several types of documents: national ID cards, passports, driving licenses, and other kinds of identification such as citizen cards. And within every type of card, every country or authority normally has different layouts. Even within the same country there can be different layouts and different kinds of information written on the card. This makes it difficult to validate all kinds of documents.
So we face the choice that we may have to select the types of documents with which this system is going to work, and at the same time we are going to make it adaptable, so that it can treat different documents without building a different system for each layout. Another issue is that, as the images are taken with the users' smartphones, the quality of the images is not uniform. An image can be dark or blurry, it can have a different resolution or a different angle, or it can contain a shadow. So we need a threshold for what we can analyze and what we cannot. This also pushes us to make the system even more adaptable. And finally, the most difficult of all these issues is the data. Finding data to train machine learning models for this kind of verification is really difficult. The data is extremely scarce and hard to obtain. First of all, due to privacy concerns and legislation, for example the GDPR in the European Union, personal information of course needs to be protected. As such, a publicly available data set of ID document images does not exist. So we had to build it ourselves. How can we do it? We need samples of ID documents provided by volunteers, respecting of course the legislation, the GDPR, and we need them to sign an informed consent in which we offer all the required security and privacy guarantees. Even so, there is a concern in society about giving away personal information. We have all heard news about identity theft or other kinds of crimes committed when someone had access to other people's personal information. And if we go to the case of samples of forged or tampered ID documents, access is even more difficult. It's impossible without access to a police or judicial database, and for the moment we don't have that. So we also have to come up with a creative solution for this extra trouble.
We are probably not going to be able to train binary classification models, because we won't have enough samples of both genuine and forged ID documents. So, as commented, we need to come up with a creative solution for that. So we have built a data set. How have we built it? We did it in three stages. The first stage was a data set of only Spanish national ID documents, passports and driving licenses, provided by volunteers from our company. Thank you; I thank all the ones who contributed to this, because it was the first step, and it allowed us to get to work and start doing things as soon as possible. For the second stage, we retrieved a data set of ID documents from the five countries where the case studies are going to be conducted, which are Spain, Italy, Bulgaria, Iceland and Denmark. They were provided by volunteers who work for the organizations that are members of the project consortium. And finally, for the last stage, we are going to retrieve a data set of ID documents from the same five countries, but provided by volunteers who participate in the pilots, in these case studies. There are going to be two rounds of these pilots, and we are going to ask the volunteers of the first round to provide us with a little more data. As for the forged ID documents, we needed to come up with something. So we decided to design and develop a simulator that modifies ID documents so that they present the characteristics of forged ID documents, which will allow us to build some kind of machine learning or classification models to distinguish between both. It is as accurate as we can make it. We have studied what forged and tampered ID documents normally look like, and we have applied that knowledge to our simulator. So what do we have to do when doing document verification?
We need to take into account two assessments. The first assessment is that the user who sends us her or his picture of the ID document must be the same person whose information appears on the document, to avoid spoofing. The second assessment is that the ID document image must not correspond to a forged document; it must be a genuine document. How are we going to do it? Now my colleague Kais is going to delve deeper into the technical aspects of the solution that we have designed.
Hi everyone. Thank you, Jesus, for the introduction, and let's move on to the technical part. Next slide please, Jesus. I'll briefly present some basic concepts of digital image processing, or image recognition. One of them is what we call key points and their corresponding descriptors. From each digital image we can get key points and the features of these key points. By features I mean the coordinates of each key point within the image, the size, the color transitions, and so on. So the key points and descriptors allow us to characterize a digital image and, more precisely, the content of this digital image. In my opinion, one of the most relevant techniques for getting the key points and descriptors is SIFT, the scale-invariant feature transform. It is also invariant to orientation. If we move to the next slide, one direct use of key points is finding matches between two images, and looking for similar key points within the same image. In the context of ID document verification and the Impulse solution, we first start by pre-processing an image. We have a reference image at the left, and we get the image from the user during the onboarding process. What we do first is find matches between the key points, using for example nearest neighbors as the matching technique, and then perform a perspective transform. This allows us to first crop the image.
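To make the nearest-neighbor matching step just described a bit more concrete, here is a minimal sketch in plain Python. It is not the Impulse implementation: the function name `match_descriptors`, the toy 3-dimensional descriptors and the ratio threshold of 0.75 are all assumptions chosen for illustration (real SIFT descriptors have 128 dimensions). The ratio test, which keeps a match only when the best candidate is clearly closer than the second best, is a common companion to nearest-neighbor matching but is not mentioned in the talk.

```python
import math


def match_descriptors(ref_desc, query_desc, ratio=0.75):
    """Match each reference descriptor to its nearest query descriptor.

    A match is kept only if the nearest neighbour is clearly closer than the
    second nearest one (ratio test), which discards ambiguous key points.
    Returns a list of (reference_index, query_index) pairs.
    """
    matches = []
    for i, descriptor in enumerate(ref_desc):
        # Distances from this reference descriptor to every query descriptor.
        dists = sorted((math.dist(descriptor, q), j) for j, q in enumerate(query_desc))
        if len(dists) < 2:
            continue
        (best, j), (second, _) = dists[0], dists[1]
        if best < ratio * second:
            matches.append((i, j))
    return matches


# Toy descriptors: hypothetical values, only to exercise the function.
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.5, 0.5, 0.5)]
query = [(1.0, 0.1, 0.0), (0.0, 0.1, 1.0), (0.9, 0.9, 0.9), (0.1, 0.1, 0.1)]

matches = match_descriptors(reference, query)
print(matches)  # the third reference descriptor is ambiguous and is dropped
```

In practice the matched key point coordinates would then be used to estimate the perspective transform between the reference layout and the user's photo. OpenCV, which the speakers mention later in the Q&A, provides `cv2.findHomography` and `cv2.warpPerspective` for that step.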
So we discard all the useless information and then put the image in the right orientation. The question we want to answer is whether an ID card, a passport or a driving license is forged or not. So let's imagine two simple forgery scenarios. The first one is a copy-move within the same image. Let's imagine someone who wants to change his birth date, to seem older or younger, in order to access some online services. He or she can copy a digit and use it to replace one in the birth date or in the validity date of his or her identity card. This is the first, let's say, forgery technique. Another simple technique is the imitation forgery. Imitation means introducing new characters or new digits into the ID document. Our objective is to figure out whether we have one of these forgeries in the photograph that the user is sending to us. Another concept we are using here is optical character recognition, or OCR. In order to extract text from a digital image, we are using OCR techniques. We combine two techniques: one using a state-of-the-art deep learning methodology to detect words, and then another technique, more focused on the characters, to get the bounding boxes of these characters. We are using OCR to focus, for example, on the machine-readable zone (MRZ), the zone in red. This code, after being read, can be transformed into a dictionary with all the information available in the ID document or in the passport. In the passport we also have the MRZ. So what we do first is compare the data that we have on the front side of the document with the MRZ code. Comparing both allows us to verify our first assessment, which is to check whether the user sending the data is the same person whose information appears in the photographed ID document. And now let's move to the second assessment, which is to detect forgeries, specific types of forgeries.
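The comparison between the printed fields and the MRZ described above can be sketched as follows. This is only an illustration under assumptions: the field names, the dictionary layout and the helper `cross_check` are hypothetical, and the talk does not say which checks the Impulse module actually runs. The check digit rule itself (repeating weights 7, 3, 1, letters valued 10 to 35, `<` valued 0, sum modulo 10) is the standard one for machine-readable travel documents.

```python
def mrz_check_digit(field: str) -> int:
    """Check digit for an MRZ field (weights 7, 3, 1 repeating, sum mod 10)."""
    weights = (7, 3, 1)
    total = 0
    for position, char in enumerate(field):
        if char in "0123456789":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10  # A=10 ... Z=35
        elif char == "<":
            value = 0  # filler character
        else:
            raise ValueError(f"invalid MRZ character: {char!r}")
        total += value * weights[position % 3]
    return total % 10


def cross_check(printed: dict, mrz: dict) -> list:
    """Return the names of the fields that fail the consistency check.

    `printed` maps field name -> text read by OCR from the front side.
    `mrz` maps field name -> (value, check_digit) parsed from the MRZ.
    Both layouts are assumptions made for this sketch.
    """
    failures = []
    for name, (value, check_digit) in mrz.items():
        if mrz_check_digit(value) != check_digit or printed.get(name) != value:
            failures.append(name)
    return failures


# Sample values for illustration only.
mrz_fields = {
    "document_number": ("L898902C3", 6),
    "birth_date": ("740812", 2),
    "expiry_date": ("120415", 9),
}
genuine = {"document_number": "L898902C3", "birth_date": "740812", "expiry_date": "120415"}
tampered = dict(genuine, expiry_date="180415")  # expiry date edited on the front side

print(cross_check(genuine, mrz_fields))   # no failing fields
print(cross_check(tampered, mrz_fields))  # the expiry date no longer matches the MRZ
```

A mismatch here only says that the two zones disagree. It does not say which of the two was altered, which is one reason the talk combines this check with the image-based techniques that follow.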
Coming back to the key points, we are going to focus only on key points in specific fields, making what we call a restriction of key points. We focus only on the key points inside specific bounding boxes, the ones in blue on the front side. After focusing on these key points, we look for similarities between them in order to detect copy-move forgeries. In this context we can use a clustering technique like DBSCAN, which stands for density-based spatial clustering of applications with noise. This technique allows us to detect similar key points and to find copy-move forgeries within the same photograph. A brief explanation of DBSCAN: it has two parameters. The first is the radius, which is the epsilon. The radius tells us whether a point belongs to the cluster or not: if the distance between two points is below epsilon, the point is added to the current cluster. The minimum number of points is the second parameter of DBSCAN, and it tells us whether a group of points is a cluster or noise. If the group has fewer than the minimum number of points, it's noise; otherwise, it's a cluster. So in this case, in the first photograph, after applying this copy-move forgery detection technique, we detect that the number eight in the birth date has been copied and pasted into the validity date. In the genuine image the validity is until 2025; however, in the forged document it is until 2021. And here we can say that we have partially solved the second assessment by combining key point extraction, the restriction to specific text regions, and then a clustering technique like DBSCAN. The other scenario is the imitation forgery, and we will try to explain how we are detecting these forgeries. We built a data set based on the morphology of the characters.
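The DBSCAN description above, with its two parameters, can be illustrated with a small plain-Python version of the algorithm. This is a sketch, not the project's code: the toy 2-D points stand in for key point descriptors, and the values `eps=1.5` and `min_pts=3` are arbitrary assumptions. The talk does not specify how the key points are fed into the clustering, and a production system would more likely rely on an existing library implementation.

```python
import math


def dbscan(points, eps, min_pts):
    """Label every point with a cluster id (0, 1, ...) or -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbours(i):
        # All points within distance eps of point i (including i itself).
        return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # too few neighbours: noise, unless claimed later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


# Two dense groups and one isolated point (hypothetical values).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

In a copy-move setting, a dense group of near-identical descriptors coming from different positions inside the restricted text fields would be the signal worth inspecting, while isolated points are ignored as noise.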
So we focus on the size, the line, the position, the number of characters in each line, the skewness of the characters, the colors (the dominant one and the average one), the gaps between the characters and the alignment. All of this is used to detect whether a new character has been introduced in the document: whether its size is bigger or smaller than the normal size, whether there is a skew, or whether the alignment is off. To do so, after building the data set, we train a model. It's a one-class classification model for novelty detection, since we are training only on genuine documents. The output of this model is a global score that gives us the number of abnormal characters detected in the image. The abnormal characters are the forged ones, and the regular characters are the genuine ones. So we fuse the copy-move detection technique and the character-morphology-based tampering detection to provide a solution for forgery detection. With this combination, we have solved the second assessment. After getting these two forgery detections, we try to minimize the classification error by computing a global score. Here we assign a weight to each technique in order, as I said, to reduce the classification error. So here is a summary of the different components of our ID document verification module. First, as an input, we have a face-matching score. We also have the data that the user introduced: name, birth date, validity, et cetera. And, most importantly, we have the image of the ID card, passport or driving license. First of all we do a pre-processing step, cropping the image and enhancing its resolution or brightness, et cetera. Then we apply the three forgery detection techniques to get a global score and to say, OK, this ID document is forged or not.
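The weighted global score mentioned above could look roughly like the sketch below. Everything concrete in it is an assumption: the technique names, the convention that each technique returns a score between 0 (looks genuine) and 1 (looks forged), the weights and the 0.5 decision threshold are hypothetical. The talk only says that a weight is assigned to each technique so as to reduce the classification error.

```python
def global_forgery_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-technique forgery scores, each assumed in [0, 1]."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def is_forged(scores: dict, weights: dict, threshold: float = 0.5) -> bool:
    """Flag the document when the fused score reaches the (assumed) threshold."""
    return global_forgery_score(scores, weights) >= threshold


# Hypothetical outputs of three detection techniques for one document.
scores = {"mrz_consistency": 0.0, "copy_move": 1.0, "char_morphology": 0.5}
# Hypothetical weights; in the talk they are chosen to reduce classification error.
weights = {"mrz_consistency": 0.2, "copy_move": 0.5, "char_morphology": 0.3}

fused = global_forgery_score(scores, weights)
print(round(fused, 2), is_forged(scores, weights))
```

Keeping the per-technique scores alongside the fused value also makes it easier to report which technique triggered the decision.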
If it's forged, we provide forgery proof. Coming to the implementation side, we are using Python and FastAPI in a Dockerized environment, then pushing the container image to AWS Elastic Container Registry and launching the instances on Amazon EC2. Now, coming to the case studies, we are using Impulse in different case studies in different countries, as was said at the beginning of this presentation. In the city of Reykjavik, it is being used for the participatory democracy portal, to allow citizens to raise topics of public interest, to comment on different proposals and to guide policy making. In the city of Gijón in Spain, the Impulse solution will be used to access the public services application. At the Ertzaintza, also in Spain, it will be used to file complaints entirely online, without the need to go to a police station. In Denmark, it will be used for electronic access to personal information and services. At Unioncamere and InfoCamere in Italy, it will be used for the enterprise digital drawer, which is completely focused on entrepreneurs. And in Peshtera in Bulgaria, it will be used for several services like civil registration and certification. So I would like to thank you all for attending our presentation and ask you to follow us on our social media, Impulse EU on Twitter and LinkedIn. If you have any question or query, please do not hesitate to contact us by email. We'll be pleased to answer all your questions. Thank you.
Thank you both, guys, that was fascinating. Of course, here in Spain we're used to the national ID card, but where I come from, in the UK, we don't even have a national ID card, let alone an electronic one, and there's a lot of resistance to people having to prove their identity. Anyway, we do have some questions for you, so let's dive into them.
You've touched on privacy and security concerns, which are obviously extremely important: the possibility of forgery, the sharing of our data and so on. What sort of take-up are you seeing in countries for electronic IDs in general? Is this something that is going to happen sooner or later anyway?
Yes, I think it's something that is going to happen sooner or later. For the moment there are differences not only among countries, but also among other types of public administrations, such as municipalities. There is also variety in the use cases that we are going to study. For example, in the city of Reykjavik they already have a big platform and other kinds of EIDs that are normally used. In Spain we have an EID, the electronic DNI, which has been circulating for quite some years now, but the use of its electronic functionalities as an EID is in the minority. In other administrations, such as in Bulgaria, this is the first experience that they are going to have with EIDs. So there is quite a variety among the administrations, but I think it's the future, and I think that sooner or later we are going to move to maybe a full EID paradigm.
I'm sure you're right. So regarding your specific onboarding system or method here, I imagine you already considered this, but have you looked at other biometric data or other things that we could capture with a smartphone, not necessarily with the camera, perhaps a fingerprint or a holographic image of the face, something like that, which we could incorporate and which might be more accurate, more unique?
Yeah, actually, using personal data is very sensitive, so by including biometric data we are raising the bar for the use of personal data even higher. And we are working on a pilot project.
And we think that the next step would be using this kind of data while preserving user privacy and data integrity.
OK. So we'll move to a couple of quick, very specific questions here. One is from Alejandro. He says: hello, do you use GANs to generate the documents?
No, only pictures taken with smartphones.
OK. And one from Daniel: you talked about the key points, and he asks if they're generated using OpenCV.
Yes, right.
OK, fantastic. So we went through those two questions quickly.
Because it was really short. Yeah. They are really specific questions. Well, it's a very popular package for image processing, and there is plenty of functionality you can take advantage of, so you do not have to reinvent the wheel. So yeah, we are using OpenCV.
It's a very exciting project, and what I find so exciting about it is the possibility of a sense of ownership over the data. So I'm really looking forward to it myself. We're almost out of time. It's been a fascinating talk. I'd like to thank you both very much indeed, and if anyone else has any more questions, please, as they said themselves, get in contact with them, use the platform, or use the contact details you saw at the end of their talk. So once again, Jesus and Kais, thank you very much indeed.
Thank you very much. Thank you for your invitation and have a good day.