So, a little bit of history. Data protection and privacy techniques have been researched since around 1850 in the United States, when the United States Census Bureau started to remove personal data from the publicly available census of United States citizens. Nowadays everybody knows that the internet is full of data, and roughly 36 percent of that data is related to medicine or the health care system, and nearly 25 to 26 percent is related to fintech, the economy, e-commerce, bank transactions, things like that. A lot of our personal data is in there, so privacy regulations such as the GDPR and the CCPA impose strict rules to provide strong protection for that kind of personal data. Techniques for data anonymization enable businesses and public administrations to adhere to these regulations and protect that data from misuse or abuse. So, why is the use of anonymized data important? Let's take an example. Say that a hospital needs to share some of its patients' data in order to carry out research, a medical study. That data must be anonymized in order to protect the patients' privacy, which includes anonymizing names, bank accounts, ethnicity, sex. As I said, data anonymization seeks to protect private and sensitive data by deleting or encrypting identifying information. Anonymization is done for the purpose of protecting an individual's or a company's private activities while maintaining the integrity of the data gathered and shared. So, when anonymized data is shared it should keep its whole integrity, and only the sensitive parts should be anonymized; that's the idea. Data anonymization is also known as data obfuscation, data masking, or de-identification.
The GDPR defines special categories of personal data, and these categories cover the confidential attributes: religious or philosophical beliefs, political opinions, trade union membership, sex, sexual orientation or sex life, racial or ethnic origin, health data, and genetic and biometric data. This also includes sensitive health-related habits, such as substance abuse. So, the MAPA project — a few of you may have heard about the MAPA project before. Okay, one of my favorites. MAPA is the acronym of Multilingual Anonymisation for Public Administrations. The ultimate goal of the MAPA project is to develop and provide a full solution: a multilingual anonymization toolkit based on named entity recognition — most of you are surely familiar with named entity recognition. This named entity recognition should be applicable to all European Union languages, including low-resource languages such as Latvian, Lithuanian, Croatian, and Slovenian. The toolkit is not restricted only to European names or surnames, but also covers the most common names in all European Union countries, and it works in connection with machine translation, irrespective of whether the text is monolingual, bilingual, or mixes languages. This toolkit should be able to detect personal data — as I said before: names, addresses, emails, credit cards, bank accounts, among others. Moreover, it will be able to anonymize that data, so it will help public administrations comply with the GDPR, particularly in the health and legal fields. The following names are only shown to illustrate how many European companies, enterprises, and government bodies collaborate in the MAPA project.
Among them there is Pangeanic, a Spanish company that provides NLP solutions, adaptive machine translation, anonymization, and AI-powered translation services; Tilde, which provides translation services, chatbots, localization, and express transcriptions; the European Language Resources Association and ELDA, the Evaluations and Language resources Distribution Agency, from France; and the University of Malta, among others. So, what has been achieved by the MAPA project so far? They already provide pre-trained anonymization engines for 24 European languages, pre-trained machine learning models for the legal, clinical, and public administration domains, and annotated datasets for named entity recognition with nested entities; there is also currently a pilot application of the toolkit in the Spanish Ministry of Justice. All of these resources from the MAPA project are available online and open source, including the pre-trained models and the annotated datasets for all of the languages, and they also provide an online demo of how the anonymization works. So, what is anonymization in the GDPR context? Article 4 of the GDPR states that personal data means any information relating to an identifiable person, or data subject, who can be identified by name, identification number, location data, online identifiers such as emails, usernames, or URLs to profiles and websites, or by the physical, physiological, genetic, mental, economic, cultural, or social identity of a natural person. The GDPR also says that processing of personal data revealing racial or ethnic origin, political opinions, or philosophical beliefs for the purpose of uniquely identifying a natural person shall be prohibited. So, now we will show you some of the anonymization techniques and examples used within the MAPA project and our company. The general techniques to anonymize data include using gaps.
Using gaps means the replacement of the recognized entities with special characters, which can be underscores or a full block character. Using placeholders means replacing the entity with alphanumeric symbols of a similar length to the replaced entity; this preserves the original format of the document — if you are anonymizing a Word document or a spreadsheet, for instance, you should maintain the layout of that document. Using tags means substituting a predefined tag that preserves the entity-type information. And using pseudonyms means replacing an entity with another entity of the same type. For instance, here is an example of applying these techniques to the sentence "Albert was working in Japan's GM earning two millions." Actually, I think Albert was doing some kind of dirty business to earn that kind of money. We identify four entities: Albert is a name, Japan is a country, GM is an organization, and two millions is a quantity. Using gaps, we strike through all the entities. Using placeholders, we replace each entity — in this case with the letter X — keeping the same length as the recognized entity. Using tags, we put the entity type instead of the entity. And using pseudonyms, the sentence becomes "John was working in Britain's GM earning four thousand": we replace all the entities with pseudonyms. Well, in this world of machine learning, NLP, and named entity recognition, not everything is unicorns and rainbows. In named entity recognition we face many problems, including linked entities. For instance, "King Arthur" could be wrongly split into a title and a name. Or take "The Lord of the Rings": if you run that sentence through a named entity recognition model, it usually says it is a title — and it's not a title, it's a work of art, a movie, a book.
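The four replacement strategies above can be sketched in plain Python. This is only an illustration, not the MAPA implementation: the entity list mirrors the example sentence from the talk, the pseudonym table is made up, and a real system would obtain the entities from a trained NER model rather than a hard-coded list.

```python
# Entities as (text, type) pairs, as a NER model might return them for the
# example sentence from the talk. Everything below is illustrative.
SENTENCE = "Albert was working in Japan's GM earning two millions"
ENTITIES = [("Albert", "PERSON"), ("Japan", "COUNTRY"),
            ("GM", "ORG"), ("two millions", "QUANTITY")]

# Hypothetical pseudonym table for the pseudonymization strategy.
PSEUDONYMS = {"Albert": "John", "Japan": "Britain",
              "GM": "GM", "two millions": "four thousand"}

def anonymize(sentence, entities, strategy):
    """Replace each recognized entity according to the chosen strategy."""
    out = sentence
    for text, etype in entities:
        if strategy == "gaps":           # strike out with a block character
            repl = "\u2588" * len(text)
        elif strategy == "placeholder":  # same length, preserves the layout
            repl = "X" * len(text)
        elif strategy == "tag":          # keep only the entity-type info
            repl = f"<{etype}>"
        elif strategy == "pseudonym":    # swap for an entity of the same type
            repl = PSEUDONYMS[text]
        out = out.replace(text, repl)
    return out

print(anonymize(SENTENCE, ENTITIES, "tag"))
# <PERSON> was working in <COUNTRY>'s <ORG> earning <QUANTITY>
```

Note how the placeholder strategy keeps the character count of each entity, which is what preserves the layout of formatted documents.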
There are also ethnicity tags and demonyms. For instance, let's take an example with my company: if you say it is an American company, most named entity recognition models will say that "American" is an adjective and will not recognize it as an entity. That is wrong. And also addresses: addresses are a really, really common problem, because most countries write addresses in a different way. So, this is, at a glance, the architecture of the anonymization toolkit. The first step of the flow is to receive the unstructured private data as input — the data itself. Then we decompose that data into chunks: for instance, if we receive a whole document or a whole paragraph, we split that paragraph into sentences. After that, those sentences or chunks are sent to the AI entity-detection component, the machine learning model. Then we proceed to the replacement of the entities; as we saw before, we can use one of the techniques I showed you — gaps, pseudonyms, and so on. Then we create an index for reversing the anonymization, and that index is kept by the issuer of the anonymized data so that the anonymization process can be reversed. After that, the anonymized data is sent to the client, the final client itself. So, which tools did we use during the development of this anonymization toolkit? I like to call them the Holy Trinity: FastAPI, Pydantic, and Uvicorn. FastAPI because we created a REST API for the ingestion, validation, and anonymization of documents and text; Pydantic for data validation and fast serialization; and Uvicorn as the web application server. I also strongly recommend the book by François Voron called Building Data Science Applications with FastAPI — it's a really, really nice book; it's on Amazon, and I will share the slides when we finish the talk.
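The flow just described — ingest, chunk, detect, replace, index — can be sketched end to end in a few lines. This is a toy, not the toolkit itself: a regex that finds e-mail addresses stands in for the machine learning NER model, and the key format is invented for illustration.

```python
import re
import uuid

def chunk(document):
    """Steps 1-2: split the incoming document into sentence-sized chunks."""
    return [s for s in re.split(r"(?<=[.!?])\s+", document) if s]

def detect_entities(sentence):
    """Step 3: entity detection. A toy e-mail regex stands in here for the
    machine-learning NER model used in the real toolkit."""
    return [(m.group(), "EMAIL") for m in
            re.finditer(r"\b[\w.]+@[\w.]+\.\w+\b", sentence)]

def anonymize_document(document):
    """Steps 4-5: replace each entity with an opaque key and build the
    reversal index that stays with the issuer of the anonymized data."""
    index = {}
    out_chunks = []
    for sentence in chunk(document):
        for text, etype in detect_entities(sentence):
            key = f"[{etype}-{uuid.uuid4().hex[:8]}]"
            index[key] = text            # kept so the process can be reversed
            sentence = sentence.replace(text, key)
        out_chunks.append(sentence)
    return " ".join(out_chunks), index

doc = "Contact Ana at ana@example.org. She will reply soon."
anonymized, index = anonymize_document(doc)
assert "ana@example.org" not in anonymized
assert "ana@example.org" in index.values()
```

The returned index never leaves the issuer; the client only receives the anonymized text.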
So, future expectations — what to expect from the MAPA project and from us as a company. We are trying to improve the MAPA datasets for future public and private models, increase the data in the legal and clinical domains, and also provide customized spaCy models using deep learning with Thinc — I don't know if you are familiar with Thinc; it's from the creators of spaCy itself — as well as cascaded arrangements of NER models. Which remarks do we have so far? Anonymization is still a research area, sadly without too many applications. Anonymization is a strong approach to keep documentation readable and safe at the same time. The public named entity recognition models provided by PyTorch and spaCy are, up to a certain point, excellent for creating an anonymization solution; actually, on the spaCy website they have a really, really nice example of how to start with their models, and the English model is great. For the creation of custom models, as everybody knows, we need a lot of data — and not just any data, good data — and in this case we can reuse the MAPA project data, revisit it, and improve it. Also, we should define our tag list really carefully, because private data is a really, really long set; it can grow as much as we let it. Okay, thank you very much. Thanks, Oscar, for the great talk. We have a lot of time for Q&A, so if someone wants to ask anything, please use the mic. Thanks — can I use this as a library, not as an API? I'm sorry? Can I use it as a library, not as an API? Yeah, yeah, definitely. Okay. Actually, if you use the spaCy solution, you can use it as a library: you load the model and you can build your own solution on top. Great. Hello, my name's Siegfried, and I have a question. Do you intend to go further than just substituting entities with your AI model?
For example, if I have a text and all of my personal data is redacted, you can still find out it's me because of the grammar and the vocabulary I use, by similarity to other texts I've written. Is the next step that you can actually change the text itself? Sorry, can you repeat — like it was... Oh, sorry, is it on? Oh, okay, now it's working. Yeah, I was asking if you plan to go further than just substituting very specific entities, like my name and my sex and my ethnicity — for example, changing the actual grammar or vocabulary used in the text. Yeah, yeah, you can actually do that. Oh, that is awesome, I like it, thanks. Actually, one of the first and most common techniques in named entity recognition is dictionary-based named entity recognition, and that also addresses this. Thank you for your talk. I have a question. As you mentioned, spaCy is a great solution for the English language. Yes. If we take French, or maybe Spanish, the entity recognition becomes more tricky. Do you think that combining anonymization with synthetic data generation can reinforce it or sometimes replace it? I don't know if you have tried to play around with that... Yes — actually, Spanish is my mother tongue, which is the cause of my really bad English. So yeah, in those cases — I mentioned the English model because it has a really, really high score; the dataset used for that model was really great. spaCy itself provides a lot of models. You can also use Flair, another framework based on PyTorch, but its numbers are not as good as spaCy's, and Flair is quite expensive in terms of resources when deploying the model.
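The dictionary-based named entity recognition mentioned in that answer can be sketched as a simple gazetteer lookup. The word lists here are purely illustrative; production gazetteers are much larger and usually combined with statistical models.

```python
# A minimal dictionary-based (gazetteer) named entity recognizer:
# every token of the text is looked up in per-label word lists.
# The gazetteer contents are purely illustrative.
GAZETTEER = {
    "PERSON": {"Albert", "John", "Siegfried"},
    "COUNTRY": {"Japan", "Britain", "Spain"},
}

def dictionary_ner(text):
    """Return (token, label) pairs for every token found in the gazetteer."""
    hits = []
    for token in text.replace(",", " ").split():
        for label, forms in GAZETTEER.items():
            if token in forms:
                hits.append((token, label))
    return hits

print(dictionary_ner("Albert moved from Spain to Japan"))
# [('Albert', 'PERSON'), ('Spain', 'COUNTRY'), ('Japan', 'COUNTRY')]
```

Gazetteers are precise for the names they contain but cannot generalize to unseen names, which is why trained models are layered on top.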
Usually in these cases the best solution is a pre-trained model — a custom model with your own entities and your own data — but that data should be tagged by experts in the field. You can also start from the models provided by spaCy and, beyond that, use human-in-the-loop techniques and provide some kind of feedback in order to improve the data. They also provide Spanish, Portuguese, even Japanese — they have even Japanese, whose symbols make it really, really difficult to identify entities, but they handle it. Hi, thank you for the talk. I have a question: what if the documents or the dataset you want to anonymize contain multiple languages — say English and Spanish at the same time, or any other combination? Do you see it as just running it multiple times with different models, or is there another way? You can actually use multiple language models. At the same time? At the same time, yeah. Actually, when you provide services to translation companies, they usually send you a document with two or more languages, so you must be able to support several languages in your solution; that case is already covered as well. I have another question. The use cases that you showed fit well for textual data; do you have any idea or solution already for tabular data? What I'm especially interested in is the ability to not reverse the data — there was an example, I think it was Netflix, which published tabular data for a competition which was anonymized, and then someone de-anonymized it with some tricky manipulations. Actually, one of the solutions we are currently working on is anonymization of databases. As I said, in order to anonymize data you should always split that data into chunks, so, speaking of tabular data, you would for instance anonymize one cell at a time. So there is no problem with that.
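The cell-at-a-time idea from that answer can be sketched with the standard library's csv module. This is a hypothetical illustration: a toy e-mail rule stands in for the per-cell entity detection that the real toolkit's model would perform.

```python
import csv
import io
import re

def anonymize_cell(cell):
    """Treat each cell as an independent chunk. A toy e-mail rule stands in
    for the machine-learning entity detector here."""
    return re.sub(r"\b[\w.]+@[\w.]+\.\w+\b", "<EMAIL>", cell)

def anonymize_csv(raw):
    """Read tabular data, anonymize every cell, and write it back out,
    preserving the table's structure."""
    rows = list(csv.reader(io.StringIO(raw)))
    out = io.StringIO()
    writer = csv.writer(out)
    for row in rows:
        writer.writerow([anonymize_cell(c) for c in row])
    return out.getvalue()

raw = "name,contact\nAna,ana@example.org\n"
print(anonymize_csv(raw))
```

Because each cell is processed independently, the table's shape and headers survive the anonymization untouched.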
You can anonymize a spreadsheet, presentations, XML files, but that processing is always on your side, not inside the machine learning model — it only identifies entities, so the rest must be done by you. Thanks for the very interesting talk. Thanks to you. I have a question about the MAPA project. As far as I know, they have models for several different things — for example, legal texts, clinical texts, and so on. Do you know if there are performance differences between these domains? Yeah, yeah — you mean performance in terms of the score of the models, or the recognition? Yeah, definitely, definitely. And which is working better, which is worse? No, I don't have that answer at hand right now. Thanks. But yeah, definitely there is a lot of work to be done regarding those models — and the data, by the way. Hi, thanks for the great talk. Thanks to you. It might be a silly question, but can you reverse the anonymization if you don't have the original data? Sorry? Can you reverse-engineer the anonymization if you don't have the original data source? Yeah — usually when you anonymize a document, for instance, you create an index. Oh, okay. So you identify the entity — for instance, your name — and you define an index entry for that entity. And that index is held by the organization that anonymized the data, in order to provide a way to reverse the process and recover the data. All right, thanks. All right, thank you very much.
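That issuer-held index can be pictured as a simple mapping from the opaque keys placed in the anonymized text back to the original entities. The key format below is invented for illustration; the point is that without this mapping, reversal is not possible.

```python
# Hypothetical issuer-held reversal index: opaque key -> original entity.
index = {"[PERSON-1]": "Ana", "[EMAIL-1]": "ana@example.org"}

anonymized = "Contact [PERSON-1] at [EMAIL-1]."

def deanonymize(text, index):
    """Restore the original entities using the issuer-held index."""
    for key, original in index.items():
        text = text.replace(key, original)
    return text

print(deanonymize(anonymized, index))
# Contact Ana at ana@example.org.
```

Anyone without the index sees only the opaque keys, which is what makes the scheme safe to share while staying reversible for the issuer.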