Hello and welcome back. Now, at some point in your life you're probably going to need to speak to a lawyer, and if you're unlucky, probably more than once. And if you've already spoken to a lawyer a few times, I'm sure you've come across what we call legalese. It's that way lawyers have of speaking; if there's a way of making things more complicated, they tend to find it, don't they? Sometimes it's really hard to understand exactly what a lawyer says. For example: "In witness whereof, the parties hereunto have set their hands to these presents as a deed on the day, month and year hereinbefore mentioned," which in plain English means the date.

Lawyers spend a lot of time organizing and going over documents, which is why the law is an excellent candidate for natural language processing, or NLP, and that is exactly what our next speaker is here to talk about. He is Álvaro Barbero, chief data scientist, and he's here with us right now. Álvaro, great to see you, thanks for joining us here and coming in person. How are you doing?

Great, a little bit nervous, but well, it's my fourth time at Big Things, I think, so I'm really happy to be here again.

You're a veteran, Álvaro, I'm sure you're going to be absolutely fine. We're all very much looking forward to what you have to tell us, so when you're ready, I'll let you take it away.

OK, great. So yeah, I'm going to speak about natural language processing, and in particular natural language processing in the law sector. Before starting, let me just tell you a brief story. This is going to be a story about going places, and what I mean by going places is this: since the beginning of mankind, we have had to solve a transportation problem. You want to go somewhere else, you want to go from point A to point B. So how do you solve this problem?
Well, the main solution, or let's say the most common solution for most of the history of mankind, has been: you want to go there, you just walk there, right? And it has been like that for most of history, since the advent of Homo sapiens. But fortunately, about 5,000 years ago we developed new methods to solve this problem: you reduce the physical effort you need to go somewhere by putting that effort onto a different being. For instance, we invented horse riding. Unfortunately for the animal, it means the horse does the effort we would normally do. This changed again in 1886, when the first automobile was created; now we place the physical effort we need to move ourselves not onto another living being but onto a machine created by technology. This machine is powered by fuel, and the physical effort we need to perform is almost zero.

Now, the interesting thing is that this trend is starting to change right now, because in the near future what we will probably have in our streets will be self-driving cars. And the change in the trend here is that we are no longer creating this technology to reduce physical effort, but to reduce cognitive effort. That means that when I want to go somewhere else, I don't have to think about all the little details of the plan, the route I want to take; I just tell my car I want to go to this point, and the car does everything for me: the physical effort, the thinking about the driving, all the decisions. This is the trend we are seeing in most artificial intelligence applications: the point of this technology is no longer to reduce physical effort, but to remove the need to do these menial, repetitive tasks, tasks that require effort but are not really so useful or interesting for us.

So this is the point, and this is the application we tried to develop for the legal sector. This application was all about trying to create a map of all the information that pertains to a court case, and I'm going to describe what this system was about and all the lessons we learned from this project.

First, let me give you a brief disclaimer. The system I'm going to talk about was a real system working with real court cases, but of course I can't show you any real documents here, for legal reasons. Everything you will see is either a public legal document or a mock-up, but you will still get a good idea of the kind of documents we had to deal with.

So this is the Mapa del Expediente project, which is the Spanish name for the court case map project. This was a joint research and development effort between my company, the IIC, and Garrigues, which is one of the biggest players in the legal sector. The main point was to create an artificial intelligence solution that would remove this cognitive effort from the everyday lawyer's work: let's build a system that takes as input a lot of different documents from the same court case, somehow organizes that information, extracts the relevant data, and plots a map of the whole court case, so the lawyer can have a better day going through all the information and looking for the relevant data.

OK, so this is the team at the IIC that worked on the project. You can see it's a diverse team with different profiles: we had people from data science, computer engineering, computational linguistics, and experts in graph visualization.
It was thanks to this diverse team, and also to the help we got from the Garrigues lawyers, that we were able to build this solution.

So, like every big project, everything started by drawing everything on a napkin, right? You know every big project starts like that: you get an idea, you draw it on a napkin in the pub, and you build the project out of that. I would like to say this is a true story and that we keep this napkin somewhere at our company; unfortunately, it's not like that. What I actually did was draw this diagram in a cheap application on my laptop. But still, you get the idea: we started with something very, very small, just a diagram of what we wanted to build. You can see here a small graph showing the relevant people in the court case, the relevant companies, how they are interconnected through different court case files, and the kinds of files we want to identify. This was the core idea.

Since we had already been working with natural language processing techniques for a few years, we identified the key tasks we needed to solve to achieve this. Mainly they were: first, classify the documents into different kinds that would be useful for the lawyers; second, read automatically through all these documents and find the people and companies mentioned in them; and then, after gathering all this information, create a visualization that mixes everything together and is helpful for the lawyers.

If you think a little bit about this, and you know a little bit about natural language processing, you might think: well, this can't be so difficult, right? It's about document classification, named entity recognition, and then plotting everything together. We had a team of experts in all these fields, so we said, well, this is going to be easy peasy: we just need to apply all these fancy deep learning models to the data and everything will work out. Well, let me tell you something: reality happened. What this means is that we found a lot of obstacles we had to overcome, so let me tell you about a few of them.

First, we expected the data to be as nice as this document you can see here. Again, this is not a real document, just a public one, but it gives you more or less an idea of what the real documents look like. We expected something like this: you could just take the text out of this nice PDF document and work with that raw text. Unfortunately, reality was more like this: you get documents of, let's say, varying quality, with tilted pages, markings, things somebody wrote in the margins, post-its, and all that kind of stuff. So we couldn't just get the text from there; we needed to apply optical character recognition techniques to get the text out of these documents.

Now, if you have ever worked with OCR methods, you know they are not perfect, and we soon realized that even if we expected nice quality text out of the documents, most of the time you would get something like this: wrongly identified characters, and text that should be in a margin embedded inside the main text. So we had to build machine learning models that were robust to these errors, and also to filter out the worst transcriptions altogether, as I'll mention again in the pipeline overview.
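To make that concrete, here is a minimal sketch of the kind of quality heuristic one could use to flag poor OCR transcriptions. This is just an illustration in Python, not the actual heuristic from the project; the word pattern and the threshold are assumptions.

```python
import re

# Rough OCR quality proxy: the share of tokens that "look like" real words.
# A token looks like a word if it is purely alphabetic (including Spanish
# accented characters) and of plausible length.
WORD = re.compile(r"[A-Za-zÁÉÍÓÚÜÑáéíóúüñ]{2,20}$")

def ocr_quality(text: str) -> float:
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(bool(WORD.match(t)) for t in tokens) / len(tokens)

def keep_page(text: str, threshold: float = 0.6) -> bool:
    """Drop a page transcription whose quality score is below the threshold."""
    return ocr_quality(text) >= threshold

print(ocr_quality("El contrato fue firmado en Madrid"))  # close to 1.0
print(ocr_quality("E1 c0ntr@to fv3 f|rmado 3n M4drid"))  # much lower
```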
Now, the next hindrance we found might surprise you. When we were going through the pages of these real documents, we found that sometimes you get a page like this one. What the hell is this? It's a blank page, but not entirely blank, because you can see some markings there. What's actually going on is that maybe somebody decided to print out a document one-sided, and then somebody else decided to scan it back in as a two-sided document, and of course you get a blank page for each page of real content. Sometimes these were easy to identify, like this one, but sometimes you get other pages with no real useful information, like covers or document separators, and we had to develop a general technique to remove this data, because they were noisy pages.

Going on with the problems, this was another surprise for us. We actually expected that when we got the data for a court case, we would get one PDF file for each of the documents involved in the case. What we actually found were PDF files that were really long, a hundred or even a thousand pages. We thought, well, maybe these are legal documents and they really are that long. Actually, no: what happened is that somebody scanned a stack of different kinds of documents all together into the same PDF file. So we had to develop a method that looks at two consecutive pages and asks: are these two pages part of the same actual document, or did it just so happen that somebody scanned them together? We had to develop a method to detect this and split the documents.

And finally, and this was probably the largest obstacle to applying machine learning techniques: all the data we had was completely unlabeled. This means we could not apply any supervised machine learning methods here. And whatever you may know about unsupervised machine learning methods, let me tell you, those methods only work sometimes, and in this context they weren't going to work, because the problem was very difficult. So the solution was to create our own training and testing datasets out of this unlabeled data.

You can see we had many more tasks than we initially planned, but we still went ahead with the project, so let me show you a brief overview of the whole system. The system works like this. You get all the PDF files pertaining to a court case. First we run the optical character recognition methods. Then we run some heuristics to remove those OCR transcriptions that were of poor quality, because we realized that very poor transcriptions were adding noise to the system and it was better to remove them completely. Then you get a bundle of three machine learning models that will first remove non-informative pages from the data, then split each PDF file into the real individual documents, and finally classify each of those actual documents.

In parallel, we have another machine learning pipeline that takes only the useful information and applies a named entity recognition model to find the persons and organizations appearing in the data. That, of course, was not going to be so easy, because you will often find that the same person is spelled differently within the same case: maybe because it really is a different spelling, maybe because we have the surname in some documents and not in others, or maybe because of those OCR mistakes I told you about before. So we had to run some heuristics again to group and clean those entities.
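To give you a flavor of what such grouping heuristics can look like, here is a tiny sketch based on plain string similarity. The threshold and the greedy clustering strategy are assumptions for illustration; the heuristics in the real system are not shown here.

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two mentions as the same entity if their strings are close enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def group_mentions(mentions: list[str]) -> list[list[str]]:
    """Greedy clustering: attach each mention to the first matching group."""
    groups: list[list[str]] = []
    for m in mentions:
        for g in groups:
            if similar(m, g[0]):
                g.append(m)
                break
        else:
            groups.append([m])
    return groups

print(group_mentions(["Juan Pérez", "Juan Perez", "JUAN PEREZ", "Acme S.L."]))
# [['Juan Pérez', 'Juan Perez', 'JUAN PEREZ'], ['Acme S.L.']]
```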
And then, with all this extracted data, we could finally build a visualization showing all the information together.

So let me go into the details of each of these pieces. First, how did we build the training and testing data; how did we build the corpus? Well, Garrigues was kind enough to provide us with all the data pertaining to six different court cases. We kept one of those as the test dataset and the rest for training. You might think that six cases is not such a large quantity, but let me tell you, these cases were really huge in the amount of paperwork: more than a thousand documents, around 80 gigabytes of data, just from these six court cases.

So we started with that; nothing was labeled. First we wrote annotation guidelines to define which document classes we actually wanted our machine learning models to recognize. This was joint work between our experts in computational linguistics and the lawyers from Garrigues, to find out what kind of automated classification would be useful for the lawyers, and then to check in the documents whether we could actually annotate them in a way that would be useful. We started with a taxonomy of 13 categories of documents, but in the end we realized that some of these classes did not have enough representation, so we merged some of them together and ended up with eight different classes.

Once we had this formally defined, we started manually annotating the data. We had two annotators on our team, and they independently went through a lot of different pages, 28,000 pages, annotating all of them manually to say which kind of document each page belonged to. Then we mixed their annotations and reviewed the conflicting cases where the annotations mismatched, but we found that about 80% of the pages got the same label from both experts, so we considered that a good enough annotation. It took a lot of time, as you can imagine, but we obtained a nice training and testing dataset for this classification problem.

Then, for named entity recognition, we had to repeat something similar. We selected the kinds of entities we wanted to detect. At first we were interested just in people and companies, organizations let's say, but we also added the location entity to the annotation guidelines. This might seem trivial, but it's not always so straightforward to decide whether something is a place or an organization. If I tell you about the Spanish Ministry of Justice, you might think, well, that's trivial, it's an organization; but it's also actually a physical building in Madrid. So you need more detailed instructions on how to handle this, and that is just one case; there were many more conflicting cases. Again we defined these guidelines together with the lawyers from Garrigues, and then, with these formal annotation guidelines, we went through the job of annotating 500 different pages. This time the number of pages was much smaller, simply because annotating entities is much harder: you can't just glance at the document and know what it is, you have to read through all the words in the document. Still, we got an inter-annotator agreement of about 83%, which is still quite good, and with that we had our training and testing data for this problem as well.
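As an aside, those inter-annotator agreement figures are simple to compute. Here is a minimal sketch of raw percent agreement, with Cohen's kappa as a common chance-corrected complement; the talk only quotes raw agreement, so the kappa and the example labels below are illustrative, and scikit-learn is assumed to be available.

```python
from sklearn.metrics import cohen_kappa_score

def agreement(labels_a: list[str], labels_b: list[str]) -> tuple[float, float]:
    """Raw percent agreement and Cohen's kappa between two annotators."""
    assert len(labels_a) == len(labels_b) and labels_a
    raw = sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
    return raw, cohen_kappa_score(labels_a, labels_b)

# Hypothetical page labels from two annotators over the same four pages.
ann1 = ["contract", "ruling", "contract", "annex"]
ann2 = ["contract", "ruling", "annex", "annex"]
print(agreement(ann1, ann2))  # (0.75, with kappa somewhat lower)
```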
So now we have our dataset; let's talk about the machine learning models. Focusing on what happens when you input one PDF file into the system: first we run it through the OCR method, so we have a transcription, let's say the raw text, for each of the individual pages of the document. Then we have the first machine learning model, which takes a look at each of those pages and removes it if the model decides it contains no useful information. Next comes the first branching of the machine learning pipeline, which partitions the document into the sets of pages that form the actual, real documents; remember those documents that were stuck together? We are going to split them. Then we have the classification model, which assigns a label to each of those actual documents. In parallel, we have the named entity recognition model, which annotates the entities, persons and organizations, appearing on each page, and we then join this information with the output of the partitioning model to obtain entities at the document level, after all those heuristics I told you about for cleaning and grouping entities.

So let's go into each of these models in a bit more detail. The first model, informative versus non-informative, is conceptually quite simple: it's just a binary document classification model, so we can train it with the data we labeled manually, and it tells us which pages we need to remove.

The partitioning model is not so simple, because here we are solving a non-standard natural language processing problem: we have a very long document, with hundreds or thousands of pages, and we need to find the points at which to split it into separate documents. The way we solved this was to transform it into a binary classification problem. We created a model that takes a look at one page and the next page in the document and asks the following question: does this new page continue the current document? We trained a model to answer this question, and whenever the model answers "no, it doesn't", we split the document at exactly that point. I'll show a small sketch of this idea right after this overview.

After that we can run the document classifier model, which is conceptually easier: we take the splits created by the previous model, analyze all the pages there, and assign a label from the eight-class taxonomy we created. And finally, the named entity recognition model is also quite standard: analyze a page and mark where the entities are, bearing in mind all those heuristics I told you about.
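Here is that sketch of the page-pair classifier, using the Hugging Face transformers library and a public Spanish BERT checkpoint (BETO, which I'll mention again shortly). It assumes the model has already been fine-tuned on labeled page pairs; the checkpoint name, label convention and truncation settings are illustrative choices, not the project's actual configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "dccuchile/bert-base-spanish-wwm-cased"  # BETO, for illustration
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT,
                                                           num_labels=2)
model.eval()

def continues(page_a: str, page_b: str) -> bool:
    """Sentence-pair classification: does page_b continue page_a's document?

    Label 1 = same document, label 0 = page_b starts a new document.
    (Meaningful only after fine-tuning on labeled consecutive page pairs.)
    """
    enc = tokenizer(page_a, page_b, truncation=True, max_length=512,
                    return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return int(logits.argmax(dim=-1)) == 1

def split_documents(pages: list[str]) -> list[list[str]]:
    """Walk over consecutive pages, cutting wherever the model says 'new'."""
    documents = [[pages[0]]]
    for prev, page in zip(pages, pages[1:]):
        if continues(prev, page):
            documents[-1].append(page)
        else:
            documents.append([page])
    return documents
```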
Now, what's inside these tiny little robots I have been showing you? What are the real machine learning models? Well, if you know about natural language processing, you might have already guessed: they are language models. I have been talking about language models at this conference for the past two years, I think, so you might be bored of them already, but if you don't know them, let me explain. The key idea is that when you want to solve a very complex natural language processing task, you cannot just do it directly; what you need to do is split the problem into two tasks. The first task is to create a model that, let's say, understands the structure of your language. What the model actually does is learn the distribution of words that is most frequent in your language, be it English, Spanish, or whatever language you are building the model for. Then, once you have this general language model for a particular language, you perform the second task, which is fine-tuning that general language model for the task you want to solve, be it document classification, entity detection, or whatever.

This way of working has been very effective for the last two or three years. The first method to really show that this was a good idea was the BERT model for the English language, and this model is openly available: you can just download it, it's freely available on the internet, and run only the fine-tuning step, so half of the work is already done. There are a lot of models available for English, but unfortunately not so many for Spanish. When we started this project there was only a single language model for Spanish that was good enough: BETO, the BERT model for Spanish created by the Universidad de Chile. We tried that model on all the machine learning challenges we had in this project, and these are the results. We compared more classic natural language processing methods against this Spanish BERT model; you can see the classic methods on this slide as the blue columns and the new language model methods as the orange columns. There is already a significant improvement, especially on the problems that are considered harder. Finding out which pages are informative is very easy, so there is not such a large difference there; but as we move to harder and harder problems, the difference between the models becomes quite significant. For entity detection we didn't even try the classic methods, because they were very hard to apply to this problem, and the language model seemed to work very well.

OK, so you have seen the whole path here: identifying all the new problems that appeared when we tackled the actual project, and then solving each of them with different techniques, applying the latest advances in natural language processing. But now the question is, and we might ask BERT this question: are these results good enough? And what BERT might say is: they are never good enough, we have to do better. So how can we do better?

We developed the following idea. The standard way of applying language models is the one I just described: you pre-train your language model on general data from the language, and then you fine-tune it to a particular problem. We added a new step here, because the legal domain is very specific in the wording and expressions it uses, even within the same language. The key idea is: let's take a general language model for Spanish and adapt it to legal Spanish, which is a particular kind of Spanish, and then fine-tune that model for the different tasks in this project.

How did we do this? First we gathered a large dataset, a corpus of Spanish legal and administrative documents. We gathered different kinds of open-source data, about eight gigabytes of documents, and ran significant processing to clean and deduplicate it, ending up with about three gigabytes of data, about half a billion words, which is not such a bad corpus. Then we performed this domain adaptation in two steps. First, we used the corpus we gathered to adapt the BETO language model for general Spanish into what we call Legal BETO, a BETO language model specialized for the legal domain. But we had another corpus we could use, right? All the court cases that Garrigues provided us. So we performed a second adaptation step using that data, and with this we obtained what we call the Garrigues BETO language model. The key insight here is that this last model has seen the data with all the mistakes produced by the OCR, so in this way we make the model robust to the errors we will find in the real data in production.
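For reference, this domain-adaptation step is essentially continued masked-language-model pre-training on in-domain text. Here is a minimal sketch with the Hugging Face transformers and datasets libraries; the file path, hyperparameters and output name are placeholders, not the settings actually used in the project.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "dccuchile/bert-base-spanish-wwm-cased"  # general Spanish BETO
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForMaskedLM.from_pretrained(BASE)

# One legal/administrative document (or chunk) per line in a text file.
corpus = load_dataset("text", data_files={"train": "legal_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Standard BERT-style masking: 15% of tokens are masked for prediction.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="legal-beto-sketch",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```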
So here are the results we got. You can see, as the green bar, either Legal BETO or Garrigues BETO, whichever worked best, applied to the tasks I described before. You can again see some advantage from this domain-adaptation step. We also tried these problems with an openly available language model for Spanish legal language, called RoBERTalex, which was made publicly available this summer, and unfortunately the results weren't as good; maybe for the reason I mentioned, that we had also adapted our model to the OCR data. So it seems this custom fine-tuning and domain-adaptation step worked very well. As you can see, we can go a step further, incorporating new techniques from the state of the art of natural language processing to produce a better solution.

So where are we standing right now? Well, the journey still continues with this project. What we want to achieve is a single system that will take all the PDF files from a court case and produce a visualization like this one. This is just a mock-up, but you already get the idea: we would like to draw an actual map of all the people, companies and documents that are related in the court case, and how they connect to each other through different documents, so that the lawyers can use it to extract interesting information. Of course, for a real court case this graph will be huge, so we are still defining with Garrigues how we can present it in a way that is useful. And we are incorporating this into a tool Garrigues already has, a tool we also created together with Garrigues some years ago, which allows lawyers to search for data in unstructured files: PDF documents again, but also audio transcripts and so on.

So this is the project I wanted to share with you. You can see that going, let's say, from the lab to the law, from the actual experiments we do with language models to the real application in an actual project, involves a lot of different hindrances; but still, all these fancy deep learning methods you can find around are useful, and you can apply them in a real project. I hope you learned something from the lessons we have shared here, and, well, maybe see you again at Big Things next year.

Oh, sorry, I forgot something really relevant. That was the talk I wanted to give, but since I have, I think, five minutes left or so, I wanted to tell you something else. As I said before, last year I was also present here at Big Things, and I talked about a, let's say, interesting project we have been working on. This project is called RigoBERTa, and its aim is to produce our very own language model for the Spanish language. The plan was to follow five key points to produce this model: use a very large training corpus; place a lot of focus on the cleaning of that corpus; use better training hardware; use better neural architectures from the latest advancements in natural language processing; and apply the domain-adaptation ideas I described before, which we actually applied in the legal domain.
The timeline of this project has been two years long. We started two years ago; last year at Big Things I presented the alpha version, which had some interesting results, but we weren't quite there yet. Today I can say that we have completed this project: we have this language model in its 1.0 version. Let me share with you very quickly some interesting facts about it.

As I told you, one of the key points was to use a very large training dataset. We used the Spanish Unannotated Corpora, which is the same corpus used for the BETO model, but we also added two huge open web crawls of the Spanish language, OSCAR and mC4, and also our very own dataset, collected from different media outlets. We also tried to incorporate the latest developments in deep neural networks for natural language models, which meant using the DeBERTa architecture. You can see here, in the SuperGLUE challenge, that DeBERTa was the first model to achieve a performance above human performance; right now it's in third position in that ranking, but let me tell you, when we decided to use it, it was in first position. So this is the architecture we used for RigoBERTa.

And of course we wanted to test how well this model works, so we compared it against the three models that we think are currently most representative of the Spanish natural language processing community. BETO is the one I just described; we also have MarIA, which is a joint project between the Spanish National Library and the Barcelona Supercomputing Center, and the BERTIN project, which is a community effort to create a language model. I have to say the MarIA version we used is the one released in the summer, based on the RoBERTa architecture; just last week a new version was released, so we really didn't have time to redo all the comparisons properly. But here you can see some of the results. Each line is a different natural language processing task, and a star marks the model with the best performance. You can see that RigoBERTa seems to perform very nicely on almost all of the datasets, sometimes by a small margin, sometimes by a large one, so we are getting very interesting results. Let me also say that we went to the extent of running what is called a McNemar test for statistical difference; this is a test that tells you whether there is enough statistical evidence that one model is better than another, and we have been able to check that RigoBERTa seems to be statistically better than BERTIN and BETO.
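For those curious, here is a minimal sketch of a McNemar test comparing two classifiers on the same test set, using the statsmodels library; the counts in the contingency table are made up for illustration.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table over the shared test set:
#                       model B correct   model B wrong
#   model A correct          520               35
#   model A wrong             12               33
table = np.array([[520, 35],
                  [12, 33]])

# The test looks only at the off-diagonal disagreements (35 vs 12).
result = mcnemar(table, exact=False, correction=True)
print(result.statistic, result.pvalue)  # small p-value: the models really differ
```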
So I just wanted to share this with you; it's a very nice piece of news for us. We plan to introduce this new RigoBERTa language model into all the projects we are working on, not just this project with Garrigues but also different fields where natural language processing can be applied: health, maybe the finance sector, and so on. So now, for real, I'm not tricking you anymore: this is the end of my talk. If you have any questions, I will be happy to answer them, and if there is no time for questions right now, you can find the links to my social media at the end of the talk and ask me later. Thank you for listening.

That was great, Álvaro, thank you so much. Isn't it ironic, in a way, that we're talking about natural language processing and lawyers don't speak in a very natural way at all? So I think maybe what you're coming up with here is an unnatural language processing model.

Well, we can say that every domain has its own way of speaking. The same goes for the health sector: the way doctors speak is not so similar to lawyers, and not so similar to general Spanish. That's why, if we can adapt these models from general Spanish to these particular professional kinds of Spanish, let's say, we get improved performance. So it is still natural language, but a small piece of the whole natural language space, for sure.

I was just joking with you. I loved your talk; a very nice analogy with transportation at the beginning, and your explanation of the reduction of cognitive effort and the trend that we're on helped us greatly to understand what you were talking about. And you didn't speak like a lawyer at all, you were very clear, so very good, I congratulate you.

It reminded me a little bit of the story of Van Halen. I don't know if you remember that story, of Van Halen and the M&Ms?

I'm not sure either.

OK, well, you know Van Halen, the famous rock band. They used to do very, very big tours, and each show they did required very lengthy contracts, so that all of the security was in place; they were very worried about health and safety and so on. So they put into their contracts a clause which said that at each event, backstage, they would want a bowl of M&Ms, the little sweets, but there would be no brown M&Ms in the bowl. They didn't say why; they just wanted to make sure there were no brown M&Ms, which seems totally trivial, a bit like a diva. But actually the reasoning was that if they found a brown M&M in the bowl, then they knew that people hadn't read their contract properly, and if they hadn't read the contract properly, then maybe there was some other health or security detail that they hadn't taken into account. So I was reminded of that story, and thinking, well, if they had this natural language processing model, then someone could read these contracts very quickly and pick up on the brown M&Ms.

Maybe we can train our models to find brown M&Ms, yeah, who knows.

So that's actually what I was going to ask you. Do you think this has a consumer usage in the future? I mean, could I, for example, scan a legal document and perhaps have it converted into a kind of plain English that I might actually understand? Could we get to that stage?

Hmm, well, I know there has been some work in trying to, let's say, translate general English into simple English, for people who are not so proficient in the language. Maybe we could try something like that, but the main issue here is always the data. As you have seen in this project, we had to manually annotate a lot of data to make the model do what we wanted, and I guess here it would be the same: we would need a lot of lawyers, and a lot of non-lawyers, let's say general people, to translate from lawyer English to plain English. Maybe it could work, but yeah, I don't know; we would have to find the applications.

We mentioned other sectors and how each one has its own idiosyncratic way of talking, but would you say that with the law in particular we've really identified the greatest challenge for you, or is there something even more difficult to understand, in terms of the wording and so on?

I think the law sector is difficult, but remember that I told you we need to apply OCR methods to extract the actual text from the documents; we didn't go to the extent of reading handwritten text into the model. That's another level of complexity.
And I don't want to name any professions, but there are professions in which the handwriting is way more difficult to read.

Yes, I think we can all think of those, I'm sure. OK, that's great. Well, Álvaro, we're almost out of time, so I want to once again thank you very much for your time, and I guess hopefully we'll be seeing you next year.

I hope so; that's the trend we're on.

Álvaro Barbero, thank you very much indeed.