So, hello everybody. We are Anne L'Hôte and Bruno Thomas, and we come from the ICIJ, the International Consortium of Investigative Journalists. It's the organization that released the Panama Papers that we just saw before. So it's more about journalism than science; thank you for admitting us. We are going to see how we can use an indexing engine like Elasticsearch or Solr to help journalists find facts and stories in really big document corpuses, or leaks, as we call them in the journalism world. We are going to see how we started from legacy components, what the journalists needed, how we used those bricks to build another component, another project called Datashare, and then the challenges and how we tackled them.

First of all, just a little parenthesis: legacy software very often has negative connotations; it sounds like a drag that you are pulling and that lowers the productivity of software development. Here it's used more in the first sense of the word: something that is transferred to someone, so it's an asset, and the asset has some drawbacks. We are going to see how we used the asset and removed the drawbacks to address the new needs of the journalists. The second thing about legacy is that it captures knowledge: we used these components because they embodied knowledge about natural language processing and about what had been done before on the huge leaks like the Panama Papers. And the other thing is that everyone produces legacy at some point; in fact, after only a few months of development, we often find that the code base we built a few months before is already legacy.

So the first brick was Extract. It's a command line tool that was used on the Panama Papers, which was 11 million documents and 2.3 terabytes, and also on Swiss Leaks and Luxembourg Leaks. It's about 10,000 lines of Java, and you can see the workflow here. Basically it scans the file system (we receive leaks on USB keys or hard drives or whatever) and puts all the paths in a shared queue. Then there are extractors, based on Apache Tika, that extract the text and feed a report map, so we can see afterwards which files were not extracted correctly, and send the text to a spewer that outputs it to several destinations, like a REST service, a file, or stdout; a minimal sketch of this flow follows below. When we used Extract, the output was most of the time Solr; that was the index we used at the time, and that was one of the problems we were facing; we will talk about this later.

Then, once all the files were indexed into Solr, there was another open source component that we used, called Blacklight. It's a Ruby on Rails application, a kind of framework that lets you build websites to browse big corpuses of data; it's used for example by libraries and universities. We used it to help the journalists query Solr without having to know the complexity of the indexer's query language and inputs. So it was basically quite simple: a front-end application, a back-end in Ruby on Rails, and then the index behind it.

Then there was this repository. It was already called Datashare, but it was not in production; it was not the same as the Knowledge Center that was used by a lot of journalists.
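Going back to Extract for a second, here is a minimal sketch of that scan, queue, extract, spew flow, just to make it concrete. It is not the actual ICIJ Extract code (the real tool implements the shared queue, the report map and pluggable spewers as separate, robust components); it only illustrates the idea using the public Apache Tika API.

```java
// Minimal illustration of the pipeline described above: scan the file system,
// queue the paths, extract text with Apache Tika, "spew" the result to an output.
// This is a sketch of the idea, not the real ICIJ Extract code.
import org.apache.tika.Tika;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Stream;

public class TinyExtract {
    public static void main(String[] args) throws Exception {
        BlockingQueue<Path> queue = new LinkedBlockingQueue<>();
        Tika tika = new Tika();

        // "Scanner": walk the leak directory and push every regular file onto a shared queue.
        try (Stream<Path> paths = Files.walk(Paths.get(args[0]))) {
            paths.filter(Files::isRegularFile).forEach(queue::add);
        }

        // "Extractor" + "spewer": pull paths, extract the text with Tika, write it somewhere.
        // Here the spewer is simply stdout; the real ones target Solr, files or a REST service.
        Path path;
        while ((path = queue.poll()) != null) {
            try {
                String text = tika.parseToString(path.toFile());
                System.out.println(path + "\t" + text.length() + " characters extracted");
            } catch (Exception e) {
                // In Extract, failures go into a "report map" so journalists can check
                // afterwards which files were not extracted correctly.
                System.err.println(path + "\tFAILED: " + e.getMessage());
            }
        }
    }
}
```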
Datashare was a proof of concept and did roughly the same thing at the beginning. It parsed and extracted with Tika as well; the new thing was the five (I don't know if you can read it) natural language processing pipelines that were used to extract named entities from the text and help journalists get a grasp of a leak quite fast, and it indexed these named entities into the index. So the two new things here were the natural language processing and the fact that it used an Elasticsearch index. So there was quite an overlap between Datashare and Extract, and we wanted to do approximately the same as the Knowledge Center but with an improved user experience. And yeah, that's it: those were the three main bricks, plus all the open source world that brought us other tools, and we can now explain how we built Datashare with them.

So, to sum up, these are the issues we had to fix. First, in Extract, the package itself was pretty big, so we decided to split it into two parts: one was only the Java libs, the JAR that could be included in our new tool, and the other part was the command line interface, to use it from your terminal but without any GUI. We also faced a scalability problem with the Solr index, because we had too many rows, too many documents in it, so we wanted to migrate from Solr to Elasticsearch to avoid it (a small indexing sketch follows at the end of this part). Just as a reminder, as Bruno said, the Panama Papers was, how many, 11 million documents, and our latest leak, I don't know if you heard about the Luanda Leaks two weeks ago about Angola, was more than 700 documents... 700,000, sorry, I forgot a few zeros. So yes, that's part of the problem.

About Blacklight: it's in Ruby on Rails, and I have nothing against it, but it was pretty difficult to modify the interface, and I will come back to this later. The web interface, the user experience, is really important for us, because journalists, and probably researchers too, are not tech savvy at all, so everything should be well designed to make their life as easy as possible, and our life easy too. Another problem was maintainability, both of the bricks we had and of the work the journalists wanted to do, because they already had a tool in production, the Blacklight-based one, that was working. So we had to develop a new tool without any break: they needed the same functionalities, without any interruption of service, on exactly the same documents. So that was part of the issues we had.

Then the challenges. As I said before, with the huge amount of documents journalists usually get at the beginning of a leak, the first question is: OK, where should I begin? So the, let's say, new challenge of Datashare was to give them some ideas of where to start and a global overview of the leak really fast, answering simple questions: how many documents are there, but also how broad is the corpus in terms of geographical locations or people mentioned in those documents. Just a global overview: what happened in this leak?
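Coming back to the Solr to Elasticsearch switch mentioned above, here is a hedged sketch of what the indexing step can look like with the Elasticsearch low-level Java REST client. The index name and field names (luxleaks, path, contentType, language, content) are made up for the example; they are not Datashare's actual mapping.

```java
// Hedged sketch: push one extracted document into an Elasticsearch index with the
// low-level Java REST client. Index and field names are illustrative only.
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class IndexOneDocument {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("PUT", "/luxleaks/_doc/doc-0001");
            request.setJsonEntity("{"
                    + "\"path\": \"/leak/ruling-0001.pdf\","
                    + "\"contentType\": \"application/pdf\","
                    + "\"language\": \"FRENCH\","
                    + "\"content\": \"...text extracted by Tika...\""
                    + "}");
            client.performRequest(request);   // one HTTP call, the document is searchable shortly after
        }
    }
}
```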
So those are challenges that journalists encounter really often, but I'm pretty sure that researchers face quite the same ones, and that's probably one of the reasons why we are here. To answer all these challenges and face all these issues, we developed Datashare. Here is the new schema of the tool. As you can see, part of the workflow is really the same as before: we succeeded in putting the proven bricks that we had, like Extract, into the new Datashare tool. So Extract is here at the very beginning of the process, at the scanning and indexing step, and all the data extracted by Extract, like the file name, is pushed into an Elasticsearch index. That index is the new part, and we needed to migrate the existing data from Solr to Elasticsearch, because it would have taken months to re-index all the leaks otherwise. And you can see here, at the very bottom, the NLP pipelines: we took the five NLP pipelines we had and developed a common API to interact with them. I don't know if I mentioned it before, probably not, but an NLP pipeline here is just a way to collect the named entities in the documents that we extracted before. So yes, that's it, that's the new Datashare, and I think now it's time for a demo. Let's cross fingers... yeah, it works.

So here is our demo website for Datashare. I should probably mention that it's an open source tool, so everything is on GitHub, and if you want to use it on your own corpus of documents (it doesn't have to be big, just documents that you need to analyze), it's free for you to use. Here is an example on LuxLeaks, which is a leak that we revealed a few years ago. Here is the list of projects, so you may have numerous projects here, several separate projects. If I click on LuxLeaks, I first get the number of documents in the leak, which is pretty important, and here is the list of documents. That's the first step.

On the left you have the facets. We built some facets on Elasticsearch, and those facets are the answer to that question of a global overview of the leak and of the documents. For example, here I can click on file types and I will see the different types of documents that I have in my corpus and how many documents there are of each type. In this corpus specifically I only have PDFs, so yes, I have thousands of PDFs in that corpus. I can do exactly the same for the languages: in one click I can discover that I have documents in English, French, German and Italian, which gives a first overview of the languages of the documents in that corpus. And for the NLP pipelines, here are the results: all the named entities extracted, for people, locations and organizations. If I click here, I can see that in these thousands of documents there are 47,000 different people mentioned, and one of the most mentioned is Mr. Kohl, who, as you all know, is the main stakeholder of the leak. We didn't know it before Datashare and before reading the documents, so that's real added value on this corpus.
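These facets map naturally onto Elasticsearch terms aggregations. As a rough sketch, again with the low-level REST client and made-up index and field names, this is the kind of query that answers "how many documents of each type and of each language" in a single call:

```java
// Hedged sketch of the facet idea: terms aggregations counting documents per file type
// and per language. Index and field names are illustrative, not Datashare's real mapping.
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class FacetCounts {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("POST", "/luxleaks/_search");
            request.setJsonEntity("{"
                    + "\"size\": 0,"   // we only want the counts, not the documents themselves
                    + "\"aggs\": {"
                    + "  \"by_type\":     {\"terms\": {\"field\": \"contentType\"}},"
                    + "  \"by_language\": {\"terms\": {\"field\": \"language\"}}"
                    + "}}");
            Response response = client.performRequest(request);
            // The response contains one bucket per file type and per language, with its document count.
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```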
I can do exactly the same with the locations, and here I will see Luxembourg, the US and the UK, which are the main countries mentioned in that corpus. So that's Datashare and the way we developed it.

That was the past and the present of Datashare. For the future, we would like to develop lots of things, like adding comments on documents: since you also have access to the documents themselves, we would like journalists to be able to put comments directly on them. We would like to create dashboards on each project to give a really quick overview, I don't know, how many documents of which type, a simple histogram of the creation dates, things like that. We would also like to redevelop the installers, because right now they are based on Docker and we have some doubts about that, because we have had lots of difficulties with Docker. The point, and I said it before but it's really important for us, is that journalists are really not tech savvy at all, so one of the most important things is to have a well-designed interface that is easy to manipulate, easy to navigate and dig into the corpus with, and easy to install on everybody's machines. And we want to build a plugin architecture to let developers write their own plugins. So that's it, and I hope you will find a use for this tool too; feel free to open issues.

Which one? The Edward Snowden corpus? So the question was whether this tool has been used on the documents released by Edward Snowden. I don't think so, because first, the tool is quite new (we released the beta version one year ago), and I don't know whether those documents are public or not, so maybe that's the reason. Not as far as we know; they are not using it. They should. Yes, we've heard about it; I'm not sure, because it does quite the same as us, I think, but not as well as us, for sure. There are lots of tools like this; it's not a new need, it has been the same need for years and years. Open Semantic Search does more or less the same, and Aleph, the one from the OCCRP, is another tool that does more or less the same, but I think that ours is the prettiest.

So the question was about the languages: how many languages are supported by the NLP pipelines, and how do you figure out which one a document is in? Oh yeah, it's a chicken and egg question, because we detect the language with a small framework called a language detector, and the problem is that when we do OCR we have to provide a language, but we have to run OCR to know what language the text is in. For the moment, for OCR we are using Tesseract, and it supports, I don't know, I think 30 or more languages, so we extract text from a lot of languages, but the extraction of named entities only runs on 5 languages, or 6 now I think: Chinese, English, German, Spanish and French.
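To illustrate that chicken-and-egg problem, here is a rough two-pass sketch with Tika 1.x: detect the language of whatever text the first pass produced, then hand that hint to Tesseract for a second, OCR-aware pass. It is only an illustration of the idea, not Datashare's actual code, and the mapping from ISO codes to Tesseract model names is reduced to a few examples.

```java
// Hedged sketch of a two-pass, language-aware OCR flow with Apache Tika 1.x.
// Not Datashare's actual implementation, just the shape of the idea.
import org.apache.tika.Tika;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.sax.BodyContentHandler;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

public class LanguageAwareOcr {
    public static void main(String[] args) throws Exception {
        Path doc = Paths.get(args[0]);

        // First pass: plain Tika extraction, no OCR hint yet.
        String firstPass = new Tika().parseToString(doc.toFile());

        // Detect the language of whatever text came out (an ISO 639-1 code such as "fr").
        String lang = new LanguageIdentifier(firstPass).getLanguage();

        // Map the ISO code to a Tesseract model name; only a handful of examples here.
        Map<String, String> tesseractCodes = Map.of("en", "eng", "fr", "fra", "de", "deu", "es", "spa");

        // Second pass: tell Tesseract which language model to use for OCR.
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setLanguage(tesseractCodes.getOrDefault(lang, "eng"));
        ParseContext context = new ParseContext();
        context.set(TesseractOCRConfig.class, ocrConfig);

        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(doc)) {
            new AutoDetectParser().parse(in, handler, new Metadata(), context);
        }
        System.out.println(handler.toString().length() + " characters after the OCR-aware pass");
    }
}
```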
Then the question could also be: do you have a feature to let journalists declare that they suspect the named entity recognition algorithm, or the OCR, failed? That is something we are going to develop, along with consolidating the named entities, for example knowing that Donald Trump is the same as D. Trump; something like OpenRefine could help us with that, for example.

What kind of hardware do you actually need to do this whole process? Like a desktop computer? So what we are trying to do is what I call total scalability. Because we have partners in countries where there are not a lot of resources, in Africa for example, we made a version with an embedded Elasticsearch that can run on a Raspberry Pi, but at the same time we want to be able to run, and we are running, the extraction of big leaks like the Panama Papers on cloud computing, on up to 30 nodes at the same time.

Sorry, once I have imported the whole corpus, like here, can I extract the network of co-occurrences of named entities from the documents? So the question was: from the documents, can I extract, how do you say that, co-citations, a network of co-occurrences of named entities? The simple answer is no. The longer answer is that we plan, first, to be able to export named entities as CSV. That would be the first step, and after that, you can do your own tricks.
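As a hint of what "your own tricks" could look like: once named entities can be exported as CSV, a simple document-level co-occurrence count is only a few lines. A sketch, assuming a hypothetical two-column document,entity export:

```java
// Hedged sketch: from a hypothetical "document,entity" CSV export, count how often two
// named entities appear in the same document. Not a Datashare feature, just the kind of
// post-processing a journalist or researcher could run on an export.
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

public class CoOccurrences {
    public static void main(String[] args) throws Exception {
        // Group entity mentions by document.
        Map<String, Set<String>> entitiesByDoc = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            String[] cols = line.split(",", 2);          // [document, entity]
            if (cols.length < 2) continue;
            entitiesByDoc.computeIfAbsent(cols[0], k -> new TreeSet<>()).add(cols[1]);
        }

        // Count every pair of entities that share at least one document.
        Map<String, Integer> pairCounts = new TreeMap<>();
        for (Set<String> entities : entitiesByDoc.values()) {
            List<String> list = new ArrayList<>(entities);
            for (int i = 0; i < list.size(); i++) {
                for (int j = i + 1; j < list.size(); j++) {
                    pairCounts.merge(list.get(i) + " | " + list.get(j), 1, Integer::sum);
                }
            }
        }
        pairCounts.forEach((pair, count) -> System.out.println(count + "\t" + pair));
    }
}
```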