Can you start? One, two, one, two. Okay. So, hi everyone. I'm going to introduce you to a tool for researchers to free their papers. But first of all, what is research exactly? The systematic investigation into and study of materials and sources in order to establish facts and reach new conclusions. This is kind of confusing. So, what exactly is a paper? Who was at the keynote on gravitational waves? Okay, so we could say that what was discovered that day was a real breakthrough. And naturally, there was a paper published for it. We can all take a look at the full text here, which is filled with scientific details and mathematical formulas.

But what do we do with papers? Well, researchers read them to keep up with what's going on in their field. Or people write their thesis with them. We could also, as developers, use them to build software, for instance machine learning systems or databases. But there is a catch. Research is financed with public money and published through companies like Elsevier or organizations like the ACM or the IEEE. And these publishers decided to keep the papers behind a paywall, so you have to pay for them. This is kind of closed source, and it creates problems. Students can access those papers because their school pays for a subscription, but we would like to have open access. Open access is really important, because people who are not students cannot access the papers and have to pay outrageous amounts of money for only ten pages, which is not really fair.

So, it's time for a game, and the students in this room cannot play it, unfortunately. Publishers edit journals, and those journals contain the papers we want to access. In order to access the content, we have to pay a subscription. The most famous journal is Nature. In your opinion, how much do we pay per year for a subscription to just one journal? Does anyone have a guess? 50,000? No, too much. Sorry? Too low.
So, we pay over $10,000 per year, and some journals peak at over $25,000 per year. This is insane. But what is really interesting is how much profit they make. According to published figures, Elsevier earned a 31.7% profit margin. Let's compare that to a big company: what was Google's approximate profit margin in 2008? Any ideas? Well, 30.6%, which is less than the publisher's.

Okay, so let's summarize all of this. Researchers keep up with the state of the art in their field. They get new insights and ideas by reading these papers and understanding them. They write these ideas down in a new paper, and then they submit it to a publisher. There is a system called peer review, which tries to filter out which content can be considered scientifically correct and a genuine breakthrough. Finally, if the paper is accepted, science has progressed. But in fact, there is a third party we often forget, which is the publisher, and the publisher gets money at each step of this process.

But there is a problem. Why does it matter that publishers gain so much money from this? Because subscriptions are becoming more and more expensive; as you've seen, the prices are really high. And the fact that we give our research away for free, and peer-review it for free, makes them even more profitable. So why shouldn't students from developing countries have access to these crucial papers, which are behind a paywall? Because they don't have the money to pay for them? And why can't people who are not students, maybe developers or people from any other field, access these papers? Because they have no university to pay for the subscription? Well, this could be you: if you are a developer and you cannot access papers, you could run into a situation where you need them but they're not available. So, what can we do to improve open access?
Well, we created Dissemin, a tool to give control back to researchers, and we would like to promote a global open access policy with it. How do we do that? We fetch your papers from different sources, CORE, BASE, all kinds of repositories, and we check the policies on these papers. Can we open them? Can't we? Can we open only the preprint version? And then we tell you what you can deposit legally. So, what does it look like? This is a page where you can deposit a paper. You can see what you can and cannot deposit, and it's pretty simple for a researcher to actually open their papers. When it's done, your paper is free and accessible to everyone.

This is great, but to give you more insight: who is behind Dissemin? Dissemin was an initiative of a group of students from the École normale supérieure in France. We are a non-profit organization participating in many open-access-related projects. We worked with Wikipedia on an open access bot. We will be at OpenCon.

So, maybe you are telling yourself: but this is a Python talk, where is Python in this story? Well, Dissemin is of course written in Python using the Django framework, and we're using PostgreSQL to store papers and their metadata. We encountered some challenges I wanted to share with you. First of all, the papers: we have metadata for more than 15 million papers, and we are still getting more and more metadata from new sources. But we have a problem: how do you fit this amount of data in your store? Well, we kept PostgreSQL and used its powerful JSON field. Implementing JSON in a model is just a matter of importing the field and using it in your model, which is really awesome. But what is more interesting is how PostgreSQL handles it: you get indexing for free, which works on sub-fields.
This is super efficient and can be your NoSQL world for a while if you don't want to bother with a separate store. You can avoid very complex joins, and you can access sub-fields in queries directly without having to fetch the whole record. So, the JSON field in PostgreSQL is really a good solution when you don't want to deploy a NoSQL store.

The second challenge is search: we need it, and it has to be fast, so that a researcher can get feedback really quickly. We tried to keep PostgreSQL for this kind of task and looked into its built-in search engine, but it was not sufficient for the amount of data we had. So, enter Haystack. Haystack is a Python library for Django that provides awesome search tools. First of all, multiple backends: we can use Elasticsearch, Solr, Whoosh, even Xapian. We can, for example, configure a master Elasticsearch with a backup Solr, which is really cool. We have faceting, and we have real-time indexing, which is important when we're getting new papers: we want them to be indexed right away. And we're still working to make this faster, because it's really hard to maintain all this metadata coming from many sources.

The third challenge is the hardest one in my opinion: how to deduplicate papers, because we have so many sources which provide slight variations in the metadata, so we need a solution for that. We tried a fingerprinting technique. We have a function which takes a paper and reduces it to its minimal form: remove the diacritics, lowercase everything, sort the authors list, simplify the title, and finally compute a hash on it and store it. Then, if a paper with the same fingerprint is already in our database, we can just merge the papers and accumulate more and more metadata on the paper. This technique works more or less, but we still have cases where we don't have the title, because some sources don't require you to enter a title for a paper, which is absurd. Anyway.
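The fingerprinting steps above can be sketched like this (a simplified illustration, not Dissemin's exact algorithm; the function name and the choice of SHA-256 are assumptions):

```python
import hashlib
import re
import unicodedata


def paper_fingerprint(title, authors):
    """Reduce a paper's metadata to a minimal, source-independent form
    and hash it. Illustrative sketch, not Dissemin's exact algorithm."""

    def simplify(text):
        # Remove diacritics: decompose accented characters, then drop
        # the combining marks.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in text if not unicodedata.combining(c))
        # Lowercase and keep only alphanumeric words.
        return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

    # Sort the author list so the source's ordering doesn't matter.
    canonical = simplify(title) + "/" + "/".join(
        sorted(simplify(a) for a in authors)
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two sources describing the same paper with slight variations
# collapse to the same fingerprint, so the records can be merged:
a = paper_fingerprint("Observation of Gravitational Waves",
                      ["Abbott, B. P.", "Abbott, R."])
b = paper_fingerprint("observation of gravitational waves",
                      ["ABBOTT, R.", "Abbott, B. P."])
assert a == b
```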
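The multi-backend Haystack setup mentioned above might look roughly like this as a Django settings fragment (connection names, URLs and index names are made up for the example, not Dissemin's actual configuration):

```python
# settings.py fragment (sketch): Haystack with a primary Elasticsearch
# backend and a backup Solr backend.
HAYSTACK_CONNECTIONS = {
    "default": {
        "ENGINE": "haystack.backends.elasticsearch_backend."
                  "ElasticsearchSearchEngine",
        "URL": "http://127.0.0.1:9200/",
        "INDEX_NAME": "papers",
    },
    "backup": {
        "ENGINE": "haystack.backends.solr_backend.SolrEngine",
        "URL": "http://127.0.0.1:8983/solr/papers",
    },
}

# Real-time indexing: index new papers as soon as they are saved,
# instead of waiting for a periodic update_index run.
HAYSTACK_SIGNAL_PROCESSOR = "haystack.signals.RealtimeSignalProcessor"
```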
To close on the challenges: we have many more, around machine learning to disambiguate authors' names, paper title cleaning from LaTeX markup, and infrastructure scripts: we already have Vagrant for development, but we would like to use Ansible or something similar to push to production in a more efficient way. We would like more deposit interfaces and sources, to support more universities and more use cases and to help repositories get filled. Our issue tracker is full of interesting issues, and we need your help.

To close also on open access: we have many projects around Dissemin, like a proxy for Digital Object Identifiers, an open access bot for Wikipedia, a crawler for repositories, and an implementation of the OAI-PMH protocol, which is a protocol to fetch papers in an efficient way. And a bit of inspiration, from another lightning talk given by Lassie from the co-op team: I want you to do something at the end of this talk. If you are a developer interested in open source, clone Dissemin, run it using Vagrant or anything else, try it out and deposit fake papers for fun, take an issue and submit a pull request if you can, and if anything goes wrong, blame us. And if you're a researcher interested in open access, you could talk about Dissemin to all of your peers, you could persuade them to open their papers, because this is really important, and most of all, you should open your own papers if you have them; and if anything goes wrong, complain to us as well. So, thank you for listening, and thank you, EuroPython, it was really a great conference. If you have anything, you can contact us. Thank you very much. Do you have any questions?

Hi. So, I'm interested in how you are funding yourselves, because I would imagine that going against companies like Elsevier, Springer and so on isn't a trivial task, especially for a couple of students.

So, like we said, we are a non-profit organization. We receive some donations, but we don't have many costs, so we don't need a lot of funding.
We're going to get some funding from repositories in France, from HAL, which is Archives Ouvertes. But to be honest, we don't really need that much funding, so we don't have a problem going against Elsevier or Springer. Does that answer your question?

Did you think of storing the papers in an already open storage, like arXiv or HAL in France? Why did you choose to store them in your own database?

So, as far as I know, arXiv is only a store for preprints, right? Not for... go ahead. Yes, so we offer the possibility to store the preprint, postprint and published versions. The thing is also, I don't know if arXiv and those kinds of repositories support the way we fetch the policies, so that we can tell you what you can deposit or not. We are also trying to get papers from arXiv and other open repositories, so we are trying to unify all repositories. And we are not storing anything ourselves: we just use Zenodo, which is an open repository financed by CERN. So we think we are different.

Hi, me again. I would be interested in your faculty's position on that, because it seems that in the research business there are multiple problems scientists are complaining about, and that's one of them. So I'm more interested in: are you iterating with your faculty, or do you set your goals and priorities by yourselves inside the student group?

So, your question is about how we prioritize what we do? Whether we are working with our faculty, with professors, assistants and so on, or whether this is just a student project? So, I'm not really a student, but half of the contributors in the Dissemin group are students. We have, I think, a professor now. The prioritization is based on what the use cases of universities are, what they need to make their repositories better, and also how we can promote open access in a more efficient way.
You could say this is my student project in the sense that I work on it because I find it really interesting, but for other people it's really important and crucial, and I understand that. So, we try to prioritize some issues in our issue tracker, but if you have a really big use case and you really need something, just send us an email and we will talk with you to see how we can make it happen.

Hi, great talk. I would like to know more about how you handle the deduplication: do you use only the title for the fingerprint?

No, no. For the deduplication problem we are using a lot more data. I don't have the algorithm in my head, but I'm pretty sure we're using the title and the author list, which we sort so that the sorting is deterministic. We try to strip out any data which could be removed by one repository but kept by another. To be honest, since this is open source, I can only suggest you take a look at it, and I can point you to the file afterwards.

Why did you decide to build something new rather than use an existing open access tool such as DSpace, EPrints, or Fedora Commons?

Rather than using what? Sorry. Rather than using an existing tool like DSpace, which has been around for about a decade. So, I'm not really aware of what DSpace does exactly, but someone already asked themselves that question, and I think we found a lot of flaws in DSpace which we didn't really want to keep. It's the same reason as whenever you create a new tool: because the other one is not sufficient at some point. We think. Any other questions?

Let's say I'm a researcher and I deposit the final version of one of my papers, but I don't have the right to do it. Are you taking a legal risk because you are hosting it, or not?

Well, our terms and conditions specify that you must have the right to deposit the paper on the website.

Yeah, but let's say I don't care about that limitation and I just do it, because I think it's the right thing to do.
So, I don't know if this has really happened before, but if it did, we wouldn't be able to detect it automatically or find it ourselves. So, at some point we would get an email from someone telling us we're hosting a paper which should not be open, or maybe a cease-and-desist letter or something like that, and we would remove the paper. Unfortunately. Any other questions? So, thank you Ryan for your talk. Thank you very much.