So let's move on. The next speaker is a professor at our own university, Lennart Martens, and he's really into open science. He even started an early data repository some years ago, PRIDE. I won't say much more about it, but he's an ideal person to talk to you about this, so let's give the floor to you.
Thank you very much. So I was asked to talk about the benefits of open science to you today, and to me, actually, that's an almost silly question, because the way I've always thought about science is that it's like this, right? It's like these castell builders in Catalonia. We are really trying to make things work by helping each other. We build upon things; nobody starts from scratch. That would be wasteful, and that would be extremely inefficient. So we always use other people's work as a starting point, right? It could be a starting point for criticism. It could be a starting point for building upon it. It could be a starting point for inspiration. It could be many things, but we always build on each other. So why on earth would you sit on your stuff, right? It just doesn't make any sense. Still, I will try to convince you that we should share our work, and we'll talk about what this entails. We already had a few questions around that: it's not easy to make sure that people can share stuff. And then the main message is, what can we potentially do once we start sharing our work? That's something that I've been active in. I've been building a repository for data ever since 2003, so I was an early adopter in this kind of thing. And that has now become extremely successful, so I'm very happy about that. It got very successful after I left, so that's maybe something I shouldn't be too happy about. I'll also try to talk a little bit about something that many people don't think about when they think about open science. Let's see if I can convince you of that. So why should we share our work?
So usually people start sharing, or they start the debate about sharing data, based on a negative feeling. I'm in the life sciences, and the life sciences very often come out poorly because the things that they say don't work, and they have these high-impact publications that don't make it to clinical drugs and things like that. But the bottom line is that this negative attitude is getting us nowhere. If you say that we need to have open science so we can prevent some cases of fraud, which unfortunately we all know happens across the board in all the sciences, or to get more low-quality work stopped from publication, that doesn't work very well. It's the same as peer review. If you expect two or three people who, in between all of the other stuff that they have to do, catch all the problems in a particular publication, it will not happen. So this is the same. You can put all the data in the public domain. There is so much data in the public domain in certain fields that it's impossible to scrutinize everything. So that is not a fail-safe mechanism. It works a little bit. It prevents some of it. It makes it a little bit harder to commit fraud, but the fraudsters will still try. On the other hand, we do know about many cases of fraud, and in very few of these was the fraud exposed through open anything: open data, open publication. Usually the data is not known. It's just the stuff that's in the paper. The paper tends to be in a closed-access journal, and none of the code or metadata that was used in the processing is actually readily available. So people pick it up based on other clues, based on other things. They're very well-consent with the disease. So this, I think, while it works a little bit, is not a big thing. The big thing is that we should focus on the benefits. We saw very clearly in the previous presentation that the EC also has this vision that's focused on what can happen based on this.
So I've got all the usual stuff and two special things that you may not have thought about. First and foremost, it makes science accessible to anyone. Note that I say anyone. This doesn't stop at researchers. This is literally anyone. If you want to know the human genome today, everybody can open their browser and go and look at the human genome. It's open to anyone. We are not elitists. We're not saying this is a closed system for scientists. This is for everyone. Second thing is you can build much more efficiently on other people's work. You can literally use the data of somebody else. You can literally take the results from somebody else and go forward with that. It maximizes the usefulness of every euro that has been invested in our research. Because if you use the money, do something, and then the result sits somewhere on a hard drive in your closet, that's the end point of that euro. But if you put that in the public domain, everybody can start using it for whatever purpose. So that euro actually stretches a lot further. And I'm sure this is something that is also very well understood by the EU. The next one is an important one. It's a bit controversial. But if I've learned anything in science, it's that data tends to outlive interpretation. This means that when you do research, and I'll make a bold statement here, your data is much more important than the paper you write about it. Which is the exact opposite of the usual view, where we say the data should support the paper. No, the paper is just what you get out of the data. It's an extremely limited view of what you have done. Right? I'm not saying it's a bad view, but it's limited because you have a particular question in mind. You ask it of the data, you publish it. And there's a lot of data out there today that can do a lot more than just answer that one question, or those few questions, that you had. That is an important thing.
Plus, the methods that you have available to yourself today could evolve very dramatically over the next few years. Which means that the interpretation you make today might look naive in five years' time, because you will have more advanced methods to look at the same data. And the problem is, a lot of people don't want to share data because they feel this, and they say: I don't want to share the data because then somebody else can go and mine all the gold I left behind in my data. The interesting thing, from my experience, at least in my domain in the life sciences, is that no one ever does that. They're all too busy generating new data. They never go back to the old data. Nobody ever does that. People tend to have an intuitive feeling for this, but it's always a bit harsh when somebody tells you that your data is more important than the beautiful paper you wrote about it. Finally, and I think this is the key thing, this is my big message, if you remember anything, remember this: open science actually fosters creativity in an unprecedented way. It opens up so many possibilities and so many opportunities that the research that comes out of these kinds of things tends to be revolutionary, because rather than do what everybody else does, take a little bit of data and do something with it, now you have all of this data and now you can do crazy stuff. You can do things that you would never be able to do on your own. That's where the real promise of this stuff lies. Unfortunately, I cannot predict what these crazy things will be, although I've done a few of them. I'll show you a few examples, very limited ones, of crazy stuff with all of this open access data. If you're a young researcher at this moment in time, I think the best you can do is figure out ways to make use of this open science, because it will happen. You see that the NIH, the EC, everybody's really pushing for this; the Wellcome Trust has been doing this for a long time.
The biggest funders in the world are really pushing this; it's just a matter of time before it's everywhere. You cannot even think about fighting this anymore, there's no point. What you should really do is start thinking creatively and say: what can we do, how can we maximize this? You can build an amazing career on that. So what does this entail? It's a very brief bit, I don't want to go into too much of the details. The problem is that this is work, and sharing data is not something you do as an aside. The EC also understands this, because they now force a data management plan on you if you're in this pilot. So I'm in a few EU projects, and obviously I'm the prime candidate for writing this plan because I'm supposed to know how this works. So I'm actually doing this for a few projects. Now, having said that, we did not voluntarily enlist in the open data pilot. So we're not among the 11% that did, but we did volunteer to take on all the other requirements. We did promise a data management plan, and we will make everything available. So it could very well be that we ask the EC at some point to be included in the open data pilot, but we are researchers, we are careful. We know how this goes. If we feel that we're on the right track, then definitely, but we will try to do everything without promising it in advance. So that's the way we work. When you submit data anywhere, you have to include metadata. This was also on the slides of the previous speaker. This is very, very important. I'll show you some examples of how hard it is to get metadata. Second, when you write code, it should be understandable and it should be documented; I have some examples of what this means, especially in the last part. This is tricky. People who write code to process large amounts of data tend not to want to document it. I have all of these people in my group, and it's really tough. It's extra work, but it's extremely important.
If you want people to actually make use of what you do, you have to provide context, you have to provide additional information. You should provide all your protocols, clearly and in full. If somebody else at some point wants to reinterpret the data and this is missing, then the value of the data is diminished quite dramatically, because they only have a very partial idea of what the data means. If you write beautiful papers, you should always link your interpretations to your data in the public domain, so that when you say this claim is based on that data, that is obvious and can be verified. This is extremely important, not because everything you do is wrong, but because you may have made a mistake. Everybody makes mistakes. A few Nobel prizes have been awarded to people who actually went after some of these mistakes. The most famous one is Richard Feynman, who went after a single mistake in a paper that everybody in the field just believed, but he dug deep enough and then he developed quantum electrodynamics as a result. So this kind of full transparency is extremely important, also because it helps other people understand which of your claims, which of your interpretations, are valid and how valid they are. So it makes for a stronger foundation for the future of science. And then finally, and this is very often forgotten, you actually have to think about licenses. You have to give your data, your code, your papers a license. You have to learn a little bit about this. I don't think we live in a world anymore where you can safely say: I have no clue about Creative Commons licenses. If you write code, you cannot say: I don't know the difference between a GNU GPL and an Apache license. You cannot live like that anymore. So you really have to know. Fortunately, it takes five minutes of your time to figure this out, but you should figure it out.
What we do in practice, and this is for one of the EU projects that I'm running now, which is to build an open data exchange ecosystem for some migration data, which is something you don't want to think about, is that you follow a certain pattern. You have these minimum reporting requirements that tell people how to describe the data. The minimum reporting requirements are very important, and they could be implemented in the form of a materials and methods section in a paper. It's how you did stuff. You just describe it; well, this is metadata. Controlled vocabularies are ways to make sure that everybody talks the same language. If everybody used their native language to write the materials and methods section, it would get very, very difficult to interpret other people's protocols. This is the same, only we standardize the words in the English language that you choose to use to describe certain things. And then finally you need your data and your metadata to be in formats that can be easily read by computers. So they have to be standardized. They all have to look the same, so that not everybody has their own little type of Excel sheet or, worse yet, PDF document. I used to have this slide where, in my field, all the supplementary tables with all the proteins that people identified, and there were hundreds, were in all these different PDF documents. I'm just showing you all these PDF documents. This is a nightmare. So that is not useful. And then finally this data has to go somewhere to live. It has to go to a repository of some renown and of some stability; I'll get back to that. So I've been writing about metadata. This is from when my database, PRIDE, a database I helped start, had existed for 10 years. So I made a retrospective. I looked at metadata. The only thing you want to look at is the category unknown. This is the instrument, the mass spectrometer that acquires the mass spectrometry data. It's the most fundamental piece of metadata.
It's which thing you actually put your sample into and the data came out of. That's it. That's what it is. Unknown. Look at this. As the system becomes more popular, more and more unknown data. And then suddenly, boom, turning point. You see that? What is the turning point? There is now a curator who goes through every submitted data set that does not say which instrument it is. So why is it declining gradually like this? Because the data gets submitted somewhere here, takes a while to get published, and this is the publication date. So some of the data was already in the pipeline. But you see how this completely eradicates the problem? And this is just asking people: give us the name of the instrument. It is literally five seconds of work. It's a graphical user interface. You only have to select it from a dropdown list. The worst thing is, when we look around this time, where there was very little control on this, it turns out that the most often selected instrument is the first one in the dropdown list. There are nine in the dropdown list. It's too much effort to scroll down. So this is really problematic. This is a really big issue. I don't understand why that is. I think people are just irresponsible, because that's the only excuse I can think of. It's exactly the same when we talk about tissue. This is data that comes from humans or animals, so very often it has a tissue origin. Here you can see how much data comes in, and this is those with annotation, those without annotation. And you see, as the data grows, unknown tissue grows until the curator starts. The curator is called Attila. I have now called this the Attila effect. So the Attila effect kicks in and you see this drop. But isn't this amazing? These are two clicks, one, two, in a graphical user interface. It's even auto-completing, like Google. So you start typing here, for instance, liver. By the time you've typed the first few letters it's going to give you liver as an option and you just click it. And it's too much effort.
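To make the curator's check concrete, here is a minimal sketch in Python of what refusing "unknown" at submission time could look like. Everything in it is an assumption for illustration: the field names, the vocabulary entries, and the function names `check_metadata` and `suggest` are hypothetical and are not taken from any real repository's submission tool.

```python
# Illustrative controlled vocabularies (assumed for this sketch; a real
# repository would use curated ontologies, not a hard-coded dictionary).
CONTROLLED_TERMS = {
    "instrument": {"ltq orbitrap", "q exactive", "tripletof 5600"},
    "tissue": {"liver", "blood", "kidney", "breast", "bone marrow"},
}


def check_metadata(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, vocabulary in CONTROLLED_TERMS.items():
        value = record.get(field, "").strip().lower()
        if value in ("", "unknown"):
            # The case the talk complains about: the field was simply skipped.
            problems.append(f"{field}: missing or unknown")
        elif value not in vocabulary:
            # Free text that is not one of the agreed, standardized words.
            problems.append(f"{field}: not a controlled term")
    return problems


def suggest(field, prefix):
    """Autocomplete-style lookup: controlled terms that start with the prefix."""
    prefix = prefix.strip().lower()
    return sorted(term for term in CONTROLLED_TERMS[field] if term.startswith(prefix))
```

A submission tool built along these lines would call `check_metadata` before accepting a data set and bounce anything that returns a non-empty list, while `suggest("tissue", "liv")` mimics the autocomplete described above. The design choice is simply to make the required fields impossible to skip, rather than relying on a human curator to chase them afterwards.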
So this is very, very fundamental. This is human engineering. The weakest point in this entire thing is the human who has to do this. It's really strange. Then, sharing information effectively is really tricky. We had the question about how you fund this kind of data availability. In the end the NIH and the NCBI actually funded a big repository to compete with the one that I started. This is Peptidome. They got a really big mention in Nature Biotechnology. Very, very proud about it. And then came the financial crisis and they shut it down. So they literally shut down the repository, and this is the page you see when you go there: the page is no longer available. Thank God this is the NCBI and not an institutional website maintained by a postdoc, so they actually do provide everything as a download. Here is the FTP download. And the database that I founded, these guys, they actually stepped in. The team at the European Bioinformatics Institute rescued all the data. They migrated it all to their database and the data is still there. You can still find it. But this kind of stuff is really tricky. This needs fundamental funding at the basic level, from a whole bunch of countries, like the European Commission, or from a big country like the US, to keep these things operational. This is not cheap, and this is a decision that people need to make to fund this. Both the NIH and the EU in general fund this quite well, and so we can't really complain. But you have to keep in mind that when you are too lazy to specify your instrument, all of this beautiful money is being thrown in the garbage, essentially. It's really irresponsible. The same thing can happen to code. For code, open source is much more popular. This has been popular since the 70s, right? This is Richard Stallman's whole thing and then all the variants of that. We make all our code freely available. This thing was published in 2015 and it has been open source on the Internet since 2011.
So we even make it open before we publish it. Well before we publish it, because we want to publish only strong tools that actually work, and that takes time. So scooping and this kind of thing does not exist. This project was on Google Code. Google Code, which was hosted by Google, just stopped. So we had to migrate everything to GitHub. Fortunately Google Code made that easy. So now our page is on GitHub and the old page still links to it. But with these kinds of things you also have to take responsibility. You can't just say: I put it on Google Code and now I run away as fast as I can in the opposite direction. You have to take responsibility for the work you've produced and keep half an eye out, and if Google emails you that you might want to migrate to GitHub using these tools, you have to invest half a day to do that to keep your stuff alive. Or at least you should enable other people to access your code with administrative permissions, because otherwise nothing happens. So unfortunately we can say open data, open code is all easy, and it's not. You have to invest a bit of time. This is not an endless amount of time, but you have to be aware of it. Anyway, I've lectured enough about this. Let's have a look at what we can do with this. So this is the database; for those of you who want to know, this is where I left it. So you can see it grew quite dramatically after that. Of course it's an exponential all the way, right? It just looks flat here because of the scale. But you see this is growing pretty well, right? This thing is growing like crazy. And you see two curves. You see all the data we have. These are mass spectra, and you don't even have to know what that is. We have all of them, and these are the ones we understand, which is only a proportion of everything. So the first thing that my group did was we took this bit and we started doing crazy stuff like that. And without going into any detail, we didn't know a lot about doing crazy stuff like that.
So we were the first in the world to do this. But we really learned a lot. We saw all kinds of crazy patterns in the data, and that's essentially what I'm showing you. Crazy patterns. Which was really, really useful for us and which really helped the field as well. And it helped establish my career as a scientist. So again, if you're a young scientist, this is very promising stuff. But this was just the start of it. That's all about the data. I haven't talked about the open code yet, but that's another thing my group benefited from. These are all algorithms that are made by other people everywhere in the world, and they are free and open source. They are search engines. They are the kind of algorithms that take this data, this mass spec data, and transform it into something we understand. And all these different ones have different properties. What we did was we built a tool that joins them all together, and we can do that. We can redistribute all of these tools because they are all free and open source, and this tool, the SearchGUI tool, has now become extremely popular because it gives everybody in the world, and that includes you if you're so inclined, the ability to download this thing and use all of those tools in a very nice interface. Interestingly, it shows that open stuff allows you to have specialization within the research field. The people who built these algorithms are more mathematically inclined and smarter than I am. And so they build beautiful algorithms, but as an interface they write a command-line interface, where you type into a black box, literally a black box. Very few people in my field, the actual wet-lab researchers, can use these tools. We are specialized in making them accessible, and the combination of the two is extremely powerful. These people specialize in the hardcore algorithms; they make beautiful ones. My group, amongst others, specializes in making them available.
Working very closely with all of these people, we can make something that actually works for the field. So the standing on each other's shoulders also shows a layering in how these fields can organize. You can find a niche, where previously you had to be a person who was able to do everything. Another thing that we did was we built this other tool. It's called PeptideShaker, and it has this PRIDE Reshake button. PRIDE is the name of my database; I call it mine because I helped found this thing. When you click this button, you start to see this. All of you can do that: if you google PeptideShaker you will find it, you download it, you click that button, and you see this. This is the database, with the known metadata of the projects and any number of files. You click this button and then you get this screen where you can select your search engines, and you can reinterpret the data, and it will open up in our evaluation tool, it will apply all kinds of statistics, and it will show you the results. So now, with four or five mouse clicks, anybody in the world can take that data and re-analyze it in any way they see fit. And that is really interesting, because, I've shown you this before, think about this: there is a huge gap between all the data we have and all the data we successfully interpret. In order to close that gap we need more ways of interpreting this data, and now everybody can do that. Everybody is capable of doing interpretation. So now literally the imagination of the users is the limit. Anybody with a good idea can now test it immediately on the public data. This kind of thing is actually quite rare. We now have the easiest way of going through this cycle of data, but it does exist in other fields of the life sciences. I think it's also pretty well known in things like ecology, but there are other fields where this is completely unknown. But why would it be missing? Why would this not be standard everywhere? And in fact my students made this; this is
not something I made. My students made it. But you can see I have a successful profit. And the third point is that when you do an experiment, generate some knowledge, publish it, and put it in a database, in the repository, what can happen is that people can now short-circuit this cycle. They can skip new experiments, because there is so much data already. They can do it all in the computer. They can take the existing data, reinterpret that, get new knowledge, and you can do this very often because of this. So any field where you see this discrepancy between the data you collected and the potential uses you can put it to is a prime candidate for this kind of stuff. It's so trivial now. The way we did that was we built this tool. I'm not going to give you all the technical details, but we spent a lot of time building a tool that can automate this and run it on a cluster, so it goes automatically and we can extract new knowledge. And we've done quite a few interesting experiments. I'll just give you one, because it's funny. It's these small open reading frames. So you know the genome: you have all these genes on the genome, and they find the genes with computers, which say that from here to here there will be a gene. The problem is these very small open reading frames, so very small genes; they are missed because the computer doesn't know what to do with them. So a team at my university here made a database of all these suspected open reading frames, so suspected small genes that nobody knows whether they are real or not. And we just took all of these genes and matched them to this database of all the proteins we've ever seen in the mass spectrometer. Whenever this thing gets white or yellow it means there's a lot of them, and red means there's very few of them. And what you can see is that we find a lot of them, and they seem to be organized by tissue of origin, so in blood there are different ones than in marrow, say, or in breast. And unfortunately there's a lot that are
unknown, and you see this lack of metadata now come back. But these are the kinds of things that nobody would have deemed possible a few years ago, because in a single experiment the best you can do is find one or two of these. But because we now have tens of thousands, we can literally scan the whole human proteome and look for these kinds of things. So we're doing that over and over again and we're finding a lot of amazing stuff this way. And this is extremely useful, because it's a direct annotation of the genome. We're saying: from here to here there's an open reading frame that is 99% certain, and that is expressed, or found, mostly in blood; this one is found mostly in kidney; this one is found mostly in breast. And here some idiots have forgotten to tell us, so we don't know. But you see the repercussions of this: this decision that somebody makes to not spend 20 seconds of their life really has an effect. The message here is that once your data is in the repository, it's not the end. It's not an elephant graveyard. The data can come back from there, and you should re-analyze, reinterpret, work with that data and then do all kinds of crazy stuff with it. Show that this data can live. And this is what happened to me a few years back: a sociologist came and researched the way that my group works and the stuff that we did, which was very scary, and then she wrote a paper about it. She said a few really nice things. She said such data collections, these are the data collections we were collecting, can be mined for valuable information that could not be obtained in any other way. That's extremely insightful. If you have a continuum of data, it has much more power than an individual data set. It has much more reach, and that is something few people realize, but you should. The second thing, and this I only include because it's a bit funny: she said this is a way to reactivate sedimented data. It's a very colorful way of saying things. If you've ever been to a water treatment plant where the sediment settles out, seriously,
you don't want to be there. It's the worst place in the world. And so reactivating sedimented data, I don't know, it doesn't sound good. The data sinks down and we reactivate it; we reactivate the sludge. In that respect your data would be the sludge. And here, reactivation again. I like this word, because this really is what you do. And this is the sociologist's message: these data that we produce or collect are the product, and not the by-product, of publication. It's the other way around. The publication is the by-product of the data, and your data lives longer than you do. So I think what people should really start doing is thinking about the opportunities we create with this open science that is coming anyway. I mean, the debate about open science, I think, is past us. What should be the debate is what we can do with it. And you can start by thinking: what can I do with open science? What can I study, what can I learn, what opportunities now present themselves if all the data, let's start with your own field, is available online? Once you've gotten used to that, you can leave this restriction out. You can say: all data from anywhere is available online. What can you do when all the code in the world is online and all the algorithms are out there? What can you do when all the publications are open access? These are the kinds of questions we should ask ourselves, because most of these opportunities are not like typical research, little incremental steps. They change the way we do stuff, and they can be quite revolutionary. So I'm really looking forward to the next few years. Finally, I'll end with this nice picture. I never understood this, right? Dragons always get lots of gold and then they go and sleep on it. There is no point. Why on earth bother getting the gold? So you don't want to be like this guy, but you want to be like this guy, who tries to do something original with that gold. And the metaphor actually is imperfect, and you know, the only thing that's imperfect about the metaphor is that this
pile of gold is finite, but the data is not. When I use the data, I do not use it up. There is still data for you to use and for anybody else to use. So data is in a way infinite. It's an infinite treasure, and we don't need dragons hoarding it. Thank you.
One of the real opportunities and challenges is to close the gap between how much data we have and how much of it we interpret. I'd like to come back to what you said about standardization. Am I correct to assume that you work with CEN-CENELEC, the European standardization office?
Actually, we try to do grassroots community standards first. So we try to ask the scientists in the field what they want for their standards, and we build it from the community. We don't start from a regulatory body downwards; we start from the grassroots up. It's a different process.
Okay. And is there any work on that? Because if you can communicate to every researcher that standards are going to be published, or are being made, for open data and for publication, that will help them close the gap between interpretation and the data we have. Is there any work on this? Is it necessary? Is it published, is it accessible? What is the form? If you look at the standardization office, not regulation but the standardization office, you've got European standards, you've got technical reports, you've got technical specifications. Is it something similar?
So the process that we set up is worldwide, because it's a community, and we just use the sub migration community across the world. And then everything is done in the open, online, so everybody can see everything. Everybody can participate in the meetings, which are free to participate in, everybody can register to the mailing lists, everybody can look at all the standards. And we have two cycles of peer review for each standard that comes out. There's invited peer review, like in a classical publishing model, but there's also a period of 60 days where the standards are
online and anybody can comment on the standards. So we get a lot of links to dubious websites that are generated by bots, but we also get reasonable comments from people, and that can be anyone. And all the standards are licensed under very aggressive open licenses, of course, and they're all in the public domain all the time. That's fundamental, of course. And the whole process is open to anyone interested, which makes it slow, and I'll admit that, because when people come in after two years of work and ask the same question that was asked two years ago and resolved, we have to re-explain. But fair enough, it's a slow way but it's a good way. The nice thing is that because of the grassroots approach, a lot of the people who will be working with the standards will have invested in the standards and will adopt them. And we don't have to do it the way where somebody comes in and says: now you have to follow this standard, which with researchers is extremely difficult to get.
Other questions? Otherwise I have one more. Because you mentioned you already made data management plans and work with them, can you say a little bit about your experience with data management plans, how much effort it takes, and what we can do as an engineer, for example, to help people like you?
So the data management plan: I haven't yet made it, and I'm in the process of making one, so it's very early days. However, I've taken many notes about this, I've read a lot of the documentation, and I came across the website that exists in the UK to make this data management plan. And while it looks deceptively simple, with the four key questions, as soon as I started talking to my project partners about this at the kickoff meeting, which we actually had last month, a lot of issues came up. Really the main annoying thing is that, because I'm in the life sciences, we work with patients, and so a lot of the data is patient derived, and then you open up a Pandora's box of problems, because every nation has different rules, and there are many nations of course in these EU projects, and
the privacy rules are different and the ethics rules are different. So the bottom line is that we probably have to say that, at least for now, all the patient data will never be public and will never even leave the hospital, so people wanting to analyze it have to go to the hospital. And this is really annoying. When it comes to all the other stuff, like the standards and the real research data that comes from cell lines or something, where there are no such issues, then it's actually really simple, because we are lucky: we have a field where standards are known and repositories are known. So we will say we will use this standard and this repository, we will not create a new repository, and this is trivial. I can write that in half an hour. The problem is really when we get to the data that is under some sort of protection by other rules, and especially this patient-derived data, which is extremely difficult to work with because the rules are very complex. Now, of course, Caroline can correct me if I'm wrong, but the whole purpose of having a data management plan is to think about these things and write down what you think about them. And of course not everything can be open, that's only logical. It's a question of thinking about it and writing it down. Is the data management plan also an incentive to do that?
Yes, it certainly is, but not so much for me, because for me it's a spinal cord reflex by now. But I've noticed that when I discussed it with my project partners, they really started thinking about this, and I know for sure that a lot of the partners will now share more data than they otherwise would, because we will have the data management plan and because I will push for it. Not because they are unwilling to share more data, but because it's an additional bit of effort, and this will help us. So I think it's actually very good that there is a data management plan, and I know from the UK funders that they've had this for a while now, where people actually commit to something. The only problem is that it's sometimes hard to enforce this after the money has been spent. The good thing with the EC is that it has to be there at six months, which is very early on, and then you have the midterm evaluation if you don't follow up on this at least a little bit. Now, it's a pilot, so I'm sure they will not really pursue scorched-earth politics on this for the first round, but it makes sense, and it's a very good step. And I think the main issue I have with this patient data stuff is not so much with the data management plan. The problem I have is that I'm frustrated. I would love to share some of that data in certain ways, but right now the knee-jerk reaction in most cases involving patients is that anything that smells like it could potentially give any problem at any point in the future has to be kept in a safe behind 20 pages, in a time when people post everything on Facebook. But that's a completely different thing. Thank you for sharing this with us. Thank you again. Let's have a coffee break until, well, it's now until 5.11. OK, see you back in 30 minutes.