It starts from raw data, which lives in files, and maybe not even in files; it's on hard disks in your drawer. It may be in repositories, but typically not. Some of the data is lucky enough to have been processed and somewhat cleaned up. It ends up in collections, and where are the collections? In your drawers, too. Some are in public or private repositories, and if they're really lucky, in databases. Getting closer to publications, there are data products that people create by taking some of the data and processing it through one or more workflows and analyses. These show up in public repositories, but also in databases, on loosely linked websites, and sometimes as supplements to a publication.

On top of that, of course, are the publications themselves, most of which have no data associated with them. When they do, it can be horrendous. I've seen a publication with a 146-page supplement containing tables in PDF. What can you do with that? Nothing. You have to go back and redo the whole exercise of converting it into a form we can process.

Other things happen to the data, too. Some people have developed processes for creating fact collections, if you will. An example is DrugBank: they have information about drugs, their targets, and what each drug does, compiled from various sources and made available as a website and as an internal database. Other people have taken the same kind of information, generalized it, and formally represented it as knowledge bases or ontologies. By trying to make them formal, they have uncovered some very interesting problems with the data, especially with definitions of things. Maryann could tell you horror stories about putting anatomical and subcellular terms into an ontology, and about definitions that are not always clear. When we point to a piece of data and say this is about X, well, what exactly is X? If there is a community debate about that, we have a problem. But the fact that people have tried that exercise also makes these ontologies valuable sources of data.

Two other things have happened. People have started annotating data and publications: externally relating one piece of data to another, or a publication to a piece of data. Tim's group has a tool called Domeo whose goal is that you, the reader of a publication, can take it and associate it with other things. You can say, this is an antibody, and that's something we do in NIF using Domeo and other tools. People are also beginning to cross-link publications and data; there is a mechanism called a link-out which relates a publication to all the data that might relate to it. These auxiliary things are not really data per se, but they live in the space of data sharing.

Finally, many sources these days act as aggregators: they take other people's data and build a huge catalog, if you will, pointing to various things. In the business of data sharing today, we have all of these elements. So we have a lot of duplication, and a lot of things represented one way in one place and another way in another place.
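To make the annotation and cross-linking idea concrete, here is a minimal sketch, my own construction rather than anything Domeo or the link-out mechanism actually implements, of external links represented as subject-predicate-object records. Every identifier and predicate name is invented for illustration.

```python
# Hypothetical sketch of external annotation and cross-linking.
# All identifiers and predicate names are invented for illustration.

from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    subject: str    # e.g. a publication DOI
    predicate: str  # the relationship being asserted
    obj: str        # e.g. a dataset accession or reagent record

annotations = [
    # "this paper uses this antibody": the kind of annotation a reader adds
    Link("doi:10.0000/example-paper", "uses_reagent", "antibody:AB-0001"),
    # "this paper derives from this dataset": a link-out style cross-link
    Link("doi:10.0000/example-paper", "derives_from", "dataset:DS-042"),
]

def links_for(resource: str):
    """Return every link that mentions a resource, in either role."""
    return [a for a in annotations if resource in (a.subject, a.obj)]

print(links_for("dataset:DS-042"))
```

The point of keeping such links external is that anyone, not just the original author, can assert a relationship between a publication and a piece of data.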
But in this world of data sharing today, this is what we have. Now, the thing about the pyramid is that the top of it is more shared than the bottom. The utility of the top, however, is different from the utility of the bottom. If you're trying to re-analyze data, you really want the numbers: if you are a bioinformatics kind of person, or you're just trying to verify something, you want the actual data. If you're trying to discover new things, if you're a new postdoc finding your way, you want the summarized data: the collections, the aggregates, the papers, and annotations of the papers. So a data-sharing enterprise needs to cater to all kinds of needs, the needs of the informatician or the analyst as well as the needs of people who just want to get their hands on something before they delve deep.

The interesting thing here is, do you know how many repositories of life-science data there are? NIF knows about 761 so far. I didn't even know there were that many, and they certainly do not cross-reference each other. That's just a factoid. The point I want to make is that if you look at all the data that's out there, its parameters are very unevenly distributed. How much data is there? The volume varies enormously across sources. The data velocity, meaning the rate at which data comes in and goes out, varies a lot. So does the nature of the data: it could be binary because it comes straight from a machine, or highly processed XML, or RDF because it has been worked on enough that people have formalized it. There is similar variation in location and availability.

Now, let's go back to what Cameron said. Here is the paper he referred to. The paper says it's a pity that we somehow assume scientists have done the right thing and drawn the right conclusions from the data, because that's far from reality: we could not reproduce anything. Well, that's what Cameron already said. If you actually read the paper, it says a few other things, and these are almost quoted from the paper. First, a paper must report all the data sets that are relevant to it. All, not a selection; everything. Second, there must be an opportunity to report data that didn't work: negative data, at whatever level of the pyramid, has to be reported. Third, there should be cross-linking. If one paper refutes another, the refuting paper should point back to the data where it found the problem; and likewise, if you support something, you should point to the supporting data. There has to be complete cross-linking so that we know what comes from where.

That is a very reasonable thing to say, because we want reproducible, reliable science. What goes along with it is that we need to get at the data not five years after the initial experiments, but as soon as possible. The more people see it and re-analyze it, the more errors get found, if there are any; people make mistakes, and making a mistake is not the problem.
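Those variability parameters are exactly the kind of metadata a catalog has to record. Here is a hedged sketch, with invented field names rather than NIF's actual schema, of what a catalog record capturing volume, velocity, representation, location, and availability might look like.

```python
# A minimal sketch (my construction, not NIF's schema) of the metadata a
# repository catalog might keep to capture the variability described above.

from dataclasses import dataclass
from enum import Enum

class Representation(Enum):
    BINARY = "binary"  # straight off the instrument
    XML = "xml"        # processed and structured
    RDF = "rdf"        # formalized, knowledge-level

@dataclass
class DatasetRecord:
    name: str
    volume_gb: float            # how much data there is
    velocity_gb_per_day: float  # rate at which data comes in and goes out
    representation: Representation
    location: str               # repository URL, or "private"
    publicly_available: bool

catalog = [
    DatasetRecord("imaging-run-7", 2000.0, 50.0, Representation.BINARY,
                  "private", False),
    DatasetRecord("curated-targets", 0.5, 0.01, Representation.RDF,
                  "https://example.org/repo", True),
]

# A "landscape" query: what do we actually hold, and how much is shared?
shared = sum(1 for r in catalog if r.publicly_available)
print(f"{shared}/{len(catalog)} records publicly available")
```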
But if you make it available, it's you and other people who will find the mistakes, and that's better for science. Now, of course, there is this, from a recent report, on why data should not be shared. I put it up there for fun. I don't believe it, of course, but I have a suspicion that many do. And the fear of losing intellectual property, the fear that if I cannot publish the hell out of my data then I lose out and other people benefit, is only one problem.

There are also practical problems. Here is one. This is data we got from a group when we asked for their raw data. We didn't know what to do with it, and they were not telling us. Now imagine the scientist who produced this data after a lot of work. If that person were asked to clean up the data and hand it to me in a proper form, it would be a ton of work, and it's not simple. And when this happens, guess what happens to their IT guy? This. This is an actual email, which says: please, please clean up the hard disk, because I'm running out of space; you are not cleaning your data, I have only so much space, and I don't know how I will run your next analysis. The problem I want to point to is that cleaning the data, structuring the data, annotating the data, making it available in a proper form: none of this is simple. It's a lot of work, not only for the scientist but for an entire team doing the various parts of it.

Now let's go back to the paper. What the paper is essentially saying to the scientist is the following. If you do an experiment, store the data, regardless of whether it leads to positive or negative results. And remember the plight of the guy who wrote that email: you may not have the storage, so there needs to be enough shared space where voluminous data can be stored. Annotate the data at least to the extent that we know what it is about, unlike the example I showed. If you are analyzing, tell us exactly what you did: not only the results of your analysis, but the process of your analysis. If you have an analyzed result, tell us what its provenance was. If somebody finds an error, or you find one yourself, point to the place where the error occurred and report it; and if there are prior publications carrying the error, make a reference to those as well.

For the publishers, the paper says: if you publish a paper with a result that others will reference, make sure all the references are consistent. It also tells the publishers: if you have published something involving a reagent, I need to know the catalog number of that reagent, and Anita will tell you how painful that business is. And if an analysis was done on a particular version of the data, you must at least give us the timestamp of the version the analysis was run on.

For the repository systems, the paper says: make the data available as soon as possible, not three months after it was submitted. And if you are a repository and you want people to access the data, don't just give them the ability to download it, because it might be a truckload of data. If it's two terabytes, I cannot download it.
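Taken together, those recommendations amount to keeping a provenance record alongside every analysis. Here is a minimal sketch of what such a record might hold, with invented field names; a real system would more likely use a standard vocabulary such as W3C PROV.

```python
# A hedged sketch of the provenance record the paper's recommendations imply.
# Field names and identifiers are invented for illustration.

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class AnalysisRecord:
    inputs: list             # every input dataset, positive or negative
    input_version: datetime  # timestamp of the data version analyzed
    process: str             # what was done, exactly (script or workflow id)
    outputs: list
    errata: list = field(default_factory=list)  # where errors were found

run = AnalysisRecord(
    inputs=["dataset:DS-042"],
    input_version=datetime(2013, 1, 15),
    process="workflow:normalize-v2 -> cluster-v1",
    outputs=["result:clusters-7"],
)

# Later, someone finds a mistake: record it against the exact step and
# reference any prior publication that carried the error.
run.errata.append({"step": "cluster-v1", "refutes": "doi:10.0000/earlier-paper"})
print(run)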
So if you are a good repository, you also give people the opportunity to analyze in place. The data must adhere to the right standards, it must be consistent, and it should be analyzable by many people simultaneously. That's a lot of constraints.

So here is my question. If we do this across all scientists, all institutions, all repositories, public data centers, publishers, whoever operates on any piece of the data or any derivation thereof, it is actually a whole lot of work. And what it says is that behind it all, we really need a very scalable and elastic infrastructure. Without that infrastructure, a scientist cannot afford to store the petabyte of data that he or she collects over a professional life. We need the infrastructure, and we need to think beyond just the scientist, if we want to make data sharing a reality. It's a whole enterprise, and a whole lot of back-end infrastructure has to be in place. The same goes for computation: if you give the data to somebody and they just index it, indexing that much data takes a lot of computation, so the ability to compute has to be there. Just in NIF, our Lucene index is about half a terabyte, purely because of the volume of data we have. And we do not even deal with binary data; we deal with structured and semi-structured data. So this business of having the right infrastructure should not be forgotten when we talk about the question of where the data is and why it is not being shared.

The other implicit assumption is that we don't just share data; we expect a lot of services around it. We want to search and query the system so that we can find the data. Here are some kinds of analysis that we do in NIF. People look for facts. People look for a landscape survey: they ask, what kind of data do you have about this? They just want to know the data holdings; that's one of the questions people ask. We also want the system to do active analysis: run this analysis on that data. So those are two examples.

What we are also hearing from this paper, and we do this just a little bit in NIF, is traceback. What we do in NIF is very simple. You have some data and you derive other data out of it; you write views on top of the data, for example. If you have a chain of views from some original data and the original data gets updated, you need to know that. All the views need to know that. So wherever you have a data repository with derived results, you have to have a traceback facility. Now, this becomes a much larger problem when the data is distributed over multiple repositories and multiple databases, and we still need system-wide provenance tracking, not just provenance tracking within one database. Because remember, everybody is pointing at the data they derive from, and that data is not necessarily with them; it's with the originator of the data.

So there are many groups today looking at different parts of the problem. We mention NIF because, well, I'm in NIF. But if you look at INCF, it is working on standards and on tools. If you look at the project called ELIXIR, it's looking at architectures: how a multi-institution system would work in terms of providing data, indexes, services, and computation.
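The traceback idea is essentially dependency propagation over a graph of derived views. Here is a minimal sketch, assuming an acyclic derivation graph and invented repository names; this is my illustration of the mechanism described, not NIF's actual implementation.

```python
# Minimal traceback sketch: views derived from data form a dependency
# graph, and an update to a source must reach every downstream view,
# potentially across repositories. Assumes the derivation graph is acyclic.

from collections import defaultdict, deque

derived_from = defaultdict(list)  # source -> views built directly on it

def register_view(view: str, source: str):
    derived_from[source].append(view)

def notify_update(source: str):
    """Breadth-first traceback: collect every transitively derived view."""
    stale, queue = [], deque([source])
    while queue:
        node = queue.popleft()
        for view in derived_from[node]:
            stale.append(view)
            queue.append(view)  # a view may itself have derived views
    return stale

# A chain of views: raw table -> cleaned view -> summary held elsewhere.
register_view("repoA:cleaned", "repoA:raw")
register_view("repoB:summary", "repoA:cleaned")

print(notify_update("repoA:raw"))  # ['repoA:cleaned', 'repoB:summary']
```

The distributed case is the same traversal, except that the edges cross repository boundaries, which is why it calls for system-wide rather than per-database provenance tracking.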
The Opportunities for Data Exchange (ODE) project looks at the requirements of every stakeholder in the data-sharing enterprise; you should go read their reports, they are actually very insightful.

So I have some questions. One is, can we really support all the data that's out there today? And the answer is no. We need to think of an architecture, not just individual data-sharing centers, but an architecture that covers the whole thing. The two questions I'm going to focus on in the rest of my talk are how we can monitor what exists, and how we can build an incentivizing method on top of it.

Now, going back (I have to hurry up, I suppose), how many of you have actually looked at Facebook properly? If you look at Facebook, every message you send is a data object, and the system distinguishes between a post, a picture, a response, and a message you send privately. All data types have their own semantic types and a unique ID. And any time a program runs on them, the system knows exactly what program is running on what piece of data; this is maintained in a log in the Facebook file system. Any action you take is tracked: if you update something and then delete it, the system knows, and it has to go to all the other machines holding replicas and delete it there too. We believe this distributed activity model is applicable to a data-sharing enterprise.

So if you really want to monitor, you should be able to do things like this. If a data set has been deposited, you should know, because that's an event internal to the system. If one resource references another, and the reference comes in structured, say as a DataCite DOI or some other standard, you know it's a referencing event. So you should be able to characterize the kinds of data events that occur within the whole data-sharing enterprise, and you should be able to monitor them. What happens if you can monitor? You can count. You can count the frequencies and regularities of the various operations and the agents that cause them. Remember, I said agents, not people: the agents include people and databases and software systems, anything instrumented enough that I can sense its activities.

Now, if you do that, you should be able to create an accountability score of good data citizenship. I am not proposing a formula, but it has certain properties. If you submit something great once and never submit again in five years, your accountability should go down a little; that's expected. If a lot of people are talking about you, if there's a lot of buzz about your paper, it should increase, because you are clearly affecting and influencing the community. Similarly, the more referenceable data a publication has, the more its value should go up. That's not very hard to implement, because the technical infrastructure for it exists. And then, regardless of whether the agent in the system is a person, a publication, or a data center, you should be able to take this and also create an influence score and an influence class. You should be able to say a data center is really a specialist in this class of data but not in that class. We do that a little bit in NIF, where we say this source is more appropriate for a physiology query and that one for a pharmacological query.
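The talk deliberately stops short of a formula, so the following is only an illustration of the two stated properties: the score decays when an agent stops contributing, and rises with references and buzz. The function and all constants are invented.

```python
# Toy good-data-citizenship score for any agent (person, database, tool).
# This is not a proposed formula; it only exhibits the properties the
# talk describes. All constants are arbitrary.

import math

def accountability(deposits, references, years_since_last_deposit):
    base = math.log1p(deposits) + 0.5 * math.log1p(references)
    decay = math.exp(-0.2 * years_since_last_deposit)  # fades with inactivity
    return base * decay

# One great deposit, then silence for five years: the score drops, as expected.
print(accountability(deposits=1, references=10, years_since_last_deposit=0.0))
print(accountability(deposits=1, references=10, years_since_last_deposit=5.0))
```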
This is not that hard if you have a semantic framework like an ontology in place. And guess what, the social networks already know this. There are influence engines, as they are called. One is called Klout: what's your Klout? It's a number; my number is 48, Obama's number is 99. Seriously. Another is called PeerIndex, and a third is called Kred. What these engines do is look at activities and compute scores. Whether that is good or bad for society is a secondary question. However, the fact that they are looking at impact, that they are trying to find measures of impact, is I think significant, and I think more of this will happen.

Now, does this directly apply to science? Not quite. If you are a theoretical physicist, we should not worry about your data submissions, and the system should know that. These measures are for online activities, so they expect you to update every day, because they work off Facebook and Twitter. That's not how science progresses. So parameters need to be adjusted and equations need to be changed. Also, I suppose some of these scores will be community-based: in some sub-communities the rate of publication may be higher than in others, and that should be factored in.

I'm sure there are a lot of people who will object to this. They'll say: what, you are judging me? Are you crazy? This proposition makes no sense. But I believe that, A, this will happen, and B, it will happen through third-party watchers. And I think if it happens, there are a lot of in-kind incentives that people would get. For example, people would want to know whether their data is cited enough. It's good to know that your data has 1,500 derived products or papers. So if you do this analysis, these analytics would inform the utility of your research, and I think that's important in progressing science. I'll stop here.

Okay, we have time for really a single question before we move to Mercè.

It's a comment to both yourself and Cameron, really. I think the first time I bumped into the problem of submitting supplementary data sets with a publication was in the 90s. I ran into it again three months ago: there has been zero progress in 20 years. So it's all very well to say people need to submit their data sets, but currently we can't even submit a script or a PDB file to PLOS, for instance. So I think it's not just the scientists who are the bad guys here. Even the scientists who are the good guys just can't play the game.

I completely agree with you. I said this is a problem of the entire infrastructure. There are many players here, and there is a need for large-scale back-end and holistic thinking. That's why I didn't want to point only at the scientists.