Let's get the afternoon session started. If I could ask everybody to sit forward a little bit and take some of the front rows, so that even without a mic we don't have to yell, that would be great. A nice community, right? Okay, so for this session: my name is Fajian Long. I was a biologist before, and last year I joined the University Library here at Carnegie Mellon. In this session we're going to talk about open tools and platforms. Our first speaker is Ben Busby, right there. Ben is a data scientist and the genomics outreach coordinator at the NCBI. Those of you who have met Ben know that he's very enthusiastic and very active in the scientific outreach community. Some of you might have met him at the genomics meetup events, and some of you may witness the scientific speed dating event this evening. So Ben, please take it away.

Hey everybody, my name is Ben Busby. I have 48 slides that I promised to get through in 12 minutes, so we're going to go reasonably fast, as soon as these slides actually come up. There they are. All right, so I'm going to talk a little bit about genomics, because that's the area of the world I primarily work in. That said, I think it illustrates an example of a way data sharing can be done in a lot of different scientific domains. Our friends behind me house pretty much all the world's biomedical literature, except preprints, which, as was pointed out earlier, are posted in Europe PMC. And we're also the largest genomic database in the world; we have something like 14 petabytes of data.

I'm a big proponent of not wasting time, so in case the next 11 minutes mean nothing to you: you probably use PubMed or PMC. I just want to remind you that you can use the website, you can create advanced searches, and you can register for a My NCBI account and have your searches run every week or month or whatever and emailed to you. But if you're in the API world, we have an API called E-utilities. There's a command-line implementation called EDirect, and you can Google for the EDirect cookbook; it's on GitHub. You can find large bulk things very quickly, which is kind of nice, and you can file issues if you're having trouble coming up with things. You can actually locally cache your own version of PubMed or PMC, the open access subset; it's really easy to get those now. And we've been experimenting: we've built some prototypes of corpus updaters for the PMC subset, if you're in the natural language processing world. You can get more information about that. And that's literature, done.
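To make the API side concrete, here is a minimal sketch of a programmatic PubMed search through E-utilities. The esearch endpoint and parameters are the documented ones; the query term is just an illustrative placeholder.

```python
import requests

# E-utilities esearch: find PubMed IDs matching a query.
# The query term below is an arbitrary example.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

resp = requests.get(
    f"{BASE}/esearch.fcgi",
    params={
        "db": "pubmed",                     # "pmc" would search PMC instead
        "term": "herpesvirus metagenome",   # placeholder query
        "retmax": 20,                       # number of IDs to return
        "retmode": "json",
    },
)
resp.raise_for_status()
ids = resp.json()["esearchresult"]["idlist"]
print(ids)
```

The returned IDs can then be fed to the efetch or esummary endpoints, which is essentially what EDirect automates on the command line.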
Now let's talk about data. In case you're not a sequence bioinformatician, here's sequencing in 20 seconds: we use a short-read or long-read sequencer, we get reads, and we either call variants or, in the case of RNA-seq, call expression, or we assemble. Anyway, we put stuff up, and the metadata goes into cross-data-type descriptors, so you can search for things and then look at what particular data types are out there for your topic of interest. But how do you find data when the metadata is insufficient? This is one of the two main things I want to talk about today: one, contextualized indexing of data; two, getting out small data slices.

So we taxonomically indexed all of SRA using k-mers. This is not a perfect way to taxonomically index, but it's really nice. If you're interested in some funky family of viruses, now you can go to the SRA, which has millions of records, and find your funky family of viruses. So that's something we've been very, very proud of. And you can go and say, I'm interested. I was at the City University of New York and I said, pick a virus, and somebody said, herpes virus, right? Because everybody's interested in herpes virus. So they picked herpes virus, and I said, let's look at some metagenomes for herpes virus. And this was the best educational moment of my life, because there are human herpes virus sequences on dollar bills in New York City. That was found data; I did not plant this. That was kind of amazing. And I want to come back to this, because I think it's important when we think about indexing.

So say I want to extract some data. Say I have some data and the metadata isn't horrible: how do I extract the data? I'm going to go through this really fast, but we have this tool called Magic-BLAST. It doesn't really matter exactly what it is, but what it can do is go into raw data sets and pull out exactly what you want, right? You're not moving the data. And that's really point number two of this whole talk: don't move the big data sets; extract what you want out of the big data sets. So how do I do that? Well, I can just download a binary, and I need to find something to BLAST into. Say I'm interested in a repressor in E. coli. What I do is I go bioinformatics, bioinformatics, bioinformatics, bioinformatics, and now I have a sequence, right? So I make a BLAST database out of that, and then this is really easy.

Now, if you're not in the command-line world, don't worry. There's this thing called Software Carpentry. You can get from where you are to right here in two days. Software Carpentry is amazing; if you're not in the command-line world yet and you see a Software Carpentry course, you should sign up for it. It will change your life. And you'll get a better job.

So anyway, what you do is you open up Magic-BLAST, you make a BLAST database out of your sequence, and you blast into it. But here's the cool thing: here I am streaming. I'm using this simple command to stream from the data set, right? So I'm streaming data, I'm not dumping data out of anywhere, and the really nice thing is it runs pretty quickly, so I can do this in a couple of minutes. I really feel like I'm in the 21st century, because I can do this on a plane. That's a really exciting thing. By the way, FTP does not work on a plane, only HTTP; I can explain why later if people are interested. So you can do this on a plane, and we're streaming out of these huge data sets and only storing the little pieces that we want, instead of dumping and making copies. I think that's really important. And you get a SAM file, which, for those of you who do genomics, is kind of the lingua franca of the field, and it works with other software. But here's a really nice thing: we have language bindings for anybody to build this into their software. So anybody who does genomics work can build on this, allowing their software to stream data out of large repositories. That's something I'm particularly excited about.
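As a rough sketch of that streaming workflow, assuming the NCBI makeblastdb and magicblast binaries are installed and on your PATH; the file names and the SRA accession are placeholders:

```python
import subprocess

# Build a small local BLAST database from a reference sequence
# (e.g., the gene you care about, saved as FASTA). Names are placeholders.
subprocess.run(
    ["makeblastdb", "-in", "my_gene.fasta", "-dbtype", "nucl",
     "-out", "my_gene_db"],
    check=True,
)

# Stream the reads of an SRA run over HTTP and align them against the
# local database; no bulk download of the run. SRR000001 is a placeholder.
subprocess.run(
    ["magicblast", "-sra", "SRR000001", "-db", "my_gene_db",
     "-out", "hits.sam"],
    check=True,
)
```

The output is the SAM file mentioned above: only the small slice of the repository that matched your query ever lands on your disk.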
But wait. Imagine we could take this and build an index of all virus sequences. Not just known virus sequences: sequences related to known virus families, and novel viruses that are in metagenomes in the SRA. We have about a million metagenomic samples in the SRA. So say we could build indexes like this, so you know where the herpes virus is in your environment. (And that joke is just not working; I'm going to stop using it.)

Anyway, so we started building this pipeline. The first step was actually built by a guy named Paul Cantalupo and some others in Pittsburgh. We built this really simple de Bruijn graph approach to discover novel viruses, which I won't talk about too much, and we wrapped it in Python. With this we can discover novel viruses, we can discover relatives of known viruses, and we can also discover exact viruses, like I showed you earlier. We built all of those things in hackathons. So what we're going to do this spring is have data analysis hackathons where we actually index a whole bunch of different data types. We're going to index viruses out of a million metagenomes. We're going to index RNA-seq. We're going to index haplotypes. We're going to index variants, and a bunch of other things. And by the way, we're able to give everybody credit for all these projects by using osf.io; just a shout-out to them, I think there's somebody from there in here. So here are some upcoming hackathons. There will be many, many of these data analysis hackathons, as well as tool-building hackathons, in the spring, so that's something to keep an eye out for. In December and January we are going to build an index of viruses, and then we'll start working on RNA-seq and antimicrobial resistance data.

So what does building an index allow us to do? If we have an index that everybody can more or less agree on, then we can point it back to data sets in situ, and we no longer have to ETL them. It doesn't matter if they're in a big data set in the cloud; it doesn't matter if they're on somebody's local front-facing server. It doesn't matter where the data sets are. And here's the thing: most people just want the knowledge anyway. They don't actually want the raw data. Should that raw data be available? Yes. How fast should it be available? That's a discussion for the last panel. But anyway, I think that is the value of indexing. We're also going to try to work on contextualized ways to think about the metadata. And to harken back to one of the previous presentations, the most irreverent thing we've ever done in a hackathon is build SRA Tinder, because some data sets are crappy. Since some data sets are crappy, you should be able to eliminate data sets you don't want when you're streaming small bits of data out of large repositories. So that's it, and I'm happy to take any questions.

Can you put up the SlideShare? You had a URL for downloading the slides; you should put that up, because there were a lot of them. Oh yeah, absolutely, I totally can. That was like a basic primer on how to use NCBI, so check that out. And the slides are going to be shared after the conference, with the speakers' permission, of course.

On the issue of crappy metadata: is this something that you guys, the royal you guys at NCBI, are working to kind of legislate, or is this something you think is going to be driven by communities? In my field of bacterial genomics it's a big problem, with just no, or crappy, metadata, which makes proper use and reuse a real challenge.

So first, everything I'm about to say is my personal opinion and not the opinion of the NIH.
Second, legislate is a very specific term that has to do with Congress, which is almost certainly not going to do anything about bacterial genomics metadata. Anyway, it is a massive issue, and I think there are a few ways to approach the metadata problem. One is through harmonization and natural language processing. Two, and this is my personal opinion: if the NIH wanted people to submit reusable data, R01 renewals would be based on how much the data was reused, right? Again, my personal opinion. The Office of Extramural Research really should base part of renewal funding on that. And third, I think there is some hope for making metadata out of the data itself, via some of these indexing approaches. I'll give you a specific example from bacterial genomics. We can take the raw FASTQs that come off of Illumina sequencers and make contigs, right? Based on the contigs we can decorate them with domains, but more specifically we can look at antimicrobial resistance genes. When we look at antimicrobial resistance genes, we can index which of our approximately four million bacterial data sets have specific antimicrobial resistance genes. So what we've done, in that sense, is make an index of the bacterial resistance genes that actually came out of the data. And then we can go back and contextualize that and say, okay, these people said it was at this specific latitude and longitude, and try to improve the metadata based on the metadata we extracted out of the data. So those are a whole bunch of approaches, the last of which I have a little bit of control over, and the rest of which I don't have a whole lot of control over. So, yeah.

Thank you, that was super. We can save other questions for later. Our next speaker is Sean Davis. Sean is a senior associate scientist at the National Cancer Institute. Actually, when Sean was young he did a summer school in Pittsburgh, and he loved Pittsburgh so much that he decided to do his Ph.D. and start a family here at the University of Pittsburgh. Afterwards he moved to NCI, and from there on he has done a lot of great things in bioinformatics and data analysis. So we're really glad to have him back in Pittsburgh to give a talk. Please welcome Dr. Davis.

Thanks for the invitation. I walked past this building every day for like six years, and about that little summer school thing: this is rumor, but I'm going to tell you the story anyway. There was a group, one of the sister groups on campus at the same time, that was assigned to figure out whether the tiles in this building were radioactive enough that you couldn't lean against them. So this building does have some history and some age to it, obviously, when you walk around. But in any case, I'm going to talk about the Cancer Data Ecosystem: data and cloud resources for cancer data science. This is going to focus a little bit on the cancer data side of things, but I want everybody to think about this as a data ecosystem that is developing in a way that I think a lot of us would like to see our own data ecosystems develop. Slides are at the link at the bottom. Here's the NIH. People ask me where I work and what it looks like: that's it. That's the Clinical Center. It's the largest hospital on earth devoted to biomedical research; believe me, it's huge. I don't work in that building, but...
Quickly, I'm going to give an overview of the NCI Cancer Research Data Commons. It's a mouthful; CRDC for short. It's an API-driven, open and FAIR data platform for cancer research. I'm going to focus a little bit on a specific hub, or a specific node, the Genomic Data Commons. Like Ben, I have a genomics focus, but NCI has a lot of data sets that follow the same sort of tack you're going to see here. And then, finally, I'll talk a little bit about cloud resources for cancer genomics, in particular cloud-based genomics and using the cloud for cancer research.

Cancer genomic data challenges. As of now (this is a little bit old) there are three petabytes of data associated with cancer that NCI hosts, essentially. There used to be fragmented repositories that hosted these data in various pieces. It was hard to find things because they weren't in one place, they were annotated with metadata differently, and the data were actually processed differently. So, assuming a two-and-a-half or three-petabyte data set is available, what would you need to use it? Well, you would need about $2 million worth of storage if you wanted to host it yourself, and you would need about 23 days on a 10-gigabit connection to download it all. (As a sanity check: 2.5 PB is 2.5 x 10^15 bytes, times 8 bits per byte, divided by 10^10 bits per second, is about 2 x 10^6 seconds, or roughly 23 days.) Most places don't want to do that, so NCI has been thinking about approaches for getting around it.

All of this is couched in things that we've heard already: the FAIR data principles, the FAIR guiding principles. And the NCI Cancer Research Data Commons has a particular vision: to create the data science infrastructure necessary to connect repositories, analytical tools, and knowledge bases. Now, I would argue that this is actually too limited a scope, but I'm going to show that we can extend the scope without having NCI do extra work. Data-intensive cancer research can be thought of as a nodes-and-edges diagram. Think of each of those nodes as either a data resource or an analysis resource. The way we interconnect those things is through edges, and those edges are typically described these days as APIs. So if we have APIs, and we have a description of what those APIs are able to do with respect to either data or analysis, we can actually build fairly large pipelines, or fairly large and complicated research infrastructures, without having to do anything locally. I'll show an example of that in a minute.

So, the Genomic Data Commons. All data science starts with data: we have to generate it, and then, more importantly, we have to make it findable, accessible, interoperable, and reusable. The Genomic Data Commons is really designed to do a few different things. It unifies fragmented data repositories; it puts everything in one place. It harmonizes data and metadata across existing and new cancer research programs and projects. That's an upside: we get good metadata and data. It's also a downside, because as of right now the Genomic Data Commons contains only 40 projects, and NCI probably generates about a thousand or more data sets per year. The reason there are only 40 in there is that it takes months to develop the data model associated with a new data set. These are the 40 projects. There are 33,000 or so (actually, it's about 40,000 now) cases: individual patients' data. And it's now up around 500,000 files. These are all metadata-annotated and findable. This is what the website to find the data looks like. There are essentially two ways to enter the data.
Either files or cases. If you want to start with files: this is what I'm looking for, a certain type of file associated with a certain assay. Or cases: I want patients who have breast cancer and are under 35. You marry those two as a query, and you get back a subset of either cases or files, or both. This is all driven by an API, and the API is driven by the underlying data model. Each of those circles over there you can think of as a relational table in a database. So to describe just the metadata (this is not the data, just the metadata associated with all the information in the Genomic Data Commons), you have to fill out each of those tables for each entity. That data model then allows us to query into it using an API and to get interconnectivity between these different entities: essentially, to find the files of the cases that we want. NCI can build on top of this API: data browsers, data discovery tools, etc. NCI spends quite a bit of money on this, I won't tell you the number but it's at least six digits or more, and you get these kinds of things. But what happens if nothing that NCI is building does what you want? Well, because the ecosystem is extensible, we can leverage this open-data concept of nodes and edges, APIs, and interoperability to enhance data value.

How many of you have heard of Bioconductor? All right. That's Bioconductor: about 1,400 software packages for understanding and interpreting high-throughput biological data. So if we can marry the tools and the developer community and the user community (it's around half a million users) with the data that are in the Genomic Data Commons, we expose those data to a much larger research community than if we just relied on people finding a particular tool that's already built for them. So early on I built a package for Bioconductor called GenomicDataCommons. It does something very simple: it just talks to the API of the NCI Genomic Data Commons.

To give a really quick example: we want to explore the somatic variants, the variants specific to the tumor, in patients in a melanoma cohort, and we want to do this in a reproducible, reusable way using Bioconductor. We want to go from pointing and clicking on a website to this, but we want to do it reproducibly; unless we're taking screenshots of all the pointing and clicking we're doing, it's not reproducible when we do it on a website. So let's write some code. Starting from this, this is the API: if I click a box here, that's essentially the same as doing a filter, filtering the data, in Bioconductor. So I start with files, in the square box. This is the code to do what I just showed, to make the plot I just showed. Start with files. We're going to select melanoma; this is what that looks like in code: cases.project.project_id, and the TCGA melanoma project fits that. MAF, somatic mutation: you check variant aggregation and masking, and a MAF file. Three more filters, in code, written down, easy to see; it's actually fairly easy to understand. For those of you who use dplyr, it works very much like dplyr. We get one file. The ids call pulls back the ID for that file. It's a universally unique ID, so we can actually share it with people. And then finally we download the file.
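The talk walks through the Bioconductor GenomicDataCommons package; for readers outside R, a rough Python equivalent of the same kind of query, issued directly against the public GDC REST API, might look like the sketch below. The endpoint and filter syntax follow the GDC API documentation; the specific field values are assumptions for this example.

```python
import json
import requests

# Find open-access MAF files from the TCGA melanoma project (TCGA-SKCM).
# Field names follow the GDC API data dictionary; the values are
# illustrative choices for this example.
filters = {
    "op": "and",
    "content": [
        {"op": "in", "content": {
            "field": "cases.project.project_id", "value": ["TCGA-SKCM"]}},
        {"op": "in", "content": {
            "field": "files.data_format", "value": ["MAF"]}},
        {"op": "in", "content": {
            "field": "files.access", "value": ["open"]}},
    ],
}

resp = requests.get(
    "https://api.gdc.cancer.gov/files",
    params={
        "filters": json.dumps(filters),
        "fields": "file_id,file_name",
        "size": "10",
    },
)
resp.raise_for_status()
for hit in resp.json()["data"]["hits"]:
    # file_id is the UUID mentioned in the talk; the file itself can be
    # fetched from https://api.gdc.cancer.gov/data/<file_id>
    print(hit["file_id"], hit["file_name"])
```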
One block of code, actually only one expression in R, goes through all that filtering on the website and gets us the data. Then we use a package already available in Bioconductor called maftools. Two more lines (actually three more lines, because I left out the plot) of code gets us this. So with about ten lines we've gotten the data, entirely reproducibly, from what is actually the largest NCI repository, and in a fully reproducible way made a plot. If the data happen to change, we rerun the code, we get the new data, and we get a new plot.

So that's all fine and good if we want to stick to data sets that are downloadable, that is, processed data sets that are downloadable. But we're obviously moving to a place where we want to be able to share our knowledge, and that knowledge may actually be in very large-scale workloads and pipelines. So we want to collaborate at scale. A lot of people talk about cloud computing because you don't want to download the data, or because it's scalable. My argument is that cloud computing enables something that we haven't been able to do to this point, which is collaboration at scale without much work. So: reproducibility, reusability, collaboration without much work. That's what we're going for. Team science here is critical. As we've heard in, I think, every talk this morning, teams are not in your lab anymore; they're spread across the world. There's a nice paper on this topic recently by Ben Langmead, and he comes to the conclusion that the cloud's elasticity, reproducibility, and privacy features make it ideally suited for this kind of work. We're moving away from this model of local computing and storage, which silos our science (not just our data, our science), and moving toward a situation where we have cloud computing capability with data, workflows, and security built in.

There are three cloud resources that enable this kind of technology. Here's what one looks like. You really can't read it, but at the top left there are projects. Projects in this platform are shareable, so all I need is your email and I can share this project with you, no matter whether it has two files or 10,000. Workflows are also shareable. The links for getting access are here: you can sign up, it takes about 15 seconds, and you'll get $300 worth of credits. If you run a small RNA-seq analysis using one of the newer RNA-seq tools, it's about 10 cents per sample, so $300 will get you 3,000 samples or something. There are a lot of challenges; I'm not going to read through them all, but one of the biggest ones I want to highlight is data ownership and valuation. There was a lot of discussion this morning about who owns the data and what its value is. We know that it's very valuable; frankly, we don't know who owns it. Even if people have consented, we don't know exactly who owns it. So those are open questions, but in any case, things are moving forward and we're learning as we go. Links of potential interest, and finally my contact information. Thanks.

Any questions for Sean?
Hello, it was a great talk, thank you very much. I have a question about the Bioconductor package you showed. First, a technical question: the data are being downloaded directly from the NIH website that you showed us, right? It seems to me that that's a little bit fragile to URLs changing over time, to the NIH infrastructure changing over time. So, among the challenges you face, what sort of solutions are you envisioning for this kind of fragility?

Yeah, so these days APIs, if they're written correctly, are self-discoverable. That is, if you have a URL, you should be able to discover the API, and that's exactly what happens here. I haven't written code that is specific to the API; I discover the API every time the package starts, and good APIs are written that way for exactly that reason. So we're not talking about an API being unable to change over time. In terms of the URL: yes, if that changes, then the social contract is broken, and that has actually happened once with the Genomic Data Commons. In that case I had to change my software, and I wrote to the Genomic Data Commons people and asked to be notified, saying, hey, don't do that again without telling those of us who are using this at kind of an industrial scale, which we are. That's a communication problem that software developers need to work on over time.

Thank you, Sean, for this wonderful talk; and we'll have the slides. Our next speaker is Idan Gabdank. He's a senior data wrangler from the ENCODE Data Coordination Center. ENCODE has done a lot of wonderful work developing computational pipelines for reproducible analysis of DNA elements, and on data curation, so I'm going to let him tell you more about it.

First of all, I'd like to thank the organizers for inviting me and letting me present the ENCODE project at this scientific meeting. I'd like to start by asking you to imagine that you are provided with a map that gives you the location of a treasure that is out somewhere, marked with this red X. I imagine that you, as true treasure hunters, will go there and try to find the treasure, and, surprise surprise, you will see a huge rock exactly at the place that is marked on the map. And since there's a huge rock there, you probably can't access the treasure that is under it. Let's imagine further that you were somehow able to move the rock and discover a cave. That doesn't really mean that you now have the treasure in hand once you enter the cave; well, obviously not, because the cave may be full of dangerous creatures that will prevent you from getting the treasure. What I'd like to say is that even if you provide access to the location where the treasure is, not everyone will be able to get the treasure. Similarly, if you have open access to data, if you have some scientific resource that you can access, it does not always mean that you will be able to use the resource, and you will not always be able to find the data, even if the data is there. That happens frequently when the implementation of the resource was un-FAIR: when the principles used to build the database did not follow the FAIR principles. And that brings me to the ENCODE project, the way we implement our database, and how we try to make it really useful to the scientific community. I would like to first give a short introduction to the ENCODE project itself, as I'm not sure how many people here know what the ENCODE project is. For the benefit of those who don't: ENCODE stands for the Encyclopedia of DNA Elements. It's a project that was initiated right after the
human genome was sequenced, and the main goal of the project is to identify all the functional elements of the human genome. During the pilot phase, in 2003, the target was 1% of the human genome being investigated. In the second phase of the project, the Data Coordination Center was established at UCSC, and it was integrated with the genome browser that anybody who does genomics probably knows about. In the third phase, the Data Coordination Center moved to Stanford, and more or less at that time there was a big rethinking of the way we treat the data and the way we envision this resource. The project manager at that time, Eurie Hong, envisioned it as a Zappos for the genomics community: we wanted to create a site that would allow you to find your data as easily as you find a pair of shoes on Zappos. The way to do it was to create a pretty sophisticated data model; all the metadata is captured in JSON objects, and we provide a RESTful API for submission to, and querying of, the database. Once we had done that, we started to get a huge amount of data, and we realized pretty fast that it needs to be treated in a specific way to make the data really comparable and to allow integrative analysis. To address those issues we established uniform processing pipelines, using the DNAnexus platform at the time, and those pipelines are still runnable today if you want to analyze your data with them.

Now we find ourselves in the fourth phase of the project. In addition to the production labs that generate a lot of genomics data using common assays such as ChIP-seq, RNA-seq, ChIA-PET, DNase-seq, etc., we have added functional characterization centers. Their goal is to take predicted functional elements, try to test and characterize them, and give us experimental evidence for the functionality of those elements. In addition to the functional characterization groups, we have rethought our whole approach to the pipelines, and the new pipelines we are implementing are built on a framework that incorporates Docker technology and allows you to take the tool you would like to use to analyze the data, containerize it, and make it really portable and reproducible on whatever platform you would like to run it.

The ENCODE project currently consists of more than 20 labs all over the country, and all the data generated by those labs is submitted to the Data Coordination Center at Stanford. The job of the Data Coordination Center is to facilitate submission from the labs to the coordination center, review and curate the data, communicate the results of this curation and review back to the lab, and then refine the data. This process is iterative, and only when both the lab and we at the Data Coordination Center are satisfied with the results is the data shared with the scientific community and also submitted to other genomic resources. Now, as I pointed out, our job is to make sure that the data that gets published is of high quality. So when the data comes from a lab to the ENCODE Data Coordination Center, we apply pretty high standards to the data, and by standards I mean both at the level of the data itself and at the level of the metadata that is submitted; only data that passes those standards is accepted and released. Currently the portal hosts data not only from the ENCODE project but also from additional projects such as GGR, modENCODE, and modERN. We have more than 50,000 experiments, almost 50 different assay types, and more than 640 terabytes of data files.
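To give a flavor of the RESTful access just described, here is a minimal sketch of a query against the ENCODE portal's JSON interface; the search parameters shown are illustrative, not the only ones available.

```python
import requests

# Search the ENCODE portal for released ChIP-seq experiments.
# Any facet shown on the website can be passed as a query parameter;
# the specific values here are just an example.
resp = requests.get(
    "https://www.encodeproject.org/search/",
    params={
        "type": "Experiment",
        "assay_title": "TF ChIP-seq",
        "status": "released",
        "format": "json",
        "limit": "10",
    },
    headers={"Accept": "application/json"},
)
resp.raise_for_status()
for exp in resp.json()["@graph"]:
    print(exp["accession"], exp.get("biosample_summary", ""))
```

Each returned object is one of the JSON metadata objects described above, and its linked objects (library, sample, donor, and so on) can be fetched the same way, which is what makes the provenance chain traversable programmatically.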
Now, as I pointed out earlier, when you have this amount of data and you want to do integrative analysis, the data has to be comparable and reproducible. So we gather all the metadata in JSON format, we establish a data model to accommodate it, and, since it's a lot of data, we provide programmatic access both for submission and for download, and the data ends up on the cloud. Recently we added the ability to access the files in the cloud without the need to download them locally. The metadata we track is really diverse, and we have a lot of it. Here you can see a handful of examples of objects; each of those objects contains multiple properties that are filled in when the data is submitted. All the data is bound together using ontologies that make the identification of sample types unambiguous; they allow us to group the data and facilitate search. And we have to maintain a balance between asking for a lot of metadata and not asking for too much, because we obviously understand that it makes life for submitters very difficult if we require a lot of metadata. But because we get all this metadata, we can track provenance from a results file back to the donor that gave the sample that was tested in an experiment: from the peaks file you can go to the alignments, from the alignments to the reads that were generated by the sequencer, to the library that was produced in the lab, the sample, the antibody that was used, and back to the donor.

To allow comparable, integrative results and to ensure reproducibility, in the third phase we developed those uniform processing pipelines on the DNAnexus platform. Since Docker technology was introduced and has been such a success, and using the experience we gained working with those pipelines in the third phase of the project, we have moved on to establishing a pipeline development framework based on Docker, WDL, and Cromwell. If you are interested in learning more about this framework, you are welcome to come to my workshop tomorrow. And with that, I would like to thank the DCC members who work so hard making this resource possible, and NHGRI for funding.

Questions for Idan? No questions? Then I have a question: is this community data being submitted into this data coordination center? It's not community data: so far the vast majority of the data is from labs that belong to the consortium, though we do plan to allow broader submissions.

One thing I'm curious about: are you collecting data on how people are using this data? Do you have a sense that people are actually going in there and computing on the cloud, or are they going in there and downloading to their own machines, or using some indexing tricks like we saw before? Sure. First of all, the data being on the cloud, and the public access to it there, is a recent development, so we don't really have any sense yet of how much people have started to use it, because it's really very recent. As for the use of the pipelines: once we introduced this framework, with the pipelines in the Docker format, we got a lot of communication from users who started to use those pipelines; they reported various bugs, issues, questions, etc. That's reassuring: people really are using those pipelines, and we're really happy to get feedback and improve them. And in terms of statistics on general use of data and access of data, we have them, but I can't give you numbers off the top of my head.
The context of this is that, at least for the Human Connectome Project, which is a data collection of human brain data, I've heard people say that the data is available on the cloud, but usually people still go and download it to their own disk rather than using it as it was intended. I'm just curious whether that would play out that way here as well. I couldn't quite hear all of that, but I don't think it's feasible to download the contents of this database to a local computer. The other thing is that we believe the future is in the cloud, and I believe people will move their compute to the cloud. When the compute is in the cloud, it's convenient to move the compute to your data, do your calculation, and get the results, and not move the data to the compute.

I wonder if you could tell us a little bit more about how you came up with the standards: whether they were set industry-wide, or you just made them up at ENCODE, or whether they're sort of universal standards. When you say standards, which standards do you refer to? In your diagram, between the data and the release. Okay, so there are, I would say, two different groups of standards: one is metadata standards and one is data standards. By data standards I mean that to be a high-quality experiment you need a certain read depth, or coverage, or whatever other metric the scientific community has decided marks a valid, high-quality experiment. To decide on those numbers or metrics, we have working groups within the consortium debating and deciding on those values, and what the Data Coordination Center does is apply the standards. So the decision is made in the working groups; we apply it and present it to the user. A user who goes to a certain experiment will get a clear badge saying this experiment has this-and-that read depth, which does or does not comply with the standards. That's one group of standards. The other group is metadata standards, and those are dictated by us. Basically, the idea here is to establish a data model that allows us to curate the data, present it in a way that is useful for users, and facilitate search. If any of the specifications on the objects that people submit is not fulfilled, and that jeopardizes our ability to fulfill our mission, we are going to block the submission and require the specifics. Another thing that I would say also belongs to the metadata standards is validation of input. You cannot really assume that, for instance, files submitted to the resource are actually in the format they claim to be, simply because when you're talking about hundreds of thousands of files, inevitably some mistake will happen: somebody will submit an XML file instead of a FASTQ file, and nobody will know about it, because nobody checks. So we do apply rigorous checking of those things, trying to validate that every single thing that is submitted really is what it claims to be.

Okay, thank you, Idan. If there are no further questions, we move on to our last speaker. He's a senior data scientist at the University of Washington eScience Institute, and he has a passion for the analysis and sharing of large-scale open data sets.

Thank you so much for inviting me; it's really interesting to hear all these things. The slides for this presentation are on that website, so you can go see them; there are links to some of the references and so on. I've
also tweeted out this link, so you don't have to hurry up and copy it. I work as a data scientist at something called the eScience Institute at the University of Washington. We're the main hub for data science activities at the University of Washington, and that means we have several different arms to what we do. A lot of what we do is community-building activities. We have a data science studio, which is the main space where we work and where others can come and work with us, where we run events and programs and host working groups and so on. We take part in education, all the way from degree options and courses to various kinds of workshops; I'll talk about one particular workshop towards the end of the talk. And then we do research, both in the development of data science methods and in applications of these methods, and we write open source software. My talk today will be about open source software, specifically for neuroimaging.

I'm a neuroscientist, and I'm particularly interested in brain connectivity: the parts of the brain that connect its different parts to each other. We've known for a very long time that brain connectivity is really important for a variety of different functions. These are 19th-century neurologists, and they figured that out by looking at patients who have certain disconnections in the brain. This is the white matter of a post-mortem brain, and you're looking at these large-scale connections between different parts of the brain. That's what I'm interested in, and we have many different ways of looking at it experimentally. Ooh, we can't see them here, so I'll just skip forward. We have many different kinds of experiments that we can do, and many different methods, and traditionally the cycle has been that we collect this data, analyze it, interpret it, and then feed that back into additional experiments that we're doing.

But neuroscience is undergoing a kind of shift, from single labs doing their experiments at small scale, to also include a different model of knowledge production, which is more similar to these observatory-driven science projects. Of course, a project like ENCODE is also kind of an observatory, and many projects in genomics are actually like that, but the examples I have here are from fields a bit further away from neuroscience: the astronomy sky surveys, which produce very large amounts of data and distribute them to the community, and high-energy physics, here the Large Hadron Collider, where the community also develops very large instruments, collects lots of data, and produces this data in a way that the community can analyze. So we're moving to this kind of observatory-driven science, where once we had individual experiments done in individual labs. There are several examples of new brain observatories coming to life. The Allen Institute for Brain Science is a brain observatory; they actually use the term "brain observatory" to refer to one of their big experiments. The Human Connectome Project is a bigger brain observatory that has been going on for quite a while now and is collecting high-quality MRI data from over a thousand individuals. And these brain observatories keep growing in orders of magnitude. The Healthy Brain Network from the Child Mind Institute will ultimately collect 10,000 brains, I believe; you've seen some of this data earlier today in Anisha's talk. There are other projects with similar goals, also collecting around 10,000; I think the ABCD project has actually now reached more than 10,000
subjects. And the UK Biobank will eventually collect up to half a million subjects, with high-quality MRI data from many individuals. That creates a lot of opportunities for research: these new data sets will enable important new discoveries. For example, in the UK Biobank there will be such a large number of individuals that, just by the probabilities, we can guess that there will be hundreds of individuals who, during the time of the study, will convert to Alzheimer's. So we'll have a view of a fairly large cohort of people converting to Alzheimer's, just by virtue of the number of individuals collected. That's a new kind of data set; it's very different from the kinds of data sets we've collected before. So that allows us to do data-driven discovery. Dr. Slavin mentioned the fourth paradigm of science; this is the fourth paradigm, data-driven discovery. But there are challenges that come with data arriving at this unprecedented volume, variety, and velocity. We need new tools, we need new approaches, and we need to think about new socio-technical structures to sustain these kinds of data science approaches to these data. Now that we work more as a community of consumers of data, we need to think about how we organize ourselves, maybe more similarly to the way astronomers and high-energy physicists organize themselves. And I think one thing that is becoming clear is that open source software, software in general, is a necessary component: we need to produce software that will analyze these data, and we need to organize ourselves around these software projects.

So, one approach to this is open source software for science. I fall strongly on the side of choosing Python as an ecosystem for scientific computing, and I'll argue for that next. First of all, it's free and it's open source, so anyone can use it on any platform they choose. It's a high-level interpreted language, so you can pick it up and start using it quite effectively. There's very wide adoption of Python, both in industry and in academia, and that's important. For example, in astronomy (again going to astronomy): the astronomers have this collection of the literature in astronomy that you can mine, and you can look at mentions of software, and you can see that Python has been rising steadily, even though other kinds of software have also been rising. Astronomy is adopting Python quite heavily. Why are they adopting it? What's the benefit? Let's think about the ecosystem that exists around it. So there's Python, the language itself, and over the years people have developed various tools around Python to work with data: NumPy and SciPy, the basic numerical computing tools. Over the years people developed other tools for interactive work: plotting with Matplotlib, and interactive computation with Jupyter. On top of this growing ecosystem, people started building domain-agnostic tools for machine learning, for high-performance computing, for image processing, and so on. And people took these tools and built on top of them tools for particular domains: Biopython for biology, Astropy in astronomy. The core of what I'll talk about today is NiPy, which is neuroimaging in Python; there are several different projects that all stemmed out of this, and others in neuroscience have also picked this up, so now the Allen Institute, for example, is also developing Python software for their tools. In parallel, people in industry realized the value of data science and of using Python for data science, and so
tools that industry companies are developing, in this case for high-performance computing or for deep learning, also sit on top of an API in Python. So there's a lot of value in that, and through the Jupyter project we can easily interact with other open source languages, like R and Julia. This whole ecosystem becomes a really strong foundation for work in open source science, and what's nice about it is that it's a network of interactions, so we can learn from each other as we work.

So, NiPy, as I said, stands for neuroimaging in Python, and really the way to think about NiPy is not so much as a project but more as a community of practice. A community of practice is a very loosely knit aggregation of many different projects and individuals who try to learn from each other about the work that different people are doing within a certain domain. The focus is on commons: these are common resources, shared resources, that are maintained collectively and available to anyone. Many of the tools developed through this community are not owned by a particular lab; rather, they're open resources that are open both for use and for contribution.

So, back to the white matter I'm interested in. I'll focus in even more, on a particular project that focuses on diffusion MRI. These data, diffusion MRI data, are a way for us to measure the motion of water in particular parts of the brain. Water moves around, diffuses, inside of cells. In a place like this, with round boundaries, you might get equal diffusion in all directions, and we call that isotropic diffusion. But if you're in one of these cables that I showed you before, that connect different parts of the brain, you might get more diffusion along the length of the axons, along the length of these cables, than across their boundaries, and that's called anisotropic diffusion. We can use diffusion MRI to derive measurements and statistics of the degree of anisotropy. So we can get a measurement, or an estimate, of how much diffusion there is in particular locations. Here we're looking at a horizontal slice through a brain, and we can see that in these holes, the ventricles in the middle of the brain, there's more diffusion than in the core of the white matter. We can estimate how much anisotropy there is, how directional the diffusion is in a particular location, and that's called fractional anisotropy. And we can even say in what direction the principal diffusion direction points, and use that to track major fiber bundles, major tracts in the brain. So we have software that can do that, and if we focus on some particular tract in the brain, we might connect that to, let's say, a disease. For example, here we're looking at these big tracts, the corticospinal tracts on both sides, which connect the brain to the spinal cord and control motor activity. In amyotrophic lateral sclerosis, ALS, Lou Gehrig's disease, we can see that patients have a lower FA (this is the degree of anisotropy) in some parts of the corticospinal tract relative to healthy controls. So we can really look at the biology underlying a disorder using this.

DIPY is a project that focuses on open source diffusion MRI; it's one of those projects within the broader NiPy ecosystem. I work on this project together with others, and we've recently gotten support from the NIH to develop and disseminate it. It's open to users: here we just show that a lot of people download and use it.
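As a minimal sketch of what a DIPY analysis like the one described above looks like, here is the standard tensor-model fit that produces a fractional anisotropy map; the input file names are placeholders for a real diffusion MRI acquisition.

```python
import nibabel as nib
from dipy.io.gradients import read_bvals_bvecs
from dipy.core.gradients import gradient_table
from dipy.reconst.dti import TensorModel

# Load a diffusion MRI volume and its gradient table.
# File names are placeholders for your own acquisition.
img = nib.load("dwi.nii.gz")
data = img.get_fdata()
bvals, bvecs = read_bvals_bvecs("dwi.bval", "dwi.bvec")
gtab = gradient_table(bvals, bvecs)

# Fit the diffusion tensor model in every voxel.
tensor_fit = TensorModel(gtab).fit(data)

# Fractional anisotropy: values near 0 mean isotropic diffusion
# (e.g., the ventricles); values near 1 mean strongly directional
# diffusion (e.g., coherent white matter such as the corticospinal tract).
fa = tensor_fit.fa
print(fa.shape)
```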
We know that people use it in various places, but more importantly, it's open to contribution: people can show up and talk to us and contribute their code. We've had several contributions from Google Summer of Code students who have participated in this project, and we've had contributions from a variety of different places. That's sort of the power of this open collaboration: we can capitalize on a large community of contributors.

There are challenges for these kinds of initiatives. Sustainability is one kind of problem; assuring that the quality is high is another. I'll just mention two initiatives that try to address these. URSSI is the US Research Software Sustainability Institute. The institute doesn't exist yet, but there is a conceptualization project funded by the NSF that will eventually, hopefully, establish an institute for software sustainability, similar to the software sustainability institutes that exist in other countries, so you should keep an eye on this. Another initiative that I'm involved in is the Journal of Open Source Software. If you write software, this is an opportunity to submit, in short format, a paper about the software, and have the software itself be reviewed for its rigor, and it provides a canonical citation for the software. And given that I'm short and out of time, I'm just going to skip ahead a little bit.

Okay, so at this point I'm just going to mention one more thing. You might ask yourself, if you're a neuroscientist and you're doing neuroscience, how do you get involved in this kind of community? I'll mention one thing that we do annually that I think is a good opportunity to get involved, which is a summer institute in neuroscience and data science that we run, called Neurohackademy. It's held at the University of Washington
eScience Institute, so it's a great time to come to Seattle. You can see this picture captures the joy and the pain of hacking, all in one. We wrote a paper about this kind of format, actually published recently in PNAS, where we bring together people from different backgrounds in what we call hack weeks. We've done them in astronomy and geosciences, and now also neuroscience. The format there is very open; it's participant-driven and focused on building a community of people. It provides a little bit more of an on-ramp than traditional hackathons, in which you're assumed to already know how to do everything. So we think it's a really good format for bringing people into these kinds of communities of practice. And with that, I will put up my contact information. Thank you for your time.

Questions for Ariel? First, I have a question: are you recommending that everybody switch from R to Python? Yes, I think that would make life easier. I think we can look to the astronomy community, for example, and what they've done in the last few years. Astronomy used to be very entrenched in IDL, and their ability to move to Python pretty nimbly has allowed them to do phenomenal things in terms of the software that exists to do astronomy; their software ecosystem is phenomenal. But using other open source tools is fine too: those tools are pretty interoperable, so if you're using open source tools, it's not too hard to interoperate.

I was wondering if you could talk a little bit more about the role of hackathons, or trainathons, or training workshops, going forward. Sometimes there's been concern that hackathons are exploiting people, trying to monetize a tool. What's a way to do it where people are really getting something out of it, including training? Thank you for asking that question. The approach that we've taken is quite different from the word "hackathon". Hackathon is quite loaded, and we don't necessarily like the connotations it's loaded with, but we've kind of adopted it anyway, more for the idea that this is an opportunity to experiment, to do something a little bit outside of what you usually do, something small, focused maybe on an attempt to do something quirky and out of the ordinary. Traditional hackathons in industry are focused on competitive interactions: a particular data set, a particular project that everybody is trying to compete on. We don't do that at all. Instead, we open up the floor and we say: is there something you'd like to do this week with others here? People write on the board, and then people can join together. It's a terrifying moment for somebody who organizes this kind of event; we think nobody will stand up, and it's like the end of a talk when nobody asks a question, but it's worse, because you still have five days left to fill somehow. But people always do come and propose things. In these hack weeks we pay a lot of attention to how we create an inclusive environment, one that allows people with less technical experience to learn from others, and allows people to teach each other things. In the paper we describe this; one of the reasons we think it provides value is that we asked people (and we write a little bit about what we asked and what people said), and they think of it as a good opportunity for networking for scientists in their field. So it's domain-specific, rather than being focused on some technical aspect of the work; it's focused around neuroimaging in our case. And so there's a community of people who
are relevant to you, that you work together with. People also feel that it makes them better scientists, in the sense that they're now more able to do their work in the open, because we focus a lot on open source tools and on tools for making your research open. But I should say the paper itself also includes long supplementary material that is a recipe for how you might hold a hack week at your own institution, and we'd really like for people to pick that up and start developing the hack week toolkit, things like checklists and schedules for when you should do what, so that people can adopt this format and adapt it to their own work.

If there are no further questions, thanks, Ariel. I would like to invite all the speakers from this panel to come up. Now, does anybody have questions for any of these speakers, about community practices and things?

So, there's a tension around how discipline-specific we should make standards: then it's really useful for the genomics community but really garbage to everyone else. Is it worth developing all those standards within different communities, and even methodologies? I work in neuroscience, so there's BIDS, and it's specific to MRI, and now they're making all these other kinds of BIDS extensions. Does every discipline need to have ten different standards that they use? Could you help me think about that?

However we respond, that was a great question. So actually, I think that's why it doesn't necessarily make sense to make top-down standards, right? I mean, everybody in here has seen the xkcd standards comic, right? People come up with standards, and they don't always work. So I think, to drive standards development, what we want to do is really put things in the eye of the beholder: the correct standard is the one that people will use. If we have a system where people are motivated to share data such that other people will use it, then they will find the standards that make other people use their data, and when data needs to be integrated, they will define the standards that make other people integrate their data, therefore widening their audience. So I would actually advocate a bottom-up approach. Do you have to check 17 boxes? That doesn't seem very rational to me, if what you want to do is have other people use the data that you are paid to generate.

I think part of the solution comes from software that makes the data useful. There's a way that both of these things kind of meet each other in the middle: once there's software that makes the data useful, but only if the data is in a certain standard, then the standard becomes something that people actually want to use. To answer your question more specifically, I don't think there will be a general standard. It depends on the experiments that people are doing. All the experimental fields that are generating data today are going to evolve to do different things, and the standards are going to have to evolve with them. I don't think it's going to be easy or always worthwhile.

I think one of the things we've all been talking about today is data sharing, and data sharing and data reuse are very different things. I think the focus on data sharing has actually detracted from the real question, which is: how is data being used, and what is data's value? Sharing the data is not in itself useful; using data is useful; valuing data,
I think one of the things we're all talking about today is data sharing, and data sharing and data reuse are very different things. The focus on data sharing has actually detracted from the real question, which is: how is data being used, and what is data's value? Sharing data is not in itself useful; using data is useful, and putting value on data is useful. So there's a blend of these two. Maybe we need an operational definition of metadata, and an operational definition shouldn't focus on the data in isolation; it should focus on the person reusing the data. The second thing is that we need to incentivize this appropriately. There are datasets we share that have no value, in the sense that they've never been reused and will never be reused, and that isn't necessarily because the data themselves are not valuable. Until we get to a place where we can measure the value, the reuse, of data, we won't have the incentives in place to make the data more valuable. That's a data maturity process that a lot of companies have already gone through: they understand that there is value in their data, but the way to make it valuable is to attach metadata to it. People talk about data lakes and data swamps, and what distinguishes one from the other is the value of the data, and making data valuable is work. The last piece I would say is that metadata collection and metadata curation is not trivial. There are people who do this; they are paid money, they go to school for years to do it, and there's a reason for that: it's very hard to do well. As a funding agency, we need to spend more time figuring out how to put the people who know how to do this in touch with the people who don't, so that we can make things as efficient as possible in terms of data reuse.

Is there room for a federated approach to standards, instead of a single standard?

In health-related research data, genomics, imaging, and so on, there's a lot of indexing of indexes, and, somebody can tell me I'm wrong, but I don't think we've figured out how valuable those indexes are. At some level we're very early on in figuring that out; in some ways this field is further along than others, but I think we're still actually pretty early.

What I try to focus on is making the data as computable and accessible as possible. Particularly when you're talking about tens of millions or hundreds of millions of records, if you want to operate on those data to learn from them, as opposed to just search them, you need those data in interoperable formats, things like JSON, with metadata attached. It's pulling the data out of proprietary APIs and putting it in the hands of people with data science tools that makes those data much more easily usable.
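As a rough illustration of what computable and accessible can mean, here is a minimal sketch that pulls records from a hypothetical paginated JSON API and flattens them into JSON Lines for downstream tools. The URL api.example.org, the page and per_page parameters, and the empty-page termination rule are all assumptions, not any real repository's interface:

import json
import requests  # third-party; pip install requests

def fetch_all(base_url, page_size=100):
    """Yield records from a paginated JSON API, one dict at a time."""
    page = 1
    while True:
        resp = requests.get(base_url, params={"page": page, "per_page": page_size})
        resp.raise_for_status()
        records = resp.json()  # assumed: a JSON array, empty past the last page
        if not records:
            return
        yield from records
        page += 1

if __name__ == "__main__":
    # JSON Lines: one record per line, readable by most data science tools.
    with open("records.jsonl", "w") as out:
        for record in fetch_all("https://api.example.org/records"):
            out.write(json.dumps(record) + "\n")

The design choice worth noting is streaming: records are yielded one at a time and written as they arrive, so nothing requires holding millions of records in memory.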
Changing hats a little bit, from being a repository person to a data scientist: really, the format I can never use is the data that isn't there. In theory I can put almost anything into a vector and compare it to something else, but if you put nothing there, there's nothing to compare. And metadata are usually pretty small relative to the data, so if you put no metadata there, then format doesn't matter. So I think, as a first pass, as a community the emphasis should be on putting things in there, and then things will fall out. Most data types, and we had a scary discussion of data types this morning, can be merged in some way into something else; but if you only give me two pieces of metadata, there's only so much contextualization I can do.

Any more questions?

I'm going to ask, because I think you were getting at two different things. One of you would say that reuse is much better than just sharing data, but then you're saying it's better to just get the data out there. This is something I run into a lot with researchers, because the incentives to spend a lot of time documenting, if they exist at all, are weak. For a data scientist, a well-documented dataset would be an awesome thing, but generally I'm seeing researchers who share a dataset only because they're publishing somewhere that says you must put the dataset somewhere. So there's a tension between saying, okay, now go back and document all of your data, and just getting it out there. What's the way to get people into this without making them spend enormous amounts of time documenting datasets?

I can attempt to answer this question. We get a lot of questions about data. Yesterday I got a question about a dataset that Illumina used: apparently it used enhancers from some dataset, and the only metadata there was that it's 450k. The person contacted Illumina asking which enhancer predictions were used and how they were generated, and Illumina went down the chain, asked and called the people who generated those datasets; but how they were generated, which software was used to generate those enhancers, the person ended up with no answer to the question and no data to answer it. It's really impossible to go back ten years and figure out how those enhancers were generated. So with no metadata, the data is not usable. Now, that's not to say it's easy to document everything; we haven't even defined what must be documented. But I think it's necessary; without that, a dataset is just a couple of words.
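A minimal sketch of the kind of provenance record that would have answered that question, written as a small JSON sidecar next to the derived file. The field names, the tool name enhancer-caller, and the parameters shown are hypothetical, for illustration only:

import json
from datetime import datetime, timezone

def write_sidecar(data_path, software, version, params):
    """Write a small provenance record next to a derived data file."""
    sidecar = {
        "derived_file": data_path,
        "software": software,           # which tool produced the file
        "software_version": version,    # exact version, so the run can be repeated
        "parameters": params,           # settings needed to reproduce the output
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(data_path + ".provenance.json", "w") as fh:
        json.dump(sidecar, fh, indent=2)

if __name__ == "__main__":
    # Hypothetical example: an enhancer prediction track and the settings behind it.
    write_sidecar(
        "enhancer_predictions.bed",
        software="enhancer-caller",   # hypothetical tool name
        version="2.1.0",
        params={"genome": "hg19", "threshold": 0.8},
    )

Even this much, tool, version, parameters, date, is the difference between a reusable dataset and one nobody can trace ten years later.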
First, you can't have complete depth and complete breadth at the same time. The idea that you can have 500,000 datasets and be down to the base-pair level, or the individual dendritic-spine level, on every single one of those datasets is somewhat ridiculous, and it's also hard to see the use of that. I think there are some levels of data where you start to see that curated indices are more useful; those we can build with datasets like ENCODE today that are highly curated, and we can bring signal out of there. But that brings me to point two, which is that without a lot of information theory work, it's very hard to know what is increasing information density, which directions information density needs to go, where the holes are, that kind of thing. In short, we don't know what is important. There was a discussion earlier about how we won't know what data connections come out of things, in terms of personally identifiable information; it's very difficult to know the future, and it's hard to know a priori what is important in a dataset. What is important to me about my dataset may not be important to other people in the audience. So I think there's a space for both: for clean, harmonized things, which we've already established are important, and also for not-so-clean, unharmonized things that may be important in the future. And that brings it back to the beginning of the point, which is that with metadata, too, you can't have complete breadth and complete depth. If you're going to give lots of information, lots of content, you can't expect people to harmonize all of it, because that's hard; you can harmonize a little bit of it and then have a whole bunch of gobbledygook that people can sort through if it's important.

My earlier comment about data reuse related to data that costs a lot of money to store, where we expect the rate of sharing and reuse to be high. I'm going to tell a different story. When I moved over to NCI in 2007, next-generation sequencing was literally just starting; our lab bought the first sequencer on campus. For those of you who don't know what next-generation sequencing was like then: it's like seeing a trickle of water and then immediately, without warning, being hammered with the largest wave of data you've ever seen, overnight. We needed a way to deal with that, and what we did was horrible for the rest of the lab, but for those of us who were dealing with the data it really worked: we established a laboratory information management system. It's pretty simple, but the idea is that before any data flow, before any samples flow, we have to register everything. It's time consuming, but at the end of the day it allowed us to build a basically fully automated data management system and to see, back to 2007, every base of every run. The way we did that was a technical solution, but the reason it worked was that the PI in the lab said: we're not going to do it any other way; my experts in the lab, my data stewards, tell me this is the way we should do it, so this is the way we're going to do it. What we're talking about here are a lot of technical solutions, but at the end of the day it's a social contract. The incentives can come from lots of different directions, but if your boss tells you you're going to do it this way, that's a pretty strong incentive. So it's worth remembering that these are hard problems, and they sometimes have social solutions.

I'll just add one more thing, which is that when I talk to experimentalists, they tell me that data sharing has not only no clear incentive, it produces disincentives: you spend years of your life setting up a complicated experimental system and doing the experiments, then you publish your first paper, and then you're required to make your data available to all those people who were sitting and waiting for you to produce this data, so to speak, so they can mine it. I think it's perfectly reasonable to allow data embargoes: publish your first big paper, take a couple of years and mine the data, then release it. As to why you would organize it in a shareable manner anyway: I agree, you are sharing data primarily with yourself; if you want to be able to go back and mine that data, it had better be organized. That's pretty simple. But I think the disincentive around hard experiments is something serious.

If there are no further questions, I think we're running out of time. Let's thank our panel. [applause]

Thank you. So, we're trying to get people to put data in the public databases, and one thing I hear a lot, and this is an interesting projection, is: well, I would definitely put lots of data up, and that's a big thing, perhaps something else.