All right, so I'd like to introduce our first talk session of the day. The way all the talk sessions today will work is that we'll have three speakers, each with about 15 minutes to speak. We'll save most of the questions until the end of the session, so we'll have a few minutes for clarifying questions, the things you've just got to know, at the end of each talk, but we'll ask you to save your bigger discussion points for the end of the session, when all three speakers will come up to our bar stools and join us for a Q&A together. With that said, I'd like to bring up our first speaker and get her slides up: Lisa Federer, the data science and open science librarian at the National Library of Medicine.

Thank you, and thank you so much to the organizers for having me. I'm really excited to be with you this morning. Today I want to talk a little bit about data sharing and reuse at the National Institutes of Health, some of the initiatives we have going on, and then a little bit about my own research into metrics for data reuse.

A little contextualizing information if you're not familiar with the National Institutes of Health: we are the primary biomedical and public health research institution in the United States. We are the National Institutes, plural, because we comprise 27 institutes and centers that focus on different aspects of human health and research. We send most of our funding out to extramural research programs at institutions like the ones you're probably all coming from, and we also have an intramural research program, primarily on our main campus in Bethesda and at some other locations, with nearly 6,000 scientists working on various issues.

The National Library of Medicine, where I come from, is one of the institutes of the NIH, and we are also the world's largest biomedical library. We do the things you would expect of a library: we collect books and journal articles. But we're also a major resource for data. We house the National Center for Biotechnology Information, which runs a number of data repositories; we send out over 115 terabytes of data per day to over 5 million users worldwide, and we also bring in quite a lot of data every day, over 15 terabytes per day from around 3,000 users around the world.

As a library, our mission is to make open science and scholarship more findable, accessible, interoperable, and reusable, or FAIR. We're also interested in making sure that digital research objects are attributable and sustainable: that people can cite them and give credit to the people who created them, and that we have good systems for sustaining access to these resources.

The NIH as a whole has a long history of dedication to data sharing. Our mission is (my slide got cut off there a little bit) to seek fundamental knowledge about the nature and behavior of living systems and the application of that knowledge to enhance health, lengthen life, and reduce illness and disability, and data sharing is really foundational to that. We have formal policies, but the long-standing philosophy is that the results and accomplishments of the activities we fund should be made available to the research community and to the public at large. We have had a number of data sharing policies in place since about 2003.
And breaking news: just yesterday, the draft data management and sharing policy went out for public comment. I will tweet that link to the conference hashtag when I'm done talking. It will be open for public comment until January 10th, so I really encourage all of you to give it a look, make some comments, and take it back to your communities.

In addition to the policy piece, we've also supported a number of infrastructure projects related to making data open and available. One of the first was the Human Genome Project, whose data we made publicly available, and we've continued since then; these are just a small sampling of the many repositories and data resources we support and house at the NIH.

We also have an NIH Strategic Plan for Data Science, which we began last year. (Oh good, it's not cut off on that screen, only on mine.) This was requested by Congress, and we actually report back to Congress quarterly on the activities we're undertaking to enhance data sharing, access, and interoperability. These initiatives fall under five broad categories, and you can see that they really span the range of things we need to think about in order to make data, and science generally, more open and accessible.

Within the National Library of Medicine's own strategic plan, data science and open science are a major focus. Our activities fall into three broad goal areas: infrastructure, developing it for information and for data; user experience, getting data and information to people when they need it, in the form they need it; and workforce development and education, for the public and other interested people, to make sure we are prepared to use the data available to us.

So that has been just a little snippet of the many activities going on at the NIH and the NLM to help ensure open science going forward. Policy and implementation is a big part of it. We're also looking at citation and incentivization, at curation at scale, and at how we coordinate and partner with other funders and organizations to ensure that not just data but all research outputs are as FAIR as possible, as well as attributable and sustainable.

So we have a lot going on to make sure people can get access to data and other research products. But what are the impacts of doing that? Once we have all of this data available, what actually happens to it? That is where my research focuses. I'm interested in figuring out: Who is reusing the data we make available? What are the topics of the datasets that are most reused, so we can potentially focus our curation and preservation efforts on those highly reused datasets? When in a dataset's life cycle is it reused, and do reuse patterns over time look like the citation patterns we see for articles, or is something different going on? Where in the world are these datasets being reused? And why are researchers reusing data, that is, what are they actually doing with it in their projects?

It's really difficult right now to track what happens to data that has been made publicly available, because we don't have great infrastructure for doing that.
We have good technical infrastructure for knowing, when I put an article out there, who has cited it; we don't have that yet for datasets, although many scholarly communications communities are working on it. So in the absence of a reliable way to track where datasets are being cited, in my research I used data requests as a proxy for reuse. I looked at three repositories that hold sensitive human data, so you can't just download it and use it: you have to put in a request stating what you plan to do with it, and you even have to have IRB approval from your own institution. Because it's a pretty robust process to apply for and get this data, I think it's a somewhat useful proxy for reuse.

The three repositories I looked at were one genomic repository, the database of Genotypes and Phenotypes (dbGaP), which is housed within NCBI at NLM, and two clinical repositories, one for the National Heart, Lung, and Blood Institute and the other for the National Institute of Diabetes and Digestive and Kidney Diseases. Altogether these include several thousand datasets that have been requested over a hundred thousand times.

Looking at this, I was able to find some patterns in reuse. One finding (this is part of a much larger study, so I'm just going to hit a couple of highlights) is that genomic and clinical datasets are reused in different ways. Genomic datasets in the study were more often used in meta-analysis, taking multiple datasets and putting them together for a new study, whereas clinical datasets were more often used in the context of an original research project, taking a single dataset and using it to ask a research question. This makes sense if you think about how these data types work: with genomic data you need a larger sample size to get meaningful statistical observations, and genomic data is also, generally speaking, much more standardized than clinical data, so merging multiple datasets for analysis is feasible more often in a genomic context than in a clinical one.

I was also really interested in quantifying the similarity between dataset topics, the topic a dataset was originally collected for, and the context in which it was reused. One of the concerns researchers often cite for not wanting to share their data is that they're worried they might get scooped: they put their data out there, somebody takes it and makes the great discovery the original collector would have made, and the original collector doesn't get credit for it.
So my question was: is that really the case? Are people taking datasets and using them for the exact same purpose they were originally collected for? To quantify this, I used a method called semantic similarity. The datasets were all described with Medical Subject Headings (MeSH) terms, and I used the NLM Medical Text Indexer to apply those same terms to the requests, so I could figure out the topics of the requests, the topics of the datasets, and how similar they are. Because the MeSH ontology has a tree form, we can quantify the similarity between two terms. Suppose we have a dataset described as being about heart diseases, and the reuse request is also about heart diseases: those would have a semantic similarity score of one, the exact same topic. On the other hand, if the reuse request was about something like informatics, that's on a completely different branch of the MeSH tree, so the score is zero: they're nothing alike. In between, terms can be more or less similar depending on where they fall on the MeSH tree, and therefore how conceptually similar they are.

What we see when we look at this is, again, differences in the ways clinical versus genomic datasets are used. On the left here are items that scored zero, nothing alike; on the right, items that scored one, exactly the same topic the dataset was originally collected for. Beyond the clinical-genomic differences, what I think is interesting is that a not insignificant number of requests have a semantic similarity score of zero: people are using these datasets for an entirely novel topic, completely different from what the dataset was originally intended for. We also see a pretty broad spread of topics, particularly in the genomic reuse. So while quite a few requests, particularly for the clinical datasets, do have a semantic similarity score of one, identical topics, that's not the case for everything. We're not seeing people just taking datasets and reusing them for the exact same thing.
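To make the tree-based scoring concrete, here is a toy sketch in Python. The talk doesn't specify which tree-based similarity measure was used; Wu-Palmer similarity is one standard choice that behaves as described (1 for identical terms, 0 for terms on entirely different branches, intermediate values otherwise), and the miniature "ontology" below is an invented stand-in for the real MeSH tree.

```python
# Toy MeSH-style tree: term -> parent (None marks a root). The real MeSH
# hierarchy is far larger; these terms and the Wu-Palmer measure are
# illustrative assumptions, not necessarily what the study used.
PARENT = {
    "Diseases": None,
    "Cardiovascular Diseases": "Diseases",
    "Heart Diseases": "Cardiovascular Diseases",
    "Myocardial Ischemia": "Heart Diseases",
    "Information Science": None,
    "Informatics": "Information Science",
}

def ancestors(term):
    """Return the path from a term up to its root, term first."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path

def wu_palmer(a, b):
    """Wu-Palmer similarity: 2*depth(LCA) / (depth(a) + depth(b)),
    where depth is distance from the root; 0 if the terms share no
    ancestor (i.e., they sit on different branches/roots)."""
    common = set(ancestors(a)) & set(ancestors(b))
    if not common:
        return 0.0
    depth = lambda t: len(ancestors(t)) - 1
    lca_depth = max(depth(t) for t in common)
    return 2 * lca_depth / (depth(a) + depth(b))

print(wu_palmer("Heart Diseases", "Heart Diseases"))       # 1.0: identical topics
print(wu_palmer("Heart Diseases", "Informatics"))          # 0.0: different branches
print(wu_palmer("Heart Diseases", "Myocardial Ischemia"))  # 0.8: related topics
```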
Looking at where datasets are reused around the world, we see that they go out to many different countries, but they are primarily used in the US, which makes sense given that these are US-based repositories. We also see that the most overrepresented countries are almost all English-speaking nations: Canada, the US of course, Australia, the UK. This again probably makes sense: you have to submit a request in English, and all the documentation is in English. Of course, many people outside English-speaking nations also speak English, but I think it makes sense that this is the pattern.

In terms of the temporal pattern of reuse, what I find really interesting is that early requests for a dataset are highly predictive of later reuse. This top line is datasets in the 90th percentile of all requests; those are already far more requested in the first year of their life than datasets in lower percentiles. I also did a regression analysis, and even controlling for the age of the dataset, about 75 percent of the variability in how many requests a dataset gets over the long term can be predicted from first-year requests alone. I think that's useful to know: datasets that are already highly requested in their first year are likely to go on being highly requested over time.

The topic of the dataset is also predictive of reuse. I won't get deeply into the method I used here, but I did topic modeling to identify clusters of similar datasets. The dotted line shows, essentially, the number of requests relative to the number of datasets with that topic, so anything above the dotted line is over-requested relative to its representation within the repository as a whole, and anything below it is under-requested. What we see is that topics like blood and cardiovascular diseases are much more requested than less common topics like congenital disorders and rare inherited diseases. This suggests that the more common the disease, the more likely someone is to request the data, which kind of makes sense.

This has been just a small sliver of a much larger study, but a few takeaways. First, researchers are reusing data, so this material really is being used. Improved documentation, potentially in non-English languages, could increase reuse generally, and also in under-resourced countries where we're not seeing as much reuse. Increased interoperability of data appears to be associated with higher reuse across a broader range of topics, as we saw with the genomic data. And curation and preservation decisions could potentially be based on early interest in a dataset: because first-year requests are so predictive of overall requests, we could make some pretty evidence-based decisions about which datasets to focus on for curation. I'll end there; I just want to acknowledge my colleagues in the Office of Strategic Initiatives at the NLM.

I think there are a couple of minutes for questions. Any quick questions for Lisa?

Thanks again, Lisa. You had a slide showing that data that's highly requested early is also highly reused later, and I found that really interesting from, say, an alternative-metrics or attention-gathering perspective. Do the NIH repositories you mentioned showcase the number of requests a dataset has received? Is that information available for users to see, or is it hidden internally within the repository?

It depends on the repository. The dbGaP data, for example, I got from their website: you can go on their website and see the requests. The other two I got directly from the repositories. So it depends; we have many, many repositories at the NIH, and they're not all doing the same thing. But I think datasets that are already getting a lot of attention probably draw still more attention, so one thing I think would be interesting going forward, not just for NIH repositories but for all repositories, is having that information available to people. Thank you.

All right, I'd like to introduce our second speaker for the session: Alexander Mathis, currently a postdoctoral fellow at Harvard and co-developer of DeepLabCut; in 2020 he'll be starting his own lab group at the École Polytechnique Fédérale de Lausanne. Alex.

Well, thanks for inviting me. It's really great to be here and to talk about a software package that we have been developing over the last year and a half, that is already widely used, and that has been great fun. I think this is also a great forum to talk about this package, because there are two aspects where you'll see how
open science was really important in influencing this project, and how the project then changed what others could do based on it.

The general problem DeepLabCut is trying to deal with is the analysis of behavior. Humans have always been interested in studying behavior, and historically that has been very difficult, because there were no high-speed cameras or anything like that. This, for example, is the first high-speed recording of gait, which allowed us to answer scientific questions like: do horses have all four hooves off the ground? Now, what is interesting about the analysis of behavior, in contrast to many other things we want to analyze, is that the computer vision is extremely hard. If you want to teach a computer to detect the hooves and tell what the gait is, that's a very, very challenging problem. A lot of different approaches have been used in the past, like model-based fitting, here from a famous paper by Marr, and others. But for a long time, marker-based tracking, the idea of putting markers at the relevant points of an animal's body, was the gold standard for, say, locomotion analysis. And there are many problems with putting markers on animals or humans: it changes the behavior in some sense, and after the fact you cannot look at how other body parts move in a study, because you already picked the points you record from. These are all issues with those methods.

So one thing that was amazing is that in roughly the last five years, deep learning has changed markerless tracking and what is possible with it. I want to highlight this by showing a video of a very famous algorithm that was actually developed here at CMU, called OpenPose. As you can imagine, it's an open-science algorithm. What you see here is a group of people dancing in Sydney, with the algorithm applied to the scene. The algorithm has not been trained on this scene; it just detects all the different body parts of all these dancers in this very complicated scene, and that is really remarkable performance when you think about it.

Here I want to briefly highlight the history of deep learning for human pose estimation. I think the first paper on the topic came out in 2014, and then the performance of these algorithms got extremely good relatively quickly, up to DeeperCut, for example, and OpenPose in 2017. There was a real flurry of papers, something like 4,000 papers published on this topic very quickly. One thing that I think made the advance of this field so fast is that these people compete on benchmarks (that's computer vision, I'm saying) and they essentially all share their code, so they build on top of each other very quickly and achieve astonishing results.

Now, very briefly, how do these algorithms work? Essentially, they take in the image, and a predictor predicts the pose of the human, which is just the x and y coordinates of all the different body parts. The trick is that this predictor is a deep neural network: it is very deep, with many stacked layers, and it is trained on a lot of data to predict those body parts.
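To make that concrete, here is a minimal, hypothetical sketch of the heatmap idea that underlies these pose estimators. Real systems such as DeeperCut, OpenPose, and DeepLabCut use deep pretrained backbones and more careful training and decoding; this toy network only illustrates the shape of the computation: image in, one score map per body part, argmax out.

```python
import torch
import torch.nn as nn

NUM_BODYPARTS = 4  # e.g. snout, left ear, right ear, tail base

class TinyPoseNet(nn.Module):
    """Image in, one score map ("heatmap") per body part out."""
    def __init__(self):
        super().__init__()
        # two conv layers standing in for a deep pretrained backbone
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, NUM_BODYPARTS, 1)

    def forward(self, img):                   # img: (batch, 3, H, W)
        return self.head(self.backbone(img))  # (batch, parts, H, W)

def decode(heatmaps):
    """Turn each heatmap into an (x, y) coordinate via argmax."""
    b, p, h, w = heatmaps.shape
    flat = heatmaps.reshape(b, p, -1).argmax(dim=-1)
    return torch.stack([flat % w, flat // w], dim=-1)  # (batch, parts, 2)

# Training (not shown) fits the maps to annotated frames, e.g. by
# regressing against Gaussians centered on the hand-labeled coordinates.
net = TinyPoseNet()
frame = torch.rand(1, 3, 128, 128)  # one RGB video frame
print(decode(net(frame)).shape)     # torch.Size([1, 4, 2])
```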
In comparison to previous methods for pose estimation, these networks have a number of extremely appealing properties that make them useful for other fields. They work in the wild, as you can see; you can do this in pretty much any situation. (I think this is in rehearse mode, so it goes at a different timing.) So: they work in the wild, they are extremely robust in contrast to other computer vision algorithms, they are relatively fast, they require no body model (whatever you annotate, the algorithm will in principle learn to predict where those body parts are), and they require no manual tuning.

I'm actually a neuroscientist, and we are interested in studying the behavior of animals, so we need algorithms like this to measure animal behavior. One of the major insights in our paper that came out last year was that you can detect body parts, here the snout of a mouse, and the ears, with less than 10-pixel accuracy if you annotate only about 50 frames, which takes maybe 15 minutes of clicking on 50 snouts. Then you can predict the position of that body part with that precision on days of recordings of behavior. We also showed in that paper that this is not just a feature of mouse snouts; you can use it for many different behaviors. This is a reaching paradigm in which a mouse is reaching for a joystick, and this too had just 140 frames annotated.

This ease of annotating very little data and getting remarkable feature detectors is, I think, what got a lot of people using the code very quickly, and there have been a lot of interesting applications. Going back to the introduction of this session, what is very interesting is that old videos can of course be reused and reanalyzed with these new algorithms very quickly; it's a matter of 20 minutes. We also got contacted, for example, by Amir Patel, who studies cheetahs, which is a wonderful thing to study, and it works well in those contexts as well.

Now, what is interesting about this type of software, which some people have called "software 2.0," is that in contrast to other software, where you program how the algorithm should detect particular body parts, here the software programs itself, in some sense. The specific needs one then has are these. (Sorry, a lot of things are popping up here, security dialogs; it's not my computer.) If you're interested in tracking multiple body parts, here of the hand of a mouse, the user can create a project, extract frames in which the hand is in different postures, annotate the body parts in those frames, select a particular network, and train that deep neural network on the data to predict the body part locations from the image. Then, if it doesn't work well enough, one can refine: add more annotated examples to the stack and train again. Once the network is trained, you can of course use it for inference on lots of other videos, and it works relatively well. This is a simplified workflow of how DeepLabCut works, with a tight integration of the annotation data, the refinement of annotation data, and the neural networks that operate on that data.
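In code, that workflow looks roughly like the following sketch. The function names match the DeepLabCut 2.x releases as best I know them, but the paths and project details are placeholders, so treat this as an outline and consult the current DeepLabCut user guide for exact signatures.

```python
import deeplabcut

# 1. Create a project around one or more videos (returns the path to config.yaml);
#    edit config.yaml to list the body parts you care about.
config = deeplabcut.create_new_project(
    "mouse-reach", "alex", ["/data/session1.mp4"], copy_videos=True
)

# 2. Extract a small set of frames spanning distinct postures.
deeplabcut.extract_frames(config)

# 3. Annotate those frames in the labeling GUI (~50 frames is often enough).
deeplabcut.label_frames(config)

# 4. Create a training set, then train and evaluate the deep network.
deeplabcut.create_training_dataset(config)
deeplabcut.train_network(config)
deeplabcut.evaluate_network(config)

# 5. Run inference on new videos.
deeplabcut.analyze_videos(config, ["/data/session2.mp4"])

# 6. Optional refinement loop: pull out poorly tracked frames,
#    correct the labels, merge them in, and retrain.
deeplabcut.extract_outlier_frames(config, ["/data/session2.mp4"])
deeplabcut.refine_labels(config)
deeplabcut.merge_datasets(config)
```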
Here is a more elaborate scheme. Canonically, the path is that you create a project and go straight down to analyzing your data, but if you expand your project over the long term, say by including differently looking mice, you can go around this loop and include additional annotated frames to make the performance good enough.

DeepLabCut is an open-source package that is built on a lot of other open-source packages; that is really something one should highlight here. We have also made sure that it runs across many platforms. As has come up in earlier talks, I think it's really important that software can be used in the cloud and in Docker containers, so that it's extremely shareable, and also on local university clusters and so on. Even the same project, as you'll see in a second, can be run on different platforms, projects can be shared, and the network weights can be shared. Another type of integration is that DeepLabCut works with a lot of other packages for logging results, mining results, and doing downstream behavioral analysis on top of the pose estimation data.

On this next slide I dig into some of the features. We periodically update the code and put the latest version on GitHub, and we see a lot of chat there, and quite a few contributions people have made to the code. Most of the contributions go toward making it work broadly across platforms; we mostly work on Linux, so it's actually quite a challenge to get things working equally well on all the other systems. There have been a lot of downloads, and a lot of forks, meaning potential contributors who take the open-source code and change something to adjust it to their specific project.

As I said, for the user experience it is quite important that projects can be shared, both the annotation data and the weights. If you have a very good network that can detect the hands of a mouse, you can share it with your colleagues, and they can do the same analysis very quickly, which really contributes to reproducibility of research, because you know that someone will analyze the same body parts in the same relative locations, and so on. We also made sure there is a simple workflow (I'll come back to this) and that there are multiple ways to interact with the code depending on the background of the user: programming and terminal interfaces, and a graphical interface, which I'll highlight on the next slide. I think it's also important, for a project like this, to help people get off the ground with fully worked example projects they can play with in the cloud, without installing anything, to see how it works on, say, our data that is already annotated.

Here is an example video of the DeepLabCut GUI with a project being started. You enter some metadata, and the GUI comes up; you can adjust the body parts to the ones you care about, zoom in, and make highly accurate adjustments. In the next step you create a training set,
where you split the data into two parts, pick from different neural networks, train, and so on.

One thing we realized was extremely important: many of our users are extremely experienced programmers, but some are not. Deep learning software has been really amazing, and bringing it into the hands of people who may not program that well is, I think, very important. What really made this possible for us was writing a very clear-cut protocol: here are seven minimal steps, and if you do this, it should work on your data. As you can see from the number of downloads this user guide has received, that was really useful for a lot of people. We are also very fortunate to be a community partner of the Image.sc forum, where people discuss problems and challenges they have. As I said, the code is on GitHub, with quite a few external contributors and of course a lot of internal contributors to the package and its updates. There's the forum, there's Twitter. Another thing we have had good experience with is training workshops, and in the future we'll also have hackathons on the software. This is a picture from a training workshop; I was invited by students to Warsaw to talk about the software. Especially for developers, that is very useful, because you get direct contact with the people who use or want to use the software, and you figure out what is not working, what is not intuitive, and how you can then optimize your software. And one thing I find greatly enjoyable is our Twitter feed, because people share their results and what they work on, and there are some really interesting applications. For example, I would never have thought that someone would use this for tendon tracking, but it works well there too. (I don't think this plays; okay.)

With this I'll come to an end and thank the other co-developer, Mackenzie Mathis, who is also moving to EPFL, and the many people who have been involved: Matthias Bethge; Tanmay Nath, a postdoc in Mackenzie's lab; and a number of students, Tom and Mert and so on. Of course I also want to thank my postdoc supervisors at Harvard, Venky and Matthias Bethge, and lots of collaborators who made sure this is an exciting project for their data.

While you've been developing this and doing these hackathons, what have you learned about making it possible for someone to get under the hood and repurpose your work in ways you didn't intend, while still making it easy for someone without a programming background to have success with it?

Yeah. The whole package is an open-source Python package, and I would say it has almost two types of APIs. One goes very strictly with the Nature Protocols paper: here are n steps, you will get your result, and it will even plot it for you. And then there are many, many more functions that are documented differently, which people can dig into, add functions to, and tweak.
Hackathons we actually have not done yet; we will have one next year, so I'm very much looking forward to that, and to interacting more with experienced users and with developers of other packages, so that there will be tighter integration between packages that do, say, neural data analysis on top of behavioral analysis, things like that.

Is this being used for any clinical applications, and are people in clinical fields accepting it?

For clinical applications? Yes, there are quite a few labs that use this in clinical applications, for analysis of behavior post-stroke, or tremors, and things like that. The license is fully open source and it's free to use, so it allows these types of analysis methods. Another thing that is actually quite interesting: once you have trained the algorithm on human pose estimation data, if you don't share the annotation data itself, one cannot actually reproduce the data that went into the algorithm, so that is also good for patient privacy.

Do you find the clinicians are accepting?

Well, I personally don't work with any clinicians on this, but I know there are quite a few labs that use it with clinicians, both at Harvard and at many other hospitals.

My question is: are you planning to extend this in a transfer learning framework? You can imagine a lot of behaviors would be very similar, or on the same organism, so would it be possible for somebody to take a model trained on something very similar and use it as an initialization, per se?

Yes, that's actually something we're doing very actively. We will have a model zoo where, for certain paradigms in neuroscience that are extremely common, like a mouse running in a box or a mouse reaching, you can basically just download weights that have been trained on data from many different labs, and that are therefore fairly robust and will work in many contexts; otherwise you can just retrain with a few frames. So that's certainly something we are doing, and it will be released soon. Thank you.

Okay, I'd like to introduce our third speaker in the session: Casey Greene, an associate professor at the Perelman School of Medicine, University of Pennsylvania, head of the Integrative Genomics Lab, and also the director of the Childhood Cancer Data Lab of Alex's Lemonade Stand Foundation. I realize I should have been reading you the titles of the talks as well: he'll be talking about machine learning for rare diseases and the role of open data. Casey.

Thank you for having me. It's fun to be here to share a little bit about the work we're doing. I was asked to share in particular some of the things we're doing in the Childhood Cancer Data Lab, so I'll give you an idea of what that is. I think the easiest way to explain the Childhood Cancer Data Lab is that I and the rest of the team work on a data science team that essentially began with a four-year-old girl. This is a picture of Alex. Alex was diagnosed with neuroblastoma at the age of one. When she was four, she enrolled in a clinical trial, and she told her parents: okay, when I'm done being in the hospital, I'm going to go home, I'm going to start a lemonade stand, and I'm going to raise money for pediatric cancer research.
Her parents were willing to say yes to whatever she asked for at that point, because she didn't have a very good prognosis at the time. She did end up going home and raising money through a lemonade stand: $2,000 at the first stand. By the time she died at the age of eight, she had raised a million dollars for pediatric cancer research, and her parents took her dream and continued on with it. The lemonade stand she started became a nonprofit foundation, now called Alex's Lemonade Stand Foundation, which has raised more than $150 million for research and funded more than 1,000 projects at 135 different institutions.

That is where the story stood when we got involved with Alex's about two years ago. What they had identified was a gap in the field: large data resources were becoming available, but it was very hard for people to take those resources and make discoveries with them. So two years ago they founded the Childhood Cancer Data Lab, which I am helping them launch. The Childhood Cancer Data Lab's mission is to empower pediatric cancer researchers who are poised for the next big discovery with knowledge, data, and tools. On the knowledge side, we run workshops; Qua Chen actually came to one of the workshops we offered last month, and we will be delivering the same information and instructional materials here, as we try to figure out how to scale them by having materials offered at campuses beyond ours. On the data side, we build data resources, which I won't talk about too much. What I will talk about here is tools, and how open data really powers a lot of what we're doing in the Childhood Cancer Data Lab.

Here is a gap we identified early on. We hired a user experience designer as one of our first hires, and what they found was that people are pretty good now at analyzing their own data. Where they struggle is when they need to connect their data to other people's data, or to connect multiple datasets together, and this is particularly challenging when there are multiple tissues involved, or when they have to find commonalities between disease models and tissue samples. In particular, the gap is this: you analyze multiple datasets, you find a certain disease module in one dataset, and what you really want to know is whether what you see in another dataset is the same module. That analysis is very (sorry, I have daycare along as well) very consuming of time and attention: it takes a really well-trained analyst about a year or two to align multiple datasets and try to understand whether the same processes are at play in them.

So, working with Dr. Jaclyn Taroni, the principal data scientist on the Childhood Cancer Data Lab team, we sat down and thought: wouldn't it be great, for these rare disease datasets, if there were a reusable module library, where you could take these modules and look in individual datasets to see whether a module was present in dataset X, dataset Y, and dataset Z? And because you'd be reusing the modules, you'd know it was the same module.
This all sounds wonderful, well and good. Since I'm here, you know we're using machine learning, and I have good and bad news. We've had good experience, but it's worth remembering that we sometimes want to think of machine learning as a panacea for all our problems, and we have not generally found that to be the case. You often have to design a unique approach to each individual problem, and in some cases it's just very hard to use machine learning effectively. In this case, we knew we'd be facing a challenge, because machine learning approaches tend to do really well when you have large numbers of examples. If you think of your data matrix, what you'd really like is a matrix that's very long and not very wide: a modest number of highly relevant measurements, but you know them about many, many people. If you know 30 things about 30,000 people, you can build very accurate models.

Rare diseases invert that. Pediatric cancers, though collectively deadly, are individually rare diseases. The challenge in this setting is that even though we can profile the genome and run genome-wide assays, so we can now know many, many things (we can measure the expression levels of 25,000 genes in the human genome), we are only ever going to measure those expression levels in a modest number of kids. Even though pediatric cancers are the largest killer by disease of kids under the age of 18, there are many different cancers, and individually each one is rare, so we will only ever know all of that about a small number of people.

What Jackie developed was an approach where we build machine learning models that aren't designed for just one pediatric cancer, one biological context: they're designed for many biological contexts. This is where the data reuse comes in. Her hypothesis was that the meaningful patterns we'd want to discover wouldn't necessarily be unique to a single disease; they'd exist across many diseases in many different settings. They'd only work together in a certain way in one disease, but you could discover them anywhere. A more formal way to put this: I've got a whole bunch of samples and genes, and what I really want to learn are patterns, so I go from samples-by-genes to samples-by-patterns; then I take those patterns to a rare disease and ask whether there are patterns significantly associated with having or not having the disease. (Can you wave at the people? Can you say hi? They're all saying hi.) So this was her approach to sidestep the challenge.

We worked with public data. This is again genomic data, though not from dbGaP; we downloaded data from recount2, which is publicly available, so you don't have to go through a data request. We're probably tertiary users at this point, because the data was originally hosted in SRA, it was reprocessed by Jeff Leek's group at Hopkins into something called recount2, and then we're using it from there. I don't know how this ends up getting tracked by the National Library of Medicine, but I think this type of use is probably also really important, so maybe we can figure out a way to better track these things as well. In this case, we downloaded 70,000 RNA-seq samples
that Jeff Leek's group had reprocessed with Rail-RNA. To put a back-of-the-envelope number on what it would have cost us to generate these samples at Alex's: we guesstimate about a thousand dollars per sample just in sample-handling costs, so in total this dataset was worth about 70 million dollars to us. If you think about the scale of Alex's, that's a really valuable resource to have at your fingertips with an internet connection.

We analyzed this dataset (say hi again — the room says hi) using something called PLIER. This is a great town to talk about PLIER in, because PLIER was published by Maria Chikina's group at Pitt, just down the road; if you like this story, go talk to Maria, her work is awesome. This slide is from when it was a preprint; it just came out in Nature Methods, so I updated the slide. PLIER is essentially learning these patterns, but learning them in a way where it gets a small reward if the patterns align with biological processes we care about, and where it tries to make the gene-to-pattern mapping sparse, so that not every gene is connected to every pattern, only a subset of those connections. And since everything in computational biology needs a name, when we considered multi-dataset PLIER, we called it MultiPLIER; Jaclyn's work makes the connection to MultiPLIER.
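For those who want the formal version: as I understand the published method, PLIER frames this as a penalized matrix factorization. With $Y$ the gene-by-sample expression matrix, $C$ a gene-by-pathway matrix of prior knowledge, $Z$ the gene-by-latent-variable loadings, $B$ the latent-variable-by-sample expression, and $U$ a pathway-to-latent-variable mapping, the objective is roughly (see the Nature Methods paper for the exact formulation and constraints):

$$
\min_{Z,\,B,\,U}\;\; \lVert Y - ZB \rVert_F^2
\;+\; \lambda_1\,\lVert Z - CU \rVert_F^2
\;+\; \lambda_2\,\lVert B \rVert_F^2
\;+\; \lambda_3\,\lVert U \rVert_1
\qquad \text{subject to } U \ge 0,\; Z \ge 0.
$$

The $\lambda_1$ term is the "small reward" for latent variables whose loadings align with known pathways, and the $\ell_1$ penalty on $U$ is what keeps the mapping from pathways to patterns sparse.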
I'll give you a brief idea of why we think this model works really well, from a series of three experiments we did to try to understand it. The right box plot here is a dataset comprising all the whole-blood lupus data we could find; we wanted a dataset of the type people generally build for this kind of research. The left box plot is what you get if you take recount2 and subsample it down to the size of the lupus dataset. And the diamond at the top is what you get with the entire recount2 collection. From an experimental point of view, the two box plots are datasets of the same size, but the right one is what you get from the type of analysis people usually do, and the left one is what you get if you just download generic data from the internet; the diamond and the box plot in the recount2 column are the same type of data at very different scales, with the diamond being the 70,000 samples and the box plot matched in size to the lupus set.

Now I can walk you through what we found. First, the number of latent variables: essentially, how many patterns we can pick up. We pick up many more patterns when we have the complete collection of data. That doesn't appear to be driven by the number of samples (those are relatively similar, maybe even a little lower in the recount2 subsample) but really by the scale of the dataset. So that's good: more data, more patterns, which makes sense from a statistical-power point of view. We learned more total things.

We can also ask, because we put pathways into PLIER and it gives us some fraction of them back, how many of the pathways we told PLIER about came back to us. With the modest-sized datasets, the lupus dataset or the subsampled recount2 dataset, about 20 percent of the things we told PLIER about come back. With the complete collection of data, we get just over 40 percent back. That still leaves 60 percent of the pathways we think we know about not coming back, and there are a few possible reasons: they may not be transcriptionally regulated; we may not yet have datasets that perturb those pathways, so even if they are transcriptionally co-regulated we wouldn't see them; and finally, they may not be coherent at the transcriptional level as whole pathways, even if individual parts of them are. So there are a few reasons we might not find the rest of the pathways, but the good news is: with more data, more of the pre-understood knowledge comes back. That's what we were hoping for.

Finally, the question is whether we are only getting more of the things we should already know about, or whether we're learning new things too. We find that we actually learn more new things as well. This is asking what proportion of the latent variables, the patterns, we get back align with prior pathways. With the SLE and size-matched recount2 models, about 50 percent of the latent variables align with some biological pathway we already knew about. Kind of intriguingly, with the complete collection of data it's only about 20 percent: we're getting many more patterns back, and not all of them align with known pathways. So we might be discovering biology that only becomes apparent at this scale of data, which is kind of exciting; we also learn a lot more of these unknown unknowns.

The short summary of this part of the talk: machine learning analyses that reuse data from other settings reach a level of detail that's otherwise impossible. And I know the question about transfer learning came up before. I didn't go into how we're applying this; for the disease we're looking at here, we only had three datasets with about 30 samples each, so we didn't use it in a transfer learning sense, where you port the dataset in and then revise the model a little. But we did transfer the dataset into the model, and we get much better results working with this large collection of data first and then porting that model onto the rare disease dataset.
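As a sketch of what "transferring the dataset into the model" can look like: hold the gene loadings learned on the big compendium fixed and solve for the new samples' latent-variable expression. This ridge-style projection is my reading of the MultiPLIER approach; the variable names and dimensions below are placeholders.

```python
import numpy as np

def project_new_data(Z, Y_new, lam=1.0):
    """Estimate latent-variable expression B_new for new samples.

    Z     : (genes, latent_vars) loadings learned on the big training compendium
    Y_new : (genes, samples) expression for the new dataset (same gene order,
            normalized the same way as the training data)
    Solves min_B ||Y_new - Z @ B||^2 + lam * ||B||^2 in closed form.
    """
    k = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ Y_new)

# Usage: rows of B_new whose values separate disease from control samples
# are candidate disease-associated patterns.
Z = np.random.rand(5000, 200)     # stand-in for trained loadings
Y_new = np.random.rand(5000, 30)  # stand-in for a small rare-disease dataset
B_new = project_new_data(Z, Y_new)
print(B_new.shape)                # (200, 30): latent_vars x samples
```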
To give you an idea of what's downloadable now, if you have an internet connection and want to go above and beyond recount2: our estimate is that there are about 3.8 million genome-wide assays available. If you think about what those would have cost to generate in the first place, we estimate it would have been about a 3.8-billion-dollar effort. So I would highly recommend, if you do biomedical research and you're really interested in genomics, that you take advantage of this resource, because I don't think many of us are going to get four-billion-dollar grants, and being able to have that at your fingertips is just enormously valuable.

The Childhood Cancer Data Lab has been trying to process as much of the publicly available data as possible. There are about 1.9 million samples on platforms we can process, and of those we've been able to process one and a quarter million. In the next month or so we're going to release a large compendium of about one and a quarter million uniformly processed samples for people who want to take it and use it for downstream work.

With that, I just want to thank the people who make this possible. The work I talked about was from the Childhood Cancer Data Lab team: Josh, Jackie, Chante, and Candace on the data science team; Jackie led the MultiPLIER project I talked about. We also have Deepa, our user experience designer, who identifies gaps like the one where it's hard to reanalyze datasets; I didn't get to take you through how we're really putting the model to use, but that sort of multi-experiment comparison is exactly the kind of thing we're doing, and it's one of the gaps Deepa identified. And then the folks who build the infrastructure: David, Ariel, and Kurt have been putting the finishing touches on our system to download and uniformly process one and a quarter million samples. I know we only have time for a couple of questions, but we'll have the panel later, and I'll be happy to take whatever questions you have.

Hi. You mentioned that one of the gaps people were finding was integrating their data with other people's data, or integrating multiple datasets together. Was that because of batch effects, and did you have to deal with batch effects in MultiPLIER to make it learn these patterns across different biological settings?

That's a great question. This isn't the first time we've used this type of approach; I think we had our first paper with this kind of technique maybe seven or eight years ago, using different underlying methods but the same idea. What we found then is that if you work across these large compendia of data, you don't really have to deal with those technical factors. We always had an underlying guess as to why: we thought it was because the biology is consistent, whereas the experiment-to-experiment or batch-to-batch noise tends to be experiment-specific, so if you gather enough data from enough settings, the biology ends up washing out the technical artifacts. That was just a guess for many years, but in the last six months a student of mine, Alex Lee, whose name is up here, has finally been able to rigorously show that this is what happens; I think we'll have a paper or preprint out pretty soon. There's an entire regime where you really, really have to correct for technical variation, but once you get beyond a certain number of datasets, it actually hurts you to correct for the technical variation instead of just letting the biology overwhelm it. So yeah, really interesting.

(Are you ready for your panel? You've been sitting on the panel.) So, Alex, in your talk you talked about how there was a flurry of papers published in this field, and a lot of the progress was maybe due to people vying for positions on a leaderboard, so potentially they had an agreed-upon standard
dataset, and they could all push toward it. Do you think that, for better progress, a key piece is having an agreed-upon baseline? Maybe you've heard there was a study that looked at some open data with baselines, the Netflix data and MovieLens, I think, and showed that on the dataset without a leaderboard and agreed-upon baseline, the published baselines people were comparing against weren't properly tuned; when someone later went on and tuned those baselines, they beat the state-of-the-art methods published over the next five years. Whereas on the other dataset, with a standard baseline, people were tuning and improving consistently. So do you think that's a necessary step: to make sure that along with open datasets we also have agreed-upon baselines and standards, and share those as well?

I think that's very helpful: basically, both make benchmarks and have experts agree on what the right metrics are. Of course, there are problems with metrics if you pick the wrong ones and optimize them, like the impact factor, with bad consequences. But in machine learning, or in biomedical research, that would be extremely useful, and certainly in computer vision it has been extremely useful, as pose estimation shows. A lot of the advances there are actually based on ImageNet, which was one of the largest datasets ever created for computer vision, at least for a long time. I set out saying that there are many very hard computer vision problems, and what is very interesting is that the moment people got very large annotated datasets and started competing on them, solutions to many hard computer vision problems advanced tremendously. And as the example you were just highlighting shows, data can beat everything, in some sense: if you have enough data and good learning algorithms, they will find interesting things beyond what we know so far. Given that, as was highlighted in the introduction, the amount of data we are collecting and storing is growing tremendously, it's actually just awesome, in some sense, that we have the machine learning tools to tackle this data and that the data is becoming available; I think we will make interesting insights. But to get back to your question: yes, I think we should have benchmarks.

I would just add to that more generally: last year at the National Library of Medicine we hosted a workshop on data science drivers, basically bringing in data scientists who don't normally work on biomedical data and asking them what informs their decisions about what spaces to work in, what makes a dataset interesting and compelling. One of the big things they talked about was certain documentation for data that we wouldn't normally think of having for biomedical data. So to the extent that we can have things like benchmarks and other resources like that, we could potentially induce people who wouldn't normally work in this space to bring in other expertise.

Thank you all; I think these were just fantastic examples of open data and open code enabling other scientists. And Lisa, you talked about in your study
how often reuse is not on the same topic as the original collection. So my question: a couple of years ago, the New England Journal of Medicine had a famous editorial about "research parasites," the idea that if you're reusing other people's data without making them co-authors on your paper, you're a parasite. Of course, with everything you just talked about, Lisa, that's impossible: you either share data and enable others, or you don't. Casey has started a Research Parasite Award, I think with the Gordon and Betty Moore Foundation. How much progress have we made in these few years, and over the last ten years, toward completely eradicating the attitude of "either you co-author and get in touch with me, or I'm not giving you my data"?

I can start. I think we do need to develop incentive structures; that's a big piece. We can have policies that require you to do things, but an important question is how we correctly and appropriately reward people, and co-authorship, I don't think, is the right way. So the movement toward creating infrastructure for data citation, to enable tracking, matters, along with just letting people know how to do it properly. I did another study looking at data citation in papers, and it was really all over the place: almost none of the papers had a formal data citation; they mentioned the data in the methods or, you know, wherever. So putting that infrastructure into place is important, and then having institutions also recognize and reward it.

Just very briefly, I agree. I think having different types of citations would actually be very interesting. In some sense it's different to have a citation in the intro for a general phenomenon versus "this data was used in that study" or "this algorithm was used in that study," and it would be good if metrics could reflect this: not a one-dimensional citation metric but, say, a ten-dimensional data citation metric, and so on.

I would agree that the incentives are key. I know you just gave a talk at Alex's innovation summit (actually, I think the talk happened before you were there), but Alex's Lemonade Stand Foundation has changed their grant-making process. Now the resources generated under a grant are a key document, like at the NIH, but unlike most NIH grant mechanisms, it actually counts toward impact: if you have a better sharing plan, your grant is more likely to get funded. That sharing plan includes both future behaviors, what you plan to do under this grant, and past behaviors, how well you can describe your sharing in the past, particularly compelling cases of reuse, and it explicitly asks for cases where you were not a co-author on the papers. The hope is to really change the culture. It's a localized field, pediatric cancer research, but the goal is to make it much more focused on broad sharing, as opposed to what's really common, which is clique sharing: a small group of people who all share with each other within a co-authorship network that is very hard to break into. I personally think (I should say I'm not speaking on behalf of Alex's as a whole) that those cliques limit and slow the progress of science.

Hi, thank you all very much for the talks. I have a question for everybody,
So, hi, thank you very much for the talks. I have a question for everybody, really in general. Recently I have been more and more involved in discussions about how open science and open scholarship policy advancements, the new kinds of ideas being proposed and implemented, can actually sometimes harm researchers who are not as well resourced. For open data, a researcher in Ghana that I'm collaborating with brought up the actual example of how the main advantage his laboratory has over the same project being done at Harvard is that he actually has a lot of data regarding malaria patients in Ghana. When we were talking about open science and open data, he said, you know, "I'm really skeptical about and afraid of putting my data out there, because our lab is three people, and if I put my data out there, a lab with 34 postdocs can very quickly do it." And these are labs that are also funded by NIH, through the H3ABioNet project in Africa. So I'm wondering whether there are any discussions or plans to try to mitigate some of these potential harms that can come out of new open science solutions, because I don't think that just putting a policy out there, and I'm not saying this is what you're doing, without thinking about this concern will actually work.

I don't have answers, but I would say that of the concerns around open science, that's the one I find most compelling: not necessarily that a lab has collected a huge amount of data, but particularly when those labs are in places where they have unequal access to resources and might lose competitiveness if they shared. On the funding side, changing incentives to provide funding that enhances sharing is important. But if a lab in another country is funded by the NIH, or you could imagine labs that aren't NIH funded, it might not be in a place to benefit from that, so there might be more challenges there. So I don't have any answers, but that is one of the things I find both important and troubling.

I think that would be a great thing to submit during the public comment period for the policy. I think the policy recognizes that there's not really a one-size-fits-all approach to sharing: we would like data to be as open as possible, but we recognize that there are definitely some complicated issues here. I don't have a good answer either, but I think that is something NIH policy makers are cognizant of.

So this maybe is a question for the rest of the day too, but I'm curious how you accommodate the challenge of new data types and new layers of organization. I guess this is both an NIH question and maybe a question for everybody: when you have a new data type that interfaces with the set you're working with, where do we begin? I'm speaking primarily as an empiricist who hacks a bit with open data when need be, but I find this adaptation particularly challenging.

I'm not totally sure I understand your question, but I think part of the answer is standards. If the standards don't exist, that requires community effort to develop one, and there are often cases where that can happen by extending existing standards or
modifying existing standards. One of the things we're also working on at the NLM is developing a core set of minimal metadata that would enable us to search across multiple different repositories. There are different ways that researchers in different fields, working with different data types, go about discovering data, so it's really tricky to bring all of those together in a way that lets us meaningfully search across multiple data sets. That is a challenge we've tried to take on, and so I think starting there, and thinking about how the data you're working with might interface with other types of data, could be useful. Hopefully that answers your question a little bit.

I'm guessing you're asking where you put those data and how you publish on those data. I think there are now some really excellent general-purpose repositories; for instance, we put the MultiPLIER model in figshare. The two that we generally use are figshare and Zenodo, and in the case of this model, given the file size and everything, it was just easier to put it in figshare. I would use those as your first home for data if there is no other good repository, because at least then it's preserved, and hopefully you include some description of what those data files are and what's in them. I would hope that as these data types become more widely used, a government-funded, single-point-of-truth repository would emerge, and that if there was a set of things on figshare that had been very valuable, they could be ported into that repository. But once such repositories exist, for instance if you're generating gene expression data with, say, RNA-seq, I would hope you would put that in SRA instead of uploading it to some generic repository, because there's a lot more people can do with it when it goes into those well-designed repositories. Then figuring out where that ecosystem starts to break down at the other end is also probably important: what do you do with data like microarray data, which is still occasionally being generated, but less so? In the case of array data we might be doing okay, because it's actually not that much data, so it's not that expensive to keep storing it. But as we move away from sequencing types that are becoming less and less widely used, the question is really going to come up of whether we keep archiving all of it, and I don't know. I took you from the beginning to the middle, which hopefully gets you part of the way through that.
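As a rough illustration of the minimal-core-metadata idea Lisa described (a small, shared set of fields that any repository could expose, so a single query can search across all of them), here is a minimal sketch; the field set, records, and identifiers are illustrative assumptions, not NLM's actual model.

```python
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    title: str
    description: str
    data_type: str   # e.g. "RNA-seq", "imaging", "survey"
    repository: str  # where the data actually lives
    identifier: str  # persistent ID, e.g. a DOI or an accession number

# Placeholder records from two hypothetical repositories.
records = [
    DatasetRecord("Mouse liver RNA-seq", "Expression profiles, 12 samples",
                  "RNA-seq", "SRA", "SRP-EXAMPLE"),
    DatasetRecord("Malaria surveillance 2015-2018", "Clinical survey data",
                  "survey", "Zenodo", "doi:10.0000/example"),
]

def search(query: str, records: list[DatasetRecord]) -> list[DatasetRecord]:
    """Naive search over the shared core fields, regardless of repository."""
    q = query.lower()
    return [r for r in records
            if q in r.title.lower()
            or q in r.description.lower()
            or q in r.data_type.lower()]

for hit in search("rna-seq", records):
    print(hit.repository, hit.identifier, "-", hit.title)
```

The point of the shared core is the last loop: the caller finds the data set without knowing, or caring, which repository holds it.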
So I would add to that also, and thank you for bringing it up: NIH is interested in, and currently looking at, figuring out what the space of generalist repositories is in the biomedical research data ecosystem. We do have quite a lot of subject-specific repositories that NIH funds or works with, but we recognize that not everything fits within one of those, so what is the best way to handle that? Again, I think the work we're doing on developing a common metadata model will hopefully alleviate some of the issues people have in thinking about where to put things, because ideally it wouldn't really matter where you put something. I shouldn't have to know where the data is to be able to find it, just in the same way that I shouldn't have to know what journal an article is in for me to find it: we have PubMed, so you don't have to worry about that. So the next step for us, I think, is how we develop something like a PubMed for data, where it doesn't matter where you put it, you'll still be able to find it.

My question is about the way data scientists in the medical field collaborate, because my background is in engineering, where you would meet with competitors: people at Uber and Argo AI would come together and talk about issues. But then, through my wife, I would be talking to data scientists at the immunology lab, and then I would talk to another friend who's a data scientist in the dental school, and they don't know each other and don't talk to each other, and I'm like, you guys are dealing with similar problems. I'm just curious what the general trends are in collaboration in the medical field; I come from an engineering firm, so I'm not too familiar with it.

One thing I'll say, which again came out of that workshop we held at NLM, is that part of the hesitance among data scientists to collaborate with biomedical researchers may be that they don't want to be seen as providing a service; they want to be seen as co-equal collaborators. This doesn't totally answer your question about conversation among different labs, but teaching biomedical researchers how to meaningfully work with data scientists, in ways where they are, again, co-equal collaborators and not seen as a technician in the lab, matters. Part of that is teaching people the language to speak across disciplines, which is a tricky thing to do. As for the silos, that is, I think, always going to be an issue, and I don't know that I have a good answer for it, but maybe one of you two does.

Yeah, I guess I would say I don't have an answer on the silos necessarily, except that hosting interdisciplinary events and getting people into the room at the same time is probably helpful.
You know, there's a certain service model in biomedical research that I think became very entrenched with biostatisticians over a reasonably lengthy period of time: to get your grant funded, you had to have a biostatistician on that grant, but you needed them for some percent of time, not that much time, so people got spread across a huge number of projects in these types of service roles. In computational biology and data science, folks may often be seen through that same model, and I think it's bad for the field when that happens. Fixing that, and fixing those perceptions, is really important. I can't tell you how many times someone has come to me and said, "Hey, I can give you five percent effort on this R01 if you'll just solve this quick and easy problem," and you look at the problem and you're like, well, that's at least two R01s' worth of computational effort just to figure out whether it's a solvable problem. So I think figuring out how to fix that culture is probably important to getting more buy-in across disciplines, in terms of communicating about the problems and solutions.

Yeah, so the question was whether the grant structure and incentive structure are potentially causing the siloing. That's a potential hypothesis I don't think I could discount; let's put it that way.

Yeah, I think that question about incentives is an interesting one. That was another thing that came up in this workshop: when you're talking about data scientists, the places they want to publish, or the things that are meaningful for moving forward in their careers, are really different from what a biomedical researcher has. So that's a tricky question to think about when you're starting a collaboration: what is a meaningful outcome for everyone involved?

I guess, from the postdoc perspective, in some sense one argument is always that papers are the most important thing, so people focus on their main project, which of course I think is suboptimal.

I guess we're going to do one more. To contrast with industry: many of the problems that data scientists work on in industry have a clear "did this work or not," and your payoff is six months to a year away, maybe. In biomedicine, there was a paper in PNAS not too long ago which says that the lag from basic science discovery through a drug that improves people's lives, the translational lag, is 17 years. So that really long feedback window may also lead to some of this difficulty in aligning incentives.

At this point we have breaks and lunch to keep the discussion going. Um, we'll take a break now for some coffee, and we'll start back; there's also water in the back of the room as well as coffee at the next