Okay, hi everyone. Welcome back. We had some really great conversations this morning, and I'm looking forward to the afternoon sessions. Our first speaker in session three is Saskia de Vries. She joined the Allen Institute in 2012 as a scientist on the neural coding team. She has a background in systems neuroscience and has studied visual processing in both vertebrate and invertebrate systems using a combination of physiological, computational, behavioral, and molecular tools. And with that, I'm going to hand it over to you.

Great, thank you. Thank you for the introduction and for the invitation to speak. I've really been enjoying today's sessions and hearing from everybody about different aspects of open data and some of the challenges we face in sharing and using it. What I want to tell you about today is some of the work we've been doing at the Allen Institute. For those of you who know the Allen Institute for Brain Science, we are a non-profit research institute, and our mission is to accelerate the understanding of the brain by creating public resources through a big team science approach. We make these data freely available so that the community can use them to accelerate their own research. Over the last decade-plus, we've generated a number of different data sets, many of which you can see on this timeline, and the one I'm going to tell you about is the one on the far right, the Allen Brain Observatory. This is our first in vivo physiology data set. A lot of our previous data sets focused on things such as gene expression in different structures of the mouse or human brain, and then we moved into looking at connectivity, the connections between different parts of the brain. But this was our first foray into collecting data in the awake, living animal and generating a data set from that.
So I'm going to tell you about our two-photon data set. Two-photon imaging is a data collection method where we use calcium indicators and a two-photon microscope to image the activity of large populations of neurons simultaneously. In creating this data set, we set out to create a survey of physiological activity throughout the mouse visual cortex. As I mentioned, a lot of the previous data the Institute has generated has been about gene expression or cell identity; here we want to look at the activity of neurons, so we can start asking questions about the computations going on in these neural circuits. So we wanted to be able to survey across different areas, different structures in the brain. This is an image of the surface of the mouse cortex: here is the primary visual cortex, surrounded by other visual areas, and we collected data from about six different visual areas. We also collected data across a number of different cell types. We could use genetic tools available in the mouse to limit the expression of our calcium indicator to particular subtypes of cells, say excitatory cells or inhibitory cells, so that we could start to unravel the different functions of these cell types. And because we're interested in the computations and visual representation, we used a wide range of visual stimuli, so we can look at how stimulus statistics might affect some of these computations. So this is an example, let me play this movie, of what the data collection looks like. As I mentioned before, we're using a calcium indicator: this is a fluorescent indicator, and whenever a cell fires a spike, calcium floods into the cell, binds to the indicator, and the cell fluoresces.
You can see in this movie, which is what we collect through our microscope, there are probably a few hundred cells in this field of view, and different cells light up at different times depending on what's being shown to the mouse, whether the mouse is running, and things like that. So we're collecting this movie of calcium activity at the same time that we're showing different types of stimuli to the animal. I'll play that again. Sometimes it's noise, sometimes it's movies; you can see the types of stimuli we show. We also track the eye that's pointed at the monitor, so we can track where the pupil is located; that's what this little red dot indicates, the times when the eye moves. We also have information about whether the mouse runs or is stationary during the experiment; you can see here's a place where the mouse starts running. So this is what one session looks like. We collect this through, as I mentioned before, a big team science approach. We have a standardized, high-throughput data collection pipeline, where each stage of data collection is carried out by a dedicated team of technicians: for instance, generating our transgenic mice; then a team of surgeons that puts a cranial window into the skull so that we have optical access to the cortex. We use intrinsic signal imaging to create a map of where the different visual areas are, so that we can target our data collection to the specific areas we're looking to get data from. We spend some time habituating the mice to these experiments before we begin data collection with our in vivo imaging team. For a given field of view, where we've got a group of cells, we'll return to those same cells across many days, usually three days, to collect our full stimulus set. And then we'll collect as much data from a mouse as we can, from different fields of view.
When we've finished collecting the data, the animal is euthanized, and then we do post hoc serial imaging that allows us to reconstruct the volume of the brain. We can use this to make sure that, for instance, the genetic tools we used were expressing the indicator in the right population of cells, and that the health of the brain was good. This standardized pipeline enables us to collect a lot of data, but it also has the benefit of allowing us to establish and enforce strict quality control metrics. At each stage of the pipeline, we assess various metrics to make sure that the animals are healthy and that the data we're collecting are good data we can use. This Sankey plot shows you the mice that have entered our pipeline and where they fail out of it for various reasons. A lot of these failures have to do with animal health, but some have to do with data integrity. You can see there are more failures at our imaging stage than at other stages: issues with the microscope can cause failures, maybe for just a single imaging session, which we can repeat, but if there are too many of those failures for a given animal, we'll remove that animal from the data set altogether. The result is that after running this pipeline for about two years, we've collected over 1,400 hours of imaging from 456 different fields of view, 456 different groups of cells. This is obviously a large amount of data. You can think of it as separate movies, like the movie I showed you earlier, and that's really exciting. But for most of the ways we want to work with these data, the movies aren't what we want; we want to think about what the individual cells are doing. So we developed an image processing pipeline that takes all of these pieces of data and extracts the information we're looking for.
The biggest extraction involves the calcium movie, where we want to identify the individual cells and extract the fluorescence traces for each of them, to pull out the activity of all the cells in a given field of view. We have motion correction, cell segmentation, and demixing algorithms that give us activity traces for each cell. In this plot you can see, across an entire session, the activity of about 50 of the cells in one experiment. We use our information about the stimulus, so that we know what stimulus is being presented when and it can be temporally aligned with the activity, along with the information about the pupil location, where the pupil is pointed on the monitor, as well as the running activity of the mouse. All of this gets integrated, and we bundle the data together into a single file. We use the Neurodata Without Borders (NWB) file format, a standardized file format for this type of data that makes it easier to share and use these data across groups. So all of this information gets put into these Neurodata Without Borders files. This summarizes the data we've collected: in total, we've got recordings from over 63,000 neurons from 456 different populations. Some analyses focus on individual neurons; other analyses consider the interactions of these cells within a population. As I mentioned earlier, the data were collected across lots of different cell types, where we've limited the expression of our indicator to particular types of cells in the cortex, and we've collected data from six different cortical areas; you can see how the data break down here. As we start to dive into the data, we see that there are lots of different things going on with these neurons. These are visualizations that I'm not going to unpack, but they summarize some of the responses we see to the different stimuli we show.
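To make two of the processing steps concrete, here is a minimal, hypothetical sketch, not the Institute's actual pipeline code: computing a dF/F activity trace from a raw fluorescence trace using a simple sliding-window baseline, and temporally aligning stimulus onset times to imaging frames. The window size and low-percentile baseline are illustrative choices, not the published parameters.

```python
from bisect import bisect_left

def dff(trace, baseline_window=50):
    """Convert a raw fluorescence trace to dF/F using a sliding-window,
    low-percentile baseline (a simplified stand-in for the real pipeline)."""
    out = []
    for i, f in enumerate(trace):
        lo = max(0, i - baseline_window)
        window = sorted(trace[lo:i + 1])
        f0 = window[len(window) // 8]  # low percentile as baseline estimate
        out.append((f - f0) / f0 if f0 else 0.0)
    return out

def align_stimuli(frame_times, stim_onsets):
    """For each stimulus onset time (seconds), return the index of the
    first imaging frame at or after that onset."""
    return [bisect_left(frame_times, t) for t in stim_onsets]

# Toy example: 10 frames sampled at ~30 Hz, two stimulus onsets
frames = [i / 30.0 for i in range(10)]
onsets = [0.05, 0.20]
print(align_stimuli(frames, onsets))  # → [2, 6]
```

The same alignment idea extends to the pupil and running traces, which are recorded on their own clocks and mapped onto the imaging frames.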
You can just see that there's a lot of data here: 63,000 cells, a lot of different stimulus responses, and there's a lot that can be done with them. We make our data openly available; we can do a lot of analysis ourselves, but we really want this to be something that the community can use and does use. That brings me to the next part, where I want to talk about how we effectively share this type of in vivo physiological data, and about the different levels of data that we collect and use. With the calcium imaging, we start by collecting the raw movies I showed you earlier, collected through our microscope, with lots and lots of cells lighting up at different times. These movies are great to watch, but they're really big: about 60 gigabytes per movie, and if you extend that across roughly 1,500 hours of imaging, it becomes a really large data set. Some of our users might want these raw movies, but they're hard for us to share because the files are big and clunky, and in addition to each movie there are a lot of auxiliary data files that need to go with it to provide the synchronization and stimulus information. Those movies go through the stages of processing I mentioned before, where we identify where cells are located in the movie in order to extract the activity traces for individual cells. We also do other annotation processes: if there are cells that are overlapping, we can demix those signals to separate them, and we do our temporal alignment so that all the stimulus information is integrated with the activity traces. And then we can also summarize these responses: we can compute metrics about receptive field sizes or tuning properties of these neurons.
So when we think about what data to share, we could go all the way from our raw data, these big, awkward files that require a lot of processing before they're useful, down to these derived metrics, which are pretty easy to share. We could have a simple spreadsheet with 63,000 cells and hundreds of computed metrics for each, and that's very easy to post somewhere, to put in a repository, to email around. But what really is the most effective and useful level for sharing these data? There are some use cases for the derived metrics and visualizations, and as I mentioned, they are in many ways the easiest data to share. In fact, on our website this is pretty much what dominates the data we have available. This will take you to the website for this particular data set. This is the landing page summarizing the data we've collected, and we've got visualizations summarizing each experiment: summaries of the populations recorded in each session, with summary statistics about, say, the eye position or the running speed of the mouse, as well as summaries of various response metrics we can compute. We also have visualizations for individual neurons that provide tuning properties and derived responses for each of the 63,000 neurons; you can dig into those individually, start to interact with the data more, and get a better sense of how individual cells respond to the different stimuli we show. This is really useful for exploring the data set, understanding what data are available, and sometimes for identifying a particular session or a particular cell to focus on for your analysis. But for doing rigorous analysis, this isn't really the tool most people want to engage with.
Where most of our users actually want to work with the data is with the time-aligned activity traces that we have packaged in the NWB files. This is where our software development kit, the Allen SDK, provides programmatic access: it enables users to download the data, keeps the data organized, and lets them access all of the contents of our NWB files. It also has some functions to begin analyzing the data, which people can build on in their own directions. But there are some people who do actually want to use those raw movies, partly because a lot of the processing we do, moving from raw data to derived metrics, is still an area of open research. I think this may differ across the data modalities being shared: the community has different ways of doing segmentation or demixing, and we don't yet have a consensus on the best or right way to do it. So there are people working on those questions for whom access to the raw movies is more useful than access to the derived traces. For that, we've put all of our raw movies onto AWS as a public data set; there's something like 82 terabytes of these movies available there. Because they're so large, we needed to partner with AWS to make them easily accessible. But regardless of the level of data people want to interact with, one of the key things I most want to point out is the need for documentation. We have a platform paper that describes our data set and some of the analyses we've done with it. That's a standard scientific paper, and it focuses on the scientific results, but the information in that paper is not really sufficient for people to access and use the data effectively. This is something one of our speakers this morning touched upon.
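As an illustration of the access pattern described here, a sketch of the general idea rather than the Allen SDK's actual implementation, a cache-style interface downloads each file on first request and reuses the local copy afterward, which is what keeps large NWB downloads organized and repeatable. The `fake_fetch` function and the file name are hypothetical stand-ins for a real download.

```python
import os
import tempfile

class DataCache:
    """Minimal manifest-style cache: fetch each remote file once,
    then serve the local copy on later requests."""
    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch  # callable(name) -> bytes; stands in for a download
        os.makedirs(cache_dir, exist_ok=True)

    def get(self, name):
        path = os.path.join(self.cache_dir, name)
        if not os.path.exists(path):          # download only on first access
            with open(path, "wb") as f:
                f.write(self.fetch(name))
        return path

# Toy usage: count how many times the "download" actually happens
calls = []
def fake_fetch(name):
    calls.append(name)
    return b"nwb-bytes-for-" + name.encode()

cache = DataCache(tempfile.mkdtemp(), fake_fetch)
p1 = cache.get("session_501.nwb")
p2 = cache.get("session_501.nwb")
print(calls)  # → ['session_501.nwb']: fetched just once, reused the second time
```

The real SDK layers analysis helpers on top of this kind of cache, but the core benefit is the same: analyses can be rerun without re-downloading terabytes of data.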
So in addition to our platform paper, we've got white papers on our website that document how the data were collected and processed: what the transgenic tools are, and how we processed and analyzed the data. These really enable users to better understand what the data are and how they can use them, as do the tutorials we have showing how to use the SDK to access each of the pieces of data, so that people are able to actually work with it. This is much more detailed than the methods you'll find in our scientific paper; there are the methods we submitted, and then the methods that were published, and what users actually need is a lot more than either of those. In the few years that this has been available, there have been about two dozen papers using these data, some from within the Institute, but others from outside. And again, they work at different levels: some of these papers look at questions of segmentation, whereas others look at questions of stimulus coding in this data set. To wrap up, and I apologize for going a little over time, I want to mention that in the last year we've released a second data set through a parallel pipeline: where before we were using calcium imaging, these experiments use Neuropixels probes, six high-density electrodes that record data with spike-time precision from both cortical and subcortical areas. We use the same stimuli and the same pipeline, but now we've got better temporal resolution and access to different areas. All of these data are available, and I've already mentioned our website with the links for the software kit and the documentation. We've also got a forum where people using these data can ask questions when they run into trouble. And with that, I want to thank you for your attention.
I also want to recognize the founder of our Institute, Paul Allen, for his vision, encouragement, and support in generating these large open data sets, as well as all of my team members who have contributed to these data. So thank you.

Great, thank you so much, Saskia. If anybody has questions, please send them to me in the chat, but I actually have a question. My background is in neuroscience, and I find it really striking the number of mice you're able to acquire data from in this study; I know this work can be slow. I was wondering if you could comment on the team-based approach that Allen uses and how that impacts efficiency, and whether you think neuroscience as a field would benefit from more of this team-based type of work.

Yeah, I think there are ways in which it really helps, and there are ways in which it maybe isn't as helpful as it sounds. Getting it built up took a lot of time, and I know a lot of academic labs that could crank out these experiments while we were still building our teams and training people. But in the end, we have this nice pipeline where we've got a team of surgeons who are really good at surgery, so every week we know a certain number of mice are going to enter surgery, and we have estimates of how many are going to come out. There's always some variability, because there is quality control in there. So it does help to have these dedicated teams. At the same time, people get bored doing the same thing, and one of the things we found recently is that it helps to have people who bridge some of the different teams, so that they're learning different skills and getting exposed to different aspects of the work. I think that improves morale, which in the long run improves our throughput and productivity.
So I do think it's useful and effective, but there are parts of the pipeline that have to be very rigid, which makes it hard to be flexible. In a lot of academic research you need to be able to change things very easily, and our system isn't really suited for that in its current iteration, though I think it could be enhanced for that.

Great, thank you so much. It's interesting to hear both the benefits and challenges of that type of approach. We'll take one more question now, then move on to our next speaker and take the rest of the questions during the panel. The next question is from Alex; Alex, please feel free to unmute yourself if you'd like to elaborate on this at all. The question is: how supportive is the Allen Institute of all the time and effort spent on that extensive documentation, tutorials, and software? A common reason this kind of work doesn't get done is that it doesn't count for tenure in the academic system.

Yeah, this is a great question and a really important thing to hit on. Part of it is that, because this is such a fundamental part of our mission at the Allen Institute, we've got teams of software engineers whose job is developing our website and our software kit and maintaining those tools. We've got people for whom helping users and developing our tutorials is part of their job. And this, I think, is one of the challenges that academic research in particular needs to start thinking about: how do we lower the barriers to making data openly available in accessible ways, without asking the labs that generate data to also take on this software engineering role and user support role?
Because that puts a big extra cost on the data generators, the labs making the data, a cost that isn't shared equally across the community, and it is going to slow down their progress and slow down things like tenure. So I think as a scientific community we need to think about what incentives and tools we can put in place to make this a lot easier, to lower that burden, and to make it more of the norm, so that it becomes not only expected and accepted but recognized and rewarded.

Great, thank you. Yes, a similar question came up at our conference yesterday; this is a really big issue, the fact that data sharing is often not part of the tenure process. Okay, so we will move on to our next speaker now. Our next speaker is Dr. Lex Crawfitz.

Great, thank you, Saskia, for a great talk. Okay, you're seeing my screen? Yes, we can see it. Great. Thank you to everyone for tuning in, and to the organizers for hosting this in this difficult year. My name is Lex Crawfitz. I'm an associate professor at Washington University, and the topic I'm going to discuss today is another aspect of, in this case, small-scale data sets, but with a vision for making them large-scale: behavioral neuroscience. We heard from Saskia about sixty-something thousand neurons, which is incredible; when we talk about behavioral neuroscience studies, I'll show you data where we're often talking about many fewer subjects, and this becomes its own challenge. I have no conflicts of interest. In the spirit of openness, feel free to screenshot or share anything I'm showing; nothing is confidential, and the slides themselves are available at this link. So I'll start by talking about the idealized scientific process, which I've drawn as a linear arrow, though we all know the process often deviates from that arrow in many ways.
At each of these points, we can talk about improving the rigor or the shareability of that step. The point I'm going to talk about today is one that isn't discussed so much in behavioral neuroscience: the acquisition itself. I'm going to give you examples from my field, but I believe some of them are relevant to other fields as well. There are two specific challenges I'll bring up that our lab has been trying to address, and to address with open source methods. The first challenge is that a lot of our acquisition is done with proprietary hardware. That in itself isn't such a problem, but it often results in proprietary software and data formats as well, which can become much more problematic. I've seen my share of exporting into formats, importing into other ones, or USB keys going around the lab to different computers to move things along. It often ends up being a hurdle that's not conducive to a fully automated or reproducible workflow. The second challenge, which I hinted at a minute ago, is that our results are almost always underpowered, and not by a little bit. We'll get to my opinion on why these behavioral studies are so underpowered, but first I'm going to show you some data to make this point clear. This is a widely cited article by Katherine Button and colleagues with a catchy title, "Power Failure." They talk about how most of the published work in neuroscience, and they focus a lot on behavioral neuroscience, is vastly underpowered, and how that has devastating consequences for reproducibility. To look at one figure from this paper: here they analyzed, I believe, 122 studies, and they found that only about 15% had what they'd consider adequate statistical power. That is, if there was an effect there to be discovered, only about 15% were powered to discover it.
To throw out some real-world numbers: for a typical type of behavioral task, they estimated that the sample size should be about 130 animals, and the median in the studies they analyzed was 22 animals, about six-fold lower. So why don't we test 130 animals in our behavioral studies? This got me thinking a lot, and in my opinion the answer is that for most academic labs this just isn't feasible. We don't have the equipment or the time to run studies at this scale, and even if someone gave us the equipment and lots of money to hire people, we wouldn't have the space to house that type of operation. So it's really our infrastructure that is not compatible with high-throughput behavioral studies. What I'm going to show you over the next nine minutes or so is a different infrastructure that can allow individual researchers to run hundreds of animals in these studies. The main caveat is that this is an ongoing vision, so we're not quite there, but I'm going to show you the progress we've made towards it. I started by asking how other fields have solved this exact same problem. Many fields at some point want to scale up their throughput, and they usually turn to some combination of miniaturization and automation to do so. For example, in microbiology, the advent of the well plate, the plates in these photographs where individual experiments can be run in each well, allows researchers to run hundreds or even thousands of experiments simultaneously on one plate. You can see that this is a huge advance over running things in test tubes one at a time, which is what was going on before these plates were invented in the 1950s. So this has been a long-standing improvement in microbiology.
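The 130-animal estimate is consistent with a standard two-group power calculation. As a quick back-of-envelope illustration (my own sketch, not the paper's exact method), the normal-approximation formula for the per-group sample size at a medium effect size (Cohen's d = 0.5), 80% power, and a two-sided alpha of 0.05 gives roughly 63 animals per group, about 126 in total:

```python
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sample comparison:
    n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2"""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance criterion
    z_beta = z.inv_cdf(power)            # desired power
    return 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2

n = n_per_group(0.5)
print(round(n))  # → 63 animals per group, ~126 for a two-group study
```

Against that requirement, a median of 22 animals per study makes the field's power problem easy to see.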
In our field, we have a piece of equipment invented in the 1930s but still very much in use today, which we call the operant box, or Skinner box; it's probably one of the most widely used pieces of behavioral equipment in my field. The idea is that you put an animal, a rat or a mouse, in the box; it can push a lever and receive food pellets, and based on how you program the box, you can gain insight into how the animals learn complex tasks or how motivated they are. However, these assays take a long time. When we put an animal in one of these boxes, we might move the animal into the box, have it be there for a couple of hours, and move it out; when I look at the well plate, it's almost as if, when we're done with that, we've filled up one well in the plate, and we'd have to run these around the clock for months just to fill up a plate. I think this is really the crux of why our behavioral experiments are so underpowered: this infrastructure just doesn't support high throughput. So we started thinking about how we can solve this. This is a photo of a colony room; if anyone is not familiar with rodent colonies, there are rooms like this in every university and many private companies. What we're looking at here are four racks of cages, and each cage contains between one and five mice, so we're looking at probably about 600 cages, about a thousand mice or so, just in this one room. For several reasons, including the space concerns that are present really everywhere, we decided that the best place to do high-throughput behavioral neuroscience studies would be in these home cages themselves. The cages are there, the mice are there, so we basically want to put sensors in the cages and test an entire rack at once. The vision really is: could this colony rack be like the 96-well plate for behavior?
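To make "how you program the box" concrete, here is a toy sketch (my own illustration, not any lab's actual device firmware) of a fixed-ratio schedule, one of the classic operant programs, where every N pokes on the active side dispenses one pellet:

```python
class OperantBox:
    """Toy fixed-ratio (FR-N) operant schedule: every `ratio` pokes on the
    active side triggers one pellet; inactive-side pokes do nothing."""
    def __init__(self, ratio=1, active_side="left"):
        self.ratio = ratio
        self.active_side = active_side
        self.active_pokes = 0
        self.pellets = 0

    def poke(self, side):
        """Register a poke; return True if a pellet was dispensed."""
        if side != self.active_side:
            return False                      # inactive poke: no reward
        self.active_pokes += 1
        if self.active_pokes % self.ratio == 0:
            self.pellets += 1                 # dispense a pellet
            return True
        return False

box = OperantBox(ratio=3)                     # FR-3: reward every third active poke
events = [box.poke(s) for s in ["left", "right", "left", "left", "left"]]
print(events, box.pellets)  # → [False, False, False, True, False] 1
```

Raising the ratio over days is how experimenters probe motivation: the harder the animal is willing to work per pellet, the more motivated it is.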
So this is what I'm calling a home-cage vision for behavioral neuroscience. The idea of doing experiments in the home cage is somewhat distinct from many of the workflows that exist in behavioral neuroscience, but what we wanted to do is start putting sensors in, and we started designing them in an open source way, mainly because we are not a company; I run an academic lab. We felt that the open source approach distributes the manufacturing and the implementation to other labs: if other people want to build a device, they can simply look at the plans and build it. So we made our own version of an operant box, which I'll tell you a little about, like the box we put the rat in, except now it's a small device that sits in the cage. We wanted to measure activity in the cage, so we put sensors on the ceiling. There's a lot of interesting information in the environment: not just light levels and temperature, but even some chemical species the mouse might produce, like carbon dioxide, that can be analyzed right out of the air. And we wanted everything, again, to be open source. This is obviously an interest of mine, but also, as we or other people come up with new ideas and new tests, we don't want the limitation of everything having been developed in some way that's really hard to access; we wanted it open source so new tests could be developed and easily integrated. And finally, because this vision involves hundreds of cages, we need a way to connect them all, and it should be wireless, because we're not going to have that colony room I showed you with 600 wires going into the cages. So I'm going to tell you about this vision and how far we've gotten. The first device I'll tell you about is called the Feeding Experimentation Device version 3, or FED3; this is our small in-cage operant device. I'll mention that this is an evolution of an earlier device that was published by Katrina Nguyen in my lab in 2016, and
Katrina is actually now a VME student here at Carnegie Mellon, so I don't know if she's on this call, but I thought it was neat to come full circle. This device is wireless and contains a lot of hardware: it's based around a pellet dispenser, just like the classic operant box; it has nose pokes the mice interact with; and it has lights and tones. You can learn a lot more about it at the GitHub link, including all the design files. But where it really deviates from the more traditional solutions is that it's small, I have one here, it's designed to be placed right into a cage, and it's quite a lot cheaper. We're building these ourselves, and as I mentioned, this is not really a commercial endeavor, so a lot of that cost savings comes from people putting in the labor themselves, but it's about 20-fold cheaper than existing solutions. I think that's important if we ever want to realize this vision of improving the power of these studies by six-fold, or ideally by more than six-fold, though that would get us to the minimum. This is a video of it in action: this is the inside of a cage, and the mouse is going to run over, poke on the left, the device lights up, and it dispenses a pellet that he can eat. This can sit in there and run around the clock, and we can program it in different ways to study different aspects of behavior. Here is some data: we're looking at seven cohorts of mice, representing about 150 animals, over time from zero to 16 hours. Time zero is the first time they've seen this device; by hour 16, running overnight, they've become quite experienced with it. We see that in all these mice the level of poking goes up, and in addition the accuracy, whether they're poking on the correct side, goes up, showing that they are starting to
While all 150 mice weren't run simultaneously, as I mentioned in my earlier analogy, it does show that having a cheap device like this, even used serially, allows you to achieve the type of power that moves you into high-throughput behavioral studies. Because the device lives in the home cage, it can also do some things that are typically not done. Here we're looking at six days of data, nose poking around the clock, and we see some things that are expected, like a nice circadian rhythm, and some things that may make you think there's quite a lot of variance based on where in the day the animal is poking. So it really tells you about the importance of understanding whether you're running the animals in the morning or the afternoon. I could tell you a lot more about this, but in the interest of time I want to show you one more device. If you're interested in learning more, you can check out the GitHub link, which has all the design files; I'm aware of about 30 labs that have built it themselves, and a few that are even now starting to modify the design, which is really the promise of open-source dissemination: people don't just use the design you made, they actually improve it for their own use. So in the last minute or two I'm going to tell you about our work transmitting this data to the cloud. We partnered with a company called MCCI, an Internet of Things company, to take us through this next step. What we designed was an in-cage wireless device that allows us to transmit both environmental and behavioral data to the cloud. There's a schematic of it on the left, I have a device right here, and there's a photograph of it in a cage. We've been calling this MouseRat, and if you check out mouserat.org you can see an example of the types of data; you can also reach out to me to get a
little bit more in-depth. I'm going to quickly show you a tour of how this works and then tell you about the implications. Is that running? Okay, I think so. We have these devices posting to an internet gateway. This is Grafana, a semi-commercial, open-source product that we're using, but the data could go to many different dashboards. What we're looking at here is about 30 days of data, and if I zoom in on the beginning of July we can see both activity data and pellet data, all posting in real time right out of the cages. What this really does, conceptually, for behavioral neuroscience is provide a place where the data can be aggregated: it's automatically backed up, it's very easy to share (I can give people links and they can immediately see my data and download it), and it also becomes a way that multiple labs could have systems running in their own cages and post data into the same dataset, all very seamlessly. So, the big picture, and I'm going to end here: I told you about a specific application in behavioral neuroscience, but I think there's a concept here of moving the reproducible pipeline one step back from the analysis pipeline to the data acquisition itself, such that as animals complete these experiments, as they eat each pellet, even that data can be posted, enter analysis pipelines, and update conclusions in real time. I believe that having this pipeline can really facilitate reproducible, scalable, and shareable science. I'd like to acknowledge all the people who contributed to the work I showed: FED3 is something that we've been working on since Katrina Nguyen made the first one in 2016, and many people since then have contributed to it, and MouseRat is a device that we're working on with the company MCCI.
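To make the real-time posting workflow described above concrete, here is a minimal sketch of the software side: each in-cage event is packaged as a small JSON record and POSTed to an HTTP ingestion endpoint that a dashboard stack such as Grafana can sit on top of. The URL, field names, and HTTP transport here are all illustrative assumptions; the actual MouseRat hardware and the MCCI-designed telemetry may work quite differently.

```python
import json
import time
from urllib import request

# Placeholder endpoint for illustration only; not a real ingestion service.
GATEWAY_URL = "https://example.org/ingest"

def make_event(device_id, event, value):
    """Package one in-cage event (a poke, a pellet drop, a temperature
    reading) as a JSON record. Field names here are hypothetical."""
    return json.dumps({
        "device": device_id,
        "event": event,
        "value": value,
        "ts": time.time(),  # the time-series store aggregates on timestamps
    })

def post_event(payload, url=GATEWAY_URL):
    """POST one JSON record to the ingestion gateway."""
    req = request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.status

payload = make_event("cage-07", "pellet", 1)
# post_event(payload)  # commented out: requires a live gateway
```

Because every record lands in one shared store, multiple cages, or even multiple labs, posting to the same endpoint naturally aggregate into one dataset.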
I'd also like to recognize the people and organizations that have funded this work, for which I'm very grateful. I'd be happy to take questions, and certainly if you think of anything at a later date, feel free to reach out to me. Thank you.

Great, thank you so much, really interesting talk. We actually have a question from Carly, who is interested in knowing if you have any thoughts about how to increase the capacity of human behavioral data that can be collected. Carly, please feel free to unmute yourself if you'd like to elaborate or clarify that at all.

I'll mention that with a lot of these devices, there's nothing special about a mouse, so I think for a lot of the human research the short answer is absolutely, the whole workflow can apply to humans. I've talked with human researchers about two of these devices, and I'll say quickly what those were. There's a group at UNC that does human obesity work; we're a mouse obesity lab, so they're colleagues of mine, but they do imaging studies, and getting different types of manipulations and video-game-type tasks into their scanner can be challenging. So we worked with them to adapt the electronics of this device into something that they could use in their imaging work, and to be very open, so they can now tweak it however they want, as opposed to relying on a company. It also ends up being a lot cheaper, and I think that can be a real thing: if the price comes down ten-fold, it allows people to pilot or experiment with things when they may not have ten grand to drop on a company to try out a product.

Great, thank you. We have one more question, but I think I'm going to save it for the panel in just a bit, because it seems like it would be applicable to more than one speaker. Okay, so with that we will move on to our next speaker. Thank you, Lex. Our next speaker is Dr.
Marina Sirota. She is currently an associate professor at the Bakar Computational Health Sciences Institute at UCSF, and prior to that she worked as a senior research scientist at Pfizer, where she worked on developing precision medicine strategies in drug discovery. So with that, I will let you go ahead, Marina.

Excellent, thank you so much, and thank you so much for the invitation. Let me share my screen; let's see, does this work? And you can see my regular view, not the presenter view, is that right? Perfect. So, hi everybody, my name is Marina, and I'd like to tell you today a little bit about our efforts in data sharing in the context of pregnancy outcomes, and specifically preterm birth, so definitely shifting gears from the last two presentations, which were excellent. I always start by saying why I'm excited to be a computational biologist today, and some of the reasons are that there are tons of publicly available data that we can query and analyze in different and creative ways to ask new questions about disease, and that's what my lab is focused on. These are just some examples of those resources. Many of you are probably familiar with the Gene Expression Omnibus, which captures pretty much every transcriptomic study, microarray, RNA-seq, and now even single-cell RNA-seq, that we have available. There are clinical data and clinical trials data in databases like ImmPort. The cancer community has figured this out really well with The Cancer Genome Atlas, which is a phenomenal resource of over 10,000 cancer samples that have been profiled with a variety of different technologies, with the data all organized in a way that's very digestible. But nothing like this exists for pregnancy outcomes, or existed until we started working on it, and that's what I'll focus on. I also wanted to say that, in addition to the publicly available resources, people are starting to come up with creative ways to use technologies as these
technologies are getting cheaper and cheaper. So transcriptomics and genomics have been around for a while, but now we can look at the antibody repertoire, at epigenetics, the microbiome, and proteomics, and of course many of these we can now measure at the single-cell level as well, which adds a whole other level of complexity to the data. My lab is really interested, methodologically, in figuring out how to integrate these diverse molecular measurements, together with clinical data, which I'm not going to talk about much today, but these are the analytical questions that we're interested in. Again, what I wanted to talk about today is our work relating to pregnancy outcomes and specifically preterm birth. For those of you who don't know, preterm birth is defined as live birth before the 37th week of gestation. Worldwide, about 15 million babies are born premature every year, and about a million of these infants die within the first 28 days of life, and in many cases we don't know what the root cause was. Sometimes early delivery is medically indicated, for instance in situations like preeclampsia, but in many cases we don't know. There's been a lot of work trying to look into the various factors and mechanisms that might be associated with parturition, especially early parturition, and this is a figure from a very comprehensive article by Roberto Romero that basically allows us to see the complexity of this condition. There are many reasons why a woman might go into labor, including stress, the breakdown of the maternal and fetal immune balance, mechanical issues like cervical disease or a short cervix, vascular conditions, and of course infection, and all of these interplay in different ways, so it's really hard to understand what exactly might be the cause of preterm birth in a given clinical case. To try to mitigate this and study it further, we were asked by an organization called the March of Dimes to create a preterm birth data
repository. There are currently six March of Dimes transdisciplinary centers, all over the US, and there's actually one in England as well, and the idea is to bring researchers from different backgrounds together to work on this problem of preterm birth in creative ways. As a result, of course, everybody generates a ton of data, and the data are very diverse: there's transcriptomics, genomics, microbiome, proteomics, immune measurements, methylation, metabolomics, all of these different measures. The idea was to create a repository that would house these omics data, to enable new scientific questions, to enhance collaboration and coordination among the centers but also with the broader community, and to accelerate the pace of discovery. We launched this effort in 2017, so the database was launched in 2017, and we also had a paper in Scientific Data in 2018 describing the repository, so you can look at that. This is the link to the repository now; it was actually just updated, and this is a screenshot of the new version of the database, which is completely standalone. We have 31 studies capturing 14 different types of measurements, about 10,000 participants, and 20,000 experimental samples. We also track the number of downloads, so we can see whether people are using these data, and they seem to be over the last three years, since 2017. In terms of the types of data captured, there are a number of different measurements; the majority of the data is microbiome, but we also have RNA-seq and CyTOF. In terms of study design, these are the individual studies and the number of samples per subject: some of the designs involve longitudinal sampling and others are cross-sectional. This is what the new repository looks like. These are the various studies, and you can query and visualize them by assay type, by biosample type, so for instance blood versus placenta
versus something else, by which center they come from, and by condition; the majority of the samples are from preterm birth, but there are also preeclampsia and other pregnancy-related outcomes. There's a resource page as well. We have a clinical database using REDCap, and that one, unfortunately, because of limitations on sharing clinical data, cannot be shared with the public, but it's available to the individual centers. And actually, last year we were able to launch a DREAM Challenge leveraging the transcriptomics data from the repository, which I'll show in a second; this work is on bioRxiv currently. We had a number of participants, and it was actually a huge success, and we're currently working on a microbiome DREAM Challenge as well. We also have some software programs and resources that we share through this resource page. This is the screenshot of the bioRxiv paper; please check it out, and hopefully it'll be published soon, we're working on it. So, to summarize the database piece: we launched the site in 2017, and the statistics are 31 studies and about 20,000 samples of very diverse data. We're also working on aggregating publicly available data, and so far have captured 16 studies, which together add up to almost 25,000 samples, and as I mentioned before, the manuscript was published a few years ago. Okay, this is great, but really we want to utilize these data, and that's what my lab is focused on. We have really taken a multi-omics approach to preterm birth, and these are some examples of things we've worked on. We've worked on looking at the genetic and environmental factors that might be associated with preterm birth, and these are the very talented individuals in my group who carried out the studies. We did a microbiome analysis, work led by Evie Kosti, who's currently a postdoc in my group, also published earlier this year. We've done work on the transcriptomic side; I've mentioned the bioRxiv DREAM Challenge, but we've also done
some of the transcriptomics work ourselves; that was a project led by Bianca Vora, published in Frontiers in Immunology a few years ago. And currently, as I mentioned before, we're looking into electronic medical records to see whether we can build predictive models on the clinical data and actually transfer these models from one institution to another. For this project, which members of my team, including Brian, have been working on, the idea was to collaborate with Vanderbilt University: build predictive models at one institution to see if we can identify women at a higher chance of preterm birth, and test them at the other institution. But the story that I'd like to focus on for the last few minutes is actually a drug discovery story. We were interested in asking whether we can use transcriptomics data for computational drug discovery in the context of preterm birth. The pipeline was actually developed a number of years ago, as part of my graduate studies at Stanford. It uses a pattern-matching strategy to identify new uses for existing drugs, that is, drug repurposing. Let's say you have gene expression data from a certain disease, in our case preterm birth, and you have gene expression data from before and after treatment with a certain drug; we want to link the drugs and diseases using that gene expression as a proxy, and of course all the expression data on the disease comes from the repository that I was mentioning. The hypothesis is that if a gene expression profile in preterm birth is reversed by a drug, then maybe that drug could be used as a therapeutic, and we've had previous success applying this pipeline to a number of different conditions, a lot of cancers but also some autoimmune diseases. To identify the transcriptomic signature of preterm birth, we performed a meta-analysis of three studies, based on the Vora et al. manuscript from 2018, and we identified 210 genes that
are differentially expressed. At the top of this heat map, the individuals in light purple are women who delivered preterm and the individuals in dark purple are women who delivered at term, and you can see that there's a subcluster of individuals in the middle that's enriched for moms who delivered preterm, but the clustering is not perfect. When we looked at the pathways, we saw that a lot of immune pathways were dysregulated. Then, on the drug side, we leveraged a dataset called the Connectivity Map. It's a publicly available dataset that contains expression from cultured human cells before and after treatment; it's a genome-wide dataset, so about 22,000 genes and data on about 1,300 drugs, and 159 genes from our preterm birth signature were captured in this database. So then we go on to compute these reversal scores: given the disease signature, we query the drug data, and for each drug we come up with a gene reversal score. You can really think of the score as a correlation; we want to look for negative correlation, that is, the drugs that might reverse the gene expression signature of the disease, and we use a rank-based approach for this. As a result we got 83 total drug hits that were significant (adjusted p-value less than 0.05), including progesterone, which is the only FDA-approved compound for preterm birth. So then we go on to look specifically at the drugs that come from pregnancy categories A and B, the FDA categories that are safest in pregnancy. This is a visualization of those data, and in this network you can see that progesterone and lansoprazole, another compound, share a lot of targets, including CYP1B1, mutations in which have previously been associated with preterm birth. There are a number of other candidates here that are interesting, including metformin, folic acid, and clotrimazole; some of them have been investigated in animal models before, but we wanted to pick something new.
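The rank-based reversal score just described can be illustrated with a toy computation: rank the disease signature's fold-changes and the drug-induced changes over the same genes, then take the correlation of the ranks (a Spearman correlation); a strongly negative score means the drug tends to push the signature genes in the opposite direction. This is a generic sketch of the idea with made-up numbers, not the published pipeline.

```python
def rankdata(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied positions' ranks
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def reversal_score(disease_fc, drug_fc):
    """Spearman correlation between the disease signature (fold-changes
    of the shared genes) and the drug-induced changes. Strongly negative
    means the drug pushes those genes the opposite way."""
    rd, rg = rankdata(disease_fc), rankdata(drug_fc)
    n = len(rd)
    mean = (n + 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rd, rg))
    var_d = sum((a - mean) ** 2 for a in rd)
    var_g = sum((b - mean) ** 2 for b in rg)
    return cov / (var_d * var_g) ** 0.5

disease = [2.1, 1.4, 0.8, -1.2, -2.5]   # up/down in disease (toy values)
drug    = [-1.8, -1.1, -0.4, 0.9, 2.2]  # same genes after drug treatment
print(round(reversal_score(disease, drug), 2))  # -1.0: perfect reversal
```

In the real analysis, each score's significance is of course assessed and adjusted for multiple testing (the adjusted p < 0.05 cutoff mentioned above) before a drug counts as a hit.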
So we focused on lansoprazole. Why lansoprazole? It's available over the counter, it's very safe, and one thing that we know is that it's able to induce the stress-response enzyme heme oxygenase-1 (HO-1). Previous studies have shown that expression of HO-1 can reduce pregnancy loss, and we hypothesized that targeting this mechanism might have some efficacy in preventing recurrent pregnancy loss. So we started a pilot validation study in an animal model. This was an LPS-induced inflammation fetal wastage model, where the animal's pregnancy was confirmed at E0, LPS was administered at day 7.5, the drug treatment (both progesterone and lansoprazole) took place around that time, and at day 12.5 the viable fetuses were counted. We looked at three dosages of progesterone, as a positive control, and also lansoprazole. And here of course the caveat is that none of these animal models are that representative of actual human preterm birth, but nonetheless it's a start, so we hypothesized that prevention of pregnancy loss in this inflammation-based model might show some potential to prevent preterm birth. This is what the data look like with the model induced: we looked at LPS alone, LPS plus oil, and LPS plus DMSO, in comparison to healthy animals, and you can see a significant reduction in the number of viable fetuses across all three groups. There is, of course, a lot of biological variability; however, when we compare to progesterone, which is shown in blue, and lansoprazole, which is shown in purple, we can see an almost complete reversal back to a normal, consistent number of viable fetuses with lansoprazole treatment. This work was done in Dr.
David Stevenson's lab at Stanford University by Ron Wong and Sori Watani. We're very excited by this work, and we wanted to share it with you as an example of ways that publicly available data can be used to actually learn something new about disease therapeutics. So, as a summary, what we've done is build a database for preterm birth research, enabling data sharing, and then we used the data in the repository, applying computational approaches to identify new compounds effective for preterm birth; we validated lansoprazole, but we're also looking into other candidates and trying to understand lansoprazole's mechanism a little better. With that, I would like to thank my team, Gaia, Tomiko, and Brian: Tomiko is co-leading the database effort with me, Gaia has been doing the DREAM Challenge work, and Brian has been leading the drug repurposing; as well as all of our collaborators at Stanford, the group at Northrop Grumman that helps us with database development, the DREAM Challenge group, and the larger March of Dimes community, as well as the March of Dimes and all of my other funding sources. This is my team in February, when we all bunched into one elevator to go out to lunch, and of course now we look like this on Zoom, just like everybody else. So with that, I would like to thank you for your time, and I would love to take any questions.

Great, thank you so much, Marina. If anybody has a question for Marina, feel free to send it in chat, or you could also raise your hand and unmute yourself. In that case we'll move on to the panel, because we have some questions that can be... oh, sorry, we do have one.

Sorry, I couldn't raise my hand because I'm a host, and I couldn't type fast enough, but I do have a question. It's a great effort, maintaining such a big integrated database with all the different data sources, so I just wonder how much effort comes from your lab just in order to maintain the
database, and also to explain the data to others when they try to use it.

Absolutely, that's a great question. In terms of maintenance, it's me and a senior research scientist, Dr. Tomiko Oskotsky. The biggest hurdle, I would say, is actually data curation, putting the data into the repository, and I'm incredibly lucky to partner with the Northrop Grumman team, who have a lot of experience with data curation; they actually manage the ImmPort database, and they've been helping us with some of those efforts. So I would say the scientific efforts come from my team, and for the back-end database development and study curation we get help from the Northrop Grumman team, and we work very closely together. We identify the studies that are interesting and relevant to be incorporated into the repository, and then we work with the PIs as well as the Northrop Grumman team to make that happen; or, for instance, when we redesigned the database, a lot of those decisions were made by us together with Northrop Grumman. So we work very closely with the technical team. And then in terms of using the data, people are using it; they sometimes reach out to us with questions, but they also probably reach out to the PIs who generated the data.

Awesome, thank you.

Great, we have another question here, and the asker can please feel free to unmute to clarify, but the question is: how common is this kind of drug repurposing through gene expression?

Sorry, say that again, was that for me? Oh, can you hear me now? Sorry, I think I froze. The question is how common is this kind of drug repurposing through gene expression. It is actually fairly common. We developed the initial method for this in 2011, and since then it's been applied and used in a number of different diseases. There are also a number of startup companies that are doing maybe
not exactly this sort of gene reversal analysis, but something similar; a lot of people use network approaches. And I didn't have time to talk about this, but most recently we've actually applied this to COVID-19, and we'll be presenting that at the ASHG conference, in a plenary session actually, next week. So I think there are situations where you want to repurpose drugs and you need a good, consistent way to do that, and this is a way to generate those hypotheses. Of course there needs to be additional testing, both animal model testing and of course clinical trials, to show efficacy, but because these drugs are FDA-approved there is a little bit less of a burden in showing safety. I don't know if that answers the question.

Great, and we have one more question for you, and then we will move on to our panel. Irene, if you'd like to unmute yourself, I'll just let you go ahead and ask your question.

Sure, hi Marina, great to see you, that was great. When you put up the interaction network, you showed that progesterone and lansoprazole were interacting with a lot of the same proteins, but it seemed like from your results that lansoprazole actually worked better than progesterone, so I was wondering if you've thought about whether they're working through the same mechanism, and what differences there might be.

That's a great question. We haven't tested the mechanism, but despite the fact that they share some targets, they don't share all the targets, and the other point is that the reversal score for lansoprazole is considerably better than progesterone's. That's why we were excited to try it: it was something new that hadn't been tried in the context of preterm birth, but we also had at least some evidence that maybe it's doing something relevant, or maybe it's doing more. The mechanism is something that we really need to investigate, and one area that we're exploring now, actually in collaboration with Garry Nolan's group
and CyTOF techniques, is looking specifically at human cells and what these drugs do in relevant human cells. So let's say we have blood from a pregnant woman, and we look at the immune measurements before and after treatment with a compound, to really try to understand the mechanism beyond what we can see in the cancer cell lines that are the data we use for prediction. So absolutely, we definitely want to look into that; we just haven't gotten there yet.

Cool, thanks.

Okay, at this point I'm going to invite Lex and Saskia back to our virtual stage here for a panel conversation with Marina. We actually had a question that came in for Lex earlier, but it's possible our other speakers would like to share their thoughts on this as well. Eric, if you are here, you can just unmute yourself and ask.

That works perfectly fine, and thanks, everyone. My question, I guess, is really applicable to everyone, but for Lex specifically: how does the sharing of the data acquisition, the step you added at the end, in real time or not, really improve your science, or science in general? And I guess for the panel, if you could help make it clear just what the openness of your approaches really adds for you, that would be really appreciated.

Yeah, I'll go first on that one. I assume you're talking about posting things in real time, and how does that improve the science?

Yep.

So, when you're getting to think about a hundred mice in a group, when we do things with other methods, whether it's moving USB keys around or copying files, you start to run into a bottleneck pretty quickly; once you get to about a dozen, it gets annoying if you're doing anything manual. So I think pushing a hundred at once to a database is kind of a critical step for doing anything high-throughput without going crazy and having a lot of human error in moving data around. It also opens up a possibility, which we have not yet realized, but it
does open up the possibility of things like multi-site studies. So instead of me doing an experiment and saying, hey, here's what I think is a good approach, I might call you, and we could have the same equipment sending data into the same database, or ideally six or seven sites, which is something that's pretty routine in clinical studies, where they synchronize across sites, and almost non-existent in our rodent behavioral world. So I think some things are just conveniences right now, but then there are some more visionary things that you could imagine, or I would hope, come to pass in the future.

I'll chime in. I think the broader question that you're asking, Eric, is really important and really interesting: how does making data open affect the science that we're doing? In many ways, I think it largely doesn't. If we're being asked by publishers to share our data, and so we share our data, great, but it doesn't necessarily change anything about the way we do science or what comes out of it. So I think this is one of the questions that we as a scientific community should think a little bit more about, because otherwise we're just creating a lot of big data files that are taking up a lot of space to maintain and to store somewhere. So what is that big advantage? One of the things that we've talked about as a community is the reproducibility crisis. I'm not even convinced open data really solves that, but I think it could. I think we could have some cases where, by sharing the data, we come to understand more about, for instance, quality control, and maybe people that are trying to use my data and run into particular issues might bring to light some problems in how we've processed the data or how we've collected the data, which then allows us to improve on that iteratively, or better understand some of these processes. So I
do think that there is some benefit and space for that, but I think until data reuse becomes a norm, it's not going to have a big impact on the science; it's just going to be a way to validate that we've done things by the rules. So I think there does need to be a bit more of a cultural shift, at least in the data modalities that I work in, in terms of what it really means to share that data, and for that to really have an impact on the science.

I'll say one more thing. Something we often experience is that there's a long lag between acquiring the data and getting to the part where you've made a conclusion, and by the time you have your answer, so much time has passed that often the experiment can no longer be changed; you have to do a new experiment. So if we could shorten that lag, to where certain metrics can be pulled out even the same day as the experiment, or right away, it could allow things to turn around a little more quickly and speed up doing new things. For us with the mice, the ideal is that you don't have to then start with a whole new group of mice: maybe you realize, okay, we have one more experiment to do, and you realize it this afternoon, as opposed to two weeks or two months later, when you're back to investing new resources.

Great. And Marina, do you have anything you'd like to add to that?

I think the other panelists have covered it pretty well.

Okay, great. Are there any other questions? You can send me a question in the chat or also just raise your hand; I'm just going to give this a minute to see what comes through. Okay, we do have a question here from Alex. Alex, please go ahead and unmute yourself.

Hi, I'm sorry, my husband is also in a conference behind me, so he may generate a little extra noise. My question was about
the videos. I guess this is sort of overlapping with the last question, which was about reuse potential. I'm curious because you mentioned other researchers maybe are downloading your videos and using them, but it seems like that research is really specific, so as a librarian I'm wondering what the reuse case is there, because it's something that I'm asked fairly often when I advocate for data reuse, specifically because I'm interested in microscopy and things like that; people ask, is anybody really going to reuse other people's microscopy images? And then the other question I had, which is related, is for data that's non-tabular, where it's actually photos or videos, that's potentially copyrightable, and I'm wondering if you have dealt with licensing issues at all around that fact. In fact, it's not just potentially copyrightable; it's automatically under copyright, in the U.S. at least.

I can speak a little bit for our data. The reuse case for the movies, for us, is that the methods for identifying cells and extracting their activity, this image-processing stage, are an area of active research. When we started our pipeline, there were no available methods for that, and if you looked at the literature, the way people were doing it was manually: a graduate student would look at the movie and use ImageJ to draw some sort of polygon around putative neurons. There are now a few software packages that have come out, but there is no single standard in the field; the different packages will yield somewhat different results, so it's an area of active research that a lot of people are still working on, and probably will be for a while. So, you know, we maybe don't need 1,500 hours of movies available for people to continue working on
that, but having some movies available is definitely valuable for that effort, so that people can benchmark the different methods on a single dataset, for instance, or compare them to their own datasets. Every microscope is a little bit different, so seeing how robust a method is to different imaging pathways and the like is useful. That's the big use case for our movies that we know about, and we know several people who are using them in that way.

The question about copyright, I honestly don't know. We have a legal team that deals with the licensing, and again, this is where being at a place where this is part of our core mission makes it a little easier for me as a researcher: I just forward anything to the legal team and they deal with it. But that's a good point to bring up.

Anyone else have any thoughts on that? We have time for probably another question, if anybody would like to raise a hand or put a question in the chat.

Could I ask a quick one? Doing everything in house allows you to do a lot; it streamlines a lot of things. Do you have any plans to allow external people to contribute to these repositories?

We don't. I think it's really a great idea, but in many ways it doesn't need to go through somebody; it doesn't have to go through, say, the Allen Institute, as there are new repositories for this type of data that are essentially serving that purpose. The challenge, I think, is the quality control step: when people are generating data, for it to be usable you either need to know that it meets some standard of quality control, or have quality control metrics
embedded in the data, included with the data, so that users are able to assess it themselves. That's an important thing to consider alongside metadata and documentation: agreeing on some standards for QC for a given data modality. But we don't currently have any plans for that ourselves. I think it's a great area, and as more repositories develop (for instance, DANDI is an NIH-supported repository for neurophysiology data), I think there are opportunities to create a more robust, centralized dataset like that.

Cool, thanks. And I think I saw another question. Brian, do you have a question? If so, you can unmute yourself.

Sure, yeah. Lex was basically going to ask my question: wouldn't it make just as much sense to coordinate these massive cohorts, whatever your experiment is, across labs, as opposed to trying to have one lab do everything?

Yeah, I think that's sort of the vision, and it'd be great to get to a point where you could imagine a group of people deciding, this is a good experiment to be done, and we're going to all pool resources and generate a really high-quality dataset, kind of like what the Allen Institute is doing in various ways. No single academic lab could do that, but maybe ten of them together could. I think that would be great; I think we need a lot more infrastructure to get there. So I imagine we keep working on these types of things and keep pushing them, and somewhere along the line I'm hoping there's a point where we can say, hey, enough people are posting data in real time that we can coordinate: what if we all sent it into one database and came up with an experiment that we all agree is an important step, done in a distributed way? So I feel like these are the baby
steps to getting us there.

Awesome, thank you. And I believe Wajin had a question.

Yes, thank you. I guess now I'm wearing my librarian's hat in asking this question. A while ago I worked with some librarians to come up with a data curation primer for microscopy data, to teach the data curators in the repositories what to look for when you take in this data, what the minimum requirements are, and what the best practices are. So my question is, first, do you think this type of effort is valuable for your specific data? And also, what more can we do?

I can say that curation efforts are incredibly important for the data sharing efforts that I'm leading, because if the data is not well curated, other people will not be able to use it very well. There's actually a lot of time being put in, in my case with a Northrop Grumman group, curating our datasets.

I guess I want to add a little more to my question: often librarians are not necessarily familiar with the data you're working with, right? So how do we do better?

That's a good question. Understanding the assumptions of the data, and working very closely with the PIs, or whoever on a team is managing the data, to get it into a format that is usable, is something we spend quite a bit of time on. I don't know if that answers your question, but for sure, I think it's critical.

Yeah, I agree, it's absolutely critical, and it's probably a place where there needs to be a tight partnership between the scientists and, in your case, librarians, or in other cases software engineers. There definitely needs to be a tight collaboration, at least in establishing that process; maybe once it's established it can continue on. But I think that's one of the places we put a lot of energy into that gets overlooked: that
curation step.

Okay, well, does anybody have anything else to add to that? Okay, great. We're just about out of time anyway, but thank you so much, Saskia, Lex, and Marina, for sharing your work with data sharing and these large datasets. It's really fascinating.
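As a footnote to the benchmarking use case discussed above (comparing the outputs of different cell-segmentation packages on a shared movie), one common way to compare two methods is to match their detected ROIs by intersection-over-union. The sketch below assumes each method's output is a list of boolean 2-D masks, one per putative cell; the function names and the greedy matching strategy are illustrative, not the API of any actual package.

```python
import numpy as np

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection-over-union of two boolean ROI masks of the same shape."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)

def match_rois(masks_a, masks_b, threshold=0.5):
    """Greedily pair ROIs from two segmentation methods.

    masks_a, masks_b: lists of boolean 2-D arrays (one per putative cell).
    Returns a list of (index_a, index_b, iou) for pairs whose IoU
    exceeds the threshold; each ROI in masks_b is used at most once.
    """
    matches = []
    used_b = set()
    for i, ma in enumerate(masks_a):
        best_j, best_iou = None, threshold
        for j, mb in enumerate(masks_b):
            if j in used_b:
                continue
            iou = mask_iou(ma, mb)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            used_b.add(best_j)
            matches.append((i, best_j, best_iou))
    return matches
```

Matched pairs can then be counted to estimate agreement between two packages on the same dataset, which is one way the shared movies get reused for benchmarking.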