So I guess we should get started with our last session of the day. It's the last session but not the last activity of the day, because we still have the reception and the poster and demo presentations, so stick around. Our first speaker of the session is Marcel Bruchez, a professor of Chemistry and Biological Sciences at Carnegie Mellon University, and he's also the director of the Molecular Biosensor and Imaging Center. The name of the talk is "Sharing physical resources, methods, and data-optimized research."

You know, it's science. We're all about revisions, so the title is already revised. So I'm going to talk about something a little different. I am, you know, new to the data generation game. My lab is a resource generation lab, and so we actually have to share something more than electrons or magnetic bits. We actually have to share physical resources and come up with ways to do that, and over the past few years that's been dramatically transformed, in ways that are really powerful. We heard Lenny mention some things like Addgene, so I'll say a little more about that and give you a sense of how we've gone about really trying to get the tools that we develop outside of the lab. Let's see if this works. Nope. All right. So our lab develops tools, and tools are not an end in themselves, right? It's great to write a paper that says, look at this great tool.
We developed it, and look, we can do something useful with it. But the point of making a hammer is that it's an enabling device that allows people with the right knowledge and information and resources to do something incredibly productive, like build a house. You don't just give somebody a hammer and say, go build a house. The same thing is true in science: you can't just create a tool and stick it on a website somewhere and say, okay, tool's done, go ahead and use it, right? It requires know-how and training and support. In particular, for the tools we develop: we're a sort of chemical biology lab, so the work that we do sits at the interface of genetic and protein-based reagents, chemical reagents, cell lines and model organisms, devices and ways of using those devices to generate important biomedical knowledge, and, out the other end of that, data. So this is more than just a hammer, and more than just looking for nails. It's important that the knowledge of how these tools work together, and the availability of the appropriate tools, is maintained, so that users can keep using the toolkits we develop with taxpayer money on an ongoing basis. And so how does this get built sustainably? The traditional answer was that it gets commercialized: somebody starts a company. Take Molecular Probes, a company that existed before it was acquired by Invitrogen and then, you know, gobbled up by Life Technologies in a sort of aggregation of tools companies in the life sciences. It used to be that scientists would publish reagents in chemical journals, and Dick Haugland, the founder of Molecular Probes, would read those journals, and his company in Oregon would go into the lab, synthesize these compounds, and stick them in their catalog, no matter how big the market was for them. But you can imagine that what works for a startup company using this sort of, well, talk about data parasites.
This is a physical resource parasite model. However that gets done, eventually you get acquired by a larger company, and they start to look critically at compounds that are only sold to one lab. And so when Molecular Probes got acquired by Invitrogen around 2003, they started to scale back their catalog, and all of these resources that used to be readily available to the labs that wanted them were suddenly out of the catalog, because they weren't commercially viable. They weren't commercially attractive as stand-alone products. So there needs to be a resource to maintain these so that they stay available. And, you know, how do you provide these tools to people? You don't just throw them in a box and hand over the toolbox. IKEA worked this out for us: you can flat-pack a bunch of stuff and stick an Allen wrench in it, but the key to the IKEA use model is that the instructions can be followed by amateur construction hobbyists like me or my 13-year-old daughter. And so it's really important to provide users a roadmap. Yeah, they could get that through the primary scientific literature, but also by just making sure that all of the components that you develop and the protocols that you share are well explained, revised as they need to be, and quality controlled. So a little bit of a product development mindset is actually good, even if you're not commercializing the product. And so, Lenny mentioned Addgene. Addgene has been revolutionary for sharing biological DNA resources. We've deposited 24 of the plasmids that we developed in the lab at Addgene, and I would say that depositing the plasmids at Addgene really adds a lot of value to the resources that we've developed in the lab, because they sequence-verify what we give them. They document it consistently.
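As a toy illustration of what that sequence verification involves (a hypothetical sketch, not Addgene's actual pipeline), the core of flagging deviations is just comparing a deposited sequence against its reference and reporting mismatch positions for a human to judge:

```python
def flag_variants(reference: str, deposited: str):
    """Compare a deposited plasmid sequence against its reference and
    return a list of (position, ref_base, observed_base) mismatches.
    Assumes the two sequences are already aligned and of equal length."""
    if len(reference) != len(deposited):
        raise ValueError("sequences must be pre-aligned to equal length")
    return [(i, r, d)
            for i, (r, d) in enumerate(zip(reference, deposited))
            if r != d]

# A single substitution at position 3 gets flagged for review:
print(flag_variants("ATGCGT", "ATGAGT"))  # [(3, 'C', 'A')]
```

Each flagged position would then be classified by the depositor as either a low-consequence variation or something that matters for function, exactly as described in the talk.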
They build these plasmid maps using, sort of, the same mining software to figure out what components are in the plasmids that we give them, in addition to the annotations that we provide. It forces us to document what we've put in those plasmids. Deviations from the reference sequence that we provide them are flagged, and they double-check those to make sure that we, the developers, view them as low-consequence variations. Do they exist in some region of the plasmid that isn't significant for its function, or did they show up someplace where it matters? In which case we go back to the beginning and send them an updated plasmid that's corrected. So that makes sure that it's not just junk out of our freezer that we're shipping out to people; what we think we are sending is actually very well validated and tested. They retain the stock and distribute it under the MTAs that are required for some of the components that are integrated in there. That saves us a lot of effort. They track the users, and to some extent the uses are cited, right? So when people use plasmids from Addgene, they get cited as Addgene plasmid number 78504, and those are sources we can then go and look at to see who's using our tools and what they're doing with them. And it costs the lab nothing. I would say it's actually net profit to the lab. It doesn't give money back to the lab; it just doesn't tie up lab resources providing these things and supporting or troubleshooting the junk that you send out, and I'll show a picture of the alternative in a second. And then they also give you some peer pressure and gamification, because there's this coveted Blue Flame award; my junior colleagues got there before I did, but we're still working on that. We've shared 140 plasmids with researchers around the world through Addgene, and if you think about it, from the lab this would have represented essentially months of effort to maintain and
quality control 24 different pieces of DNA in bacteria, to accurately put those into a tube, to send them to the users, and to make sure that the users received them and that they do the right thing. So this is why I say it may ultimately be net profit for the lab, because those are resources that Addgene is spending on our behalf, sharing the research that we have developed and the things that we've tested, validated, and published on. And this is the alternative, right? Shelves of random boxes that pile up because somebody goes in and pulls something out, and you never know where the inventory is; or you keep a dedicated freezer that's got all these things in it. But it's definitely more effective and efficient to have Addgene manage these resources. Using Addgene we have distributed plasmids all over the world, including lots of places where these researchers probably might not have reached out to us by email saying, please send me this plasmid. But it is painless for them to go to the Addgene website and say, I want that plasmid. So the other thing that is a challenge for chemical biology, and for some tools in biological research, is sharing non-renewable reagents. The beauty of DNA is that DNA can be sustained and replicated. It's sent out as bacterial strains, and when the bacterial strains run low, you can just grow up some more and freeze them down. I can't do that, sadly, with dye molecules. I can't grow up some more dye and freeze it down.
I have to send a chemist into the lab to make more dye. So it's not a renewable resource like plasmid DNA or bacterial clones. But the good news is that chemical synthesis operates at vast scale relative to biological experiments, so while we might make grams of something in the lab, that's enough for tens of thousands of shared dye samples to go out to people. We are then the ones responsible for the quality control and quality assurance of that, and because it's critical that the reagent be the proper reagent for the experiment to function, we usually prepare extra and share these as requested. This goes through materials transfer agreements between Carnegie Mellon and the investigators that have requested dyes from us. And you see, not everyone that's requested a plasmid has requested dye, and that's probably because some of the plasmids don't require the dyes that we've developed, and some people are comfortable synthesizing their own dyes. They may call us and say, I tried this experiment and it didn't work, and then we start to troubleshoot and realize: oh, well, what dye did you use? Because we've never sent you dye. And so we sort that out. But this is what our inventory looks like in the lab: we have a dedicated cabinet that we keep boxes of dyes in, those are packaged in individual aliquots, and we have labels that are informative and carry the critical information to share with people. The key here is that an MTA requires a discussion with the investigator, and that is actually effort that is well spent, because it helps you understand what people are doing with your technology. Some people don't want to discuss their projects because of potential competition; that's something that's true for both data sharing and technology sharing. The other side of that is that you identify opportunities to avoid negative results due to technical mismatches with expectations, right?
Sometimes people get a certain idea in their mind that's just wrong, and maybe you wrote it that way in your paper by accident, but it's good to head that off before they invest a lot of time and energy in it. You identify opportunities to extend and improve the technology and its performance, and it builds relationships in the community. The other thing is sharing the protocols, and a shout-out here to Lenny: we use protocols.io. But we started out by building protocols into our website. Whenever we published, we built cards that were interactive and linked to the original papers, but also had a very condensed protocol, just so people could see: how do you do it? What do you need to do? Writing review articles can be really useful when you have a set of tools, like a big toolbox, that works for a class of problems; it's sort of like a code repository or a toolkit. We write review articles to summarize years of development and incremental publication in a field. And what about specialized equipment? One of the technologies we developed requires kind of specialized illumination, so we built a box to do this in the lab, and we refined it; as things go in the lab, you just keep getting better. But ultimately we found that a lot of people wanted to use the technology but were nervous about building up the electronics in their own lab. It's about a hundred dollars per part, so we built five of these that we just share on a loaner basis. We send one to people while they're getting going with the experiment, we give them the parts list so they can order all the parts, they can basically copy the design with the test system sitting right there, they can cross-validate, and then they send us back the box and we send it on to the next user. So this is the evolution of the instruments: this is what I built, this is what my graduate student built.
This is what the machine shop at Pitt built, and this is now what we share. It's basically this, but without all the wires and tape. So this is an easy thing to share, and it has dramatically improved people's ability to get these experiments working. So we share kind of everything along the way. All right, and then sharing data. The old way, I would have had a bunch of hard drives to hold up here: the sneakernet. You go plug into the microscope, you download your data, and you wander around with it until you find a computer you can load it onto. Or you can move stuff around on Box. But we found that these methods were unreliable beyond gigabytes. So this is work that we've been doing with the Pittsburgh Supercomputing Center, developing a Brain Image Library with support from the BRAIN Initiative, and it allows us to do things like use Globus for transferring data, which is reliable and happens in the background, so you don't pull your hair out. We'll hear a lot more about algorithm sharing and data sharing and the key issues there. But this is what modern data sets look like. You go from specimens (this is around four terabytes of data), to higher-resolution views of that (this is 400 gigabytes), to single-cell data (I can't read it way over there, whatever it is, 40 gigabytes), down to single-cell and subcellular morphology at around 400 megabytes, right? So, sharing these data sets: you don't want to always be downloading four terabytes of data, so we need fast tools for multi-scale exploration with low network latency, so you're not just spending all your time downloading a four-terabyte file that's got nothing useful for you.
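One common way to support that kind of multi-scale exploration (this is a minimal NumPy sketch of the general idea, not the Brain Image Library's actual implementation) is to precompute a resolution pyramid, so a viewer can fetch a coarse overview first and only pull full-resolution data where the user zooms in:

```python
import numpy as np

def build_pyramid(image: np.ndarray, levels: int):
    """Return a list of progressively 2x-downsampled copies of a 2D image.
    Level 0 is full resolution; each later level averages 2x2 blocks."""
    pyramid = [image]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        h, w = prev.shape[0] // 2 * 2, prev.shape[1] // 2 * 2  # trim odd edges
        blocks = prev[:h, :w].reshape(h // 2, 2, w // 2, 2)
        pyramid.append(blocks.mean(axis=(1, 3)))
    return pyramid

# A 1024x1024 image shrinks to 256x256 two levels up:
levels = build_pyramid(np.random.rand(1024, 1024), 3)
print([p.shape for p in levels])  # [(1024, 1024), (512, 512), (256, 256)]
```

Formats like JPEG 2000, mentioned below for the Brain Image Library, build this kind of multi-resolution access directly into the file format.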
So we need to be able to discover the data in these larger datasets. And so, with the support of the BRAIN Initiative, we have a Brain Image Library that's got almost a thousand brain-scale data sets, all confocal microscopy data sets, from a variety of investigators in a variety of NIH-funded data collection initiatives, and there's a range of methods and initiatives. We have metadata policies, DOI assignment, RRID assignment, and common data formats; these are pre-processed and converted into JPEG 2000 format so that all the data can be accessed the same way. So we need better ways for users to touch the data without downloading the whole data set, and that's one of the key areas we're working on here. All right, so I'm wrapping up. It takes a team commitment to get this stuff working. We have a team of scientists working on depositing plasmid DNA, synthesizing and keeping the dyes, doing biochemical characterization of the things that we share, building light boxes and sending them out, and developing the protocols and putting them up in a credible way. As someone mentioned, it is usually not the PI's hands typing the protocol, because I would be wrong. And then at the PSC we have a number of engineers and computational folks really working on the implementation and working with the community to bring these data sets in, so that we have this large centralized data repository that people can come to. Thank you.

Thank you, Marcel. Any questions? So, actually, I have a question. It's fascinating that you're picturing the whole workflow of sharing resources in biomedical research, not only the data but also the reagents, right? So I was just wondering about these physical resources. Do you treat them as data? For example, how do you maintain the versioning of the plasmids? So the plasmids, once you've deposited them at Addgene, right?
They're there. They're associated with the original publication, and so I would be very hesitant to version that, because sharing the resource that you published is a good kind of line in the sand. And then as we develop stuff, you can embargo it until publication, or you can just put it there and release it. And I think the power of bioRxiv for some of these, and maybe PREreview and bioRxiv could be a powerful combination, is for when you make incremental changes in tools, little bits of improvement. One of the gaps in science is that, you know, it's not high enough impact for any journal, right? And sometimes even PLOS ONE is like, oh, this is just boring work. It's PLOS ONE; it's supposed to be a place to publish stuff. And so I think if you put some of that stuff on bioRxiv and you can get enough pre-review kinds of comments, people sort of saying, yeah, this is credible, then you can have those incremental improvements of your plasmids, and then you can focus on the science that you do with the improved tools, instead of focusing on the validation of the improved tool as its own standalone story. Any more questions?

Our next speaker is Dan Allan. He's a scientific software developer at Brookhaven National Lab, at the National Synchrotron Light Source II. The title of his talk is "Sharing results and environments with Jupyter."

Great, thanks so much for having me. I'm just having the best day today. It really feels fun to be among friends here who are all fighting similar battles. As she said, I come from Brookhaven. Brookhaven held a conference called Open Source, Open Science on October 2nd, 1999. You can still find the announcement on the Internet Archive Wayback Machine. I found out about this from Greg Wilson, whose name you might have heard because he founded the Carpentries, formerly known as Software Carpentry and Data Carpentry, right?
And Greg thinks that might have been the first conference with the words "open science" in the title, which, if true, is a nice legacy for Brookhaven. I have no proof of that, but we may have a nice tradition here. So specifically, I work at this giant x-ray synchrotron. It's a Department of Energy-run user facility. Around that doughnut-shaped building you should imagine about 30 very small research groups, each of which has a user program. So people in academia or industry will be granted time to come in, usually for about a week, usually but not always in person, and take data, usually day and night, and hopefully go home with a publishable result. And the range of science that gets done is quite broad. We've got biology, chemistry, physics, and very many different techniques at different scales, everything from huge amounts of data, terabytes in a day, to small-batch artisanal data, as they like to call it. And so the role of my group, this sort of scientific software group, is to connect these people with the open source ecosystem. And one major way that they interact with it, the portal through which they tend to come to it, is Jupyter. I used Jupyter as a PhD student. I've heard a lot of people mention it; just to get a sense of the room, how many people have used a Jupyter notebook? Okay, quite a few. So I was a big fan in grad school, and now my role is to help other people come to it. As you may know, Jupyter is built on the backs of a very broad community. I'm an extremely minor contributor to a couple of the Jupyter projects. These are the faces of people who spend a high fraction of their time working on it.
It brings together everyone from IT professionals in industry, to university IT folks, to graduate students like me procrastinating on their thesis, everybody building this stuff together, and it's a really wonderfully inclusive and open community, like the scientific Python community from which it grew. Fundamentally, Jupyter notebooks are for telling a story around a computation. You're interspersing narrative-rich content with the code and the results of the code, which may include visualizations or some interactivity. This allows you to communicate your ideas much more clearly than you could with just a script, and it allows you, I think even if you're just working by yourself, to keep your mind on the science and the fundamental problem you're thinking about, and avoid getting lost in the code. There are a whole lot of pieces and components to Jupyter, but this document interspersing rich media and code and results is really the heart of the idea. I wanted to highlight some ways that this can be really nice to look at and to use. Jupyter provides these rich visual representations of code and data objects as you're working. These don't have to come from the Jupyter team or from Jupyter itself; these are things that external libraries, maybe specific to certain science domains, can link into and plug into. So here's an example of a tool that came from the climate science community, and it's representing a labeled data set with a potentially high number of dimensions, and it's a fairly nice way to look at it compared to terminal text, for example. Here's a cube of data, potentially a very high-dimensional cube of data, which could be spread across multiple machines, and here's a nice little cartoon trying to make it easy to see what the dimensions of it are and what its byte size might be, to give you some intuition of how long it would take to do anything with it. Another one: just an equation. Write out some math and get a nice visual representation of that.
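The hook those external libraries use is Jupyter's rich display protocol: an object can define methods like `_repr_html_`, and the notebook renders the returned HTML instead of the plain-text `repr`. A minimal sketch (the `Measurement` class here is hypothetical, purely for illustration):

```python
class Measurement:
    """A toy data object that renders as an HTML table in Jupyter."""
    def __init__(self, name, value, units):
        self.name, self.value, self.units = name, value, units

    def __repr__(self):
        # Fallback representation for plain terminals.
        return f"Measurement({self.name}={self.value} {self.units})"

    def _repr_html_(self):
        # Jupyter calls this automatically when displaying the object.
        return (f"<table><tr><th>{self.name}</th>"
                f"<td>{self.value} {self.units}</td></tr></table>")

m = Measurement("exposure", 0.5, "s")
print(m._repr_html_())
```

Libraries like xarray and Dask implement exactly this kind of method to produce the labeled-dataset tables and chunked-array cartoons mentioned above.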
So it's a nice way to work as we're building up these documents. Of course, with any story, the most important thing is who's listening, who the audience is, and some of these ideas have already come up today. If it's just you for this week, that's a pretty common case. Another common case is that the audience is future you, when the referee asks you to make the legend a little bit bigger and you're trying to get your old matplotlib code to work to regenerate your figure. That's a pretty common audience to take care of. Maybe you're sharing some results with your advisor, or your research group, or a collaboration of people who might be able to reach you, but whom you may not know well enough to explain what you're doing. Or people you've never met: you don't know who's going to be looking at it and trying to build on it. One way to structure that is as a phase space of how far the story is going to go, how far the code will go. Is it just me, today? Or is it many people, for a long time? And as you place your work in that space, you can think about how much effort it's worth putting into making it reproducible, right? If I'm just going to throw this out, I might not even check that it works, or I might save it and close it and know that I can fix it later. Maybe I don't even need the code to run. But if this is something that I need to come back to in the far future, or that many people will see, then at the end of the day, if I don't put some effort into polishing it, somebody's not going to bother reusing it. Even future me won't reuse it. I'll decide, you know,
I'll decide you know It's easier for me to just start over So a big part of being able to to build on each other's work is making something that's actually Practically reproducible that someone's actually going to take the trouble to put the pieces back together and make it work for them So if we measure practical reproducibility, I guess we could think about that as The effort required to to send it that is to package something up in a reproducible form And then the effort from the the recipient even if that's future you to to run that thing and make it work So In the bottom left, maybe I just email someone a script email someone a jupyter notebook. That's real easy for me I don't have to think about it Maybe I don't even check that it runs and if i'm very lucky the recipient will open that up And it'll just work for them and everything's great But in the much more realistic case They open it up and they don't have the right software installed on their computer Or they don't have gpu's and the code needs gpu's or something like this, right? 
Usually it takes a little bit more effort on the sender's part to specify the environment: all the programs that are installed, the context in which that code or that result can be reproduced. The focus of this talk is a tool that came from the Jupyter community called Binder, which aims to reduce the effort required on the part of the person who's reproducing the code. So as the sender, I still need to package something up somewhat neatly, and that takes some effort from me, but it takes very little effort, just one click, to get access to an interactive environment where you can work with someone else's result. So what is Binder? It's built with 100% open source technology, so anyone can fork it on GitHub, customize it to their heart's content, and deploy it themselves. It uses modern cloud orchestration technology. I don't know how many people here have heard the word Kubernetes, but it's an open standard that runs on top of AWS and Google Cloud and Microsoft Azure. It lets you avoid getting locked into a specific cloud vendor. This can be really important for open and equal access.
It's possible to run Kubernetes on in-house hardware as well, so you could avoid cloud providers entirely, and by hitching itself to that, Binder has made sure that this tool is really and truly open. It's using Jupyter, of course, and the open science ecosystem to provide the actual data analysis code, useful for researchers, educators, and scientists. So, to talk about the pieces of Binder, the actual software projects you can find on GitHub: there's a tool, repo2docker, for taking a GitHub repository with your data analysis scripts and turning it into a Docker container. Then there's JupyterHub, a tool for making those containers available in the web browser, which is really where the ease of use comes in. There's a nice UI for building these things, so you don't necessarily have to know the word Docker or what all these pieces are; you can just use it. And then there's mybinder.org, which is a free, sort of demo deployment of this. Anyone can run their own Binder, but for most people, at least to try it, by far the best option is to just use mybinder.org. I'll talk more about how that's funded and how it all works; that's the sort of canonical deployment maintained by the Jupyter team. So, from the recipient's perspective: there's, say, some blog post, and it mentions some reproducible result with a link. All I do as the recipient is click the link. I'm sent to this Binder page, I wait patiently in expectation, and then I'm bounced to a Jupyter notebook. This is not a static copy of the notebook; it's fully runnable. Code is being executed, not on the user's laptop, but somewhere on some cloud machine, and they can play with the results.
This is a toy example here, with the Lorenz equations and interactive sliders. Probably the most popular Binder to date is the LIGO result, the gravitational waves paper from a couple of years back. They made a Binder out of that, so people could reproduce the plots that were actually in the paper and see what happens if you change some of the parameters and play around. So the concept, I guess, is that by wrapping your notebooks up in a Binder, you make it very easy for people to get started and play around, landing in a software environment with everything installed and ready to go. So how do you do that? From the sender's perspective, you need to put your notebooks on some public Git repository. It doesn't have to be GitHub specifically; that's just a popular choice. Then you need to write down what software needs to be installed. That could be a text file listing Python libraries, or many other things, as I'll show. And then you visit mybinder.org, or any Binder deployment, and enter the address of your GitHub repository. You click the launch button, you wait while it transforms your GitHub repository into a container, and then you copy the link that it gives you back, and that's what you give to anyone who wants to reproduce your results. Megan pointed me to a really nice cartoon animation of what this is like, which I just tweeted on the conference hashtag right before this session. So if you look in the history of the hashtag, you can find a much cuter version of this that I wish I'd known about before I made my slides. So a really important thing about Binder is that it's reinforcing existing best practices.
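Concretely, the minimal sender-side recipe looks something like the following (the repository name here is made up for illustration): a public repo containing the notebook, plus a plain `requirements.txt` naming the Python libraries it needs.

```text
my-analysis/               # public repo, e.g. github.com/<user>/my-analysis
├── analysis.ipynb         # the notebook telling the story
├── requirements.txt       # one pip-installable package per line, e.g.:
│                          #   numpy
│                          #   matplotlib
└── data/small_input.csv   # small data can live in the repo itself
```

Entering that repository address at mybinder.org then yields a shareable launch link of the form `https://mybinder.org/v2/gh/<user>/my-analysis/HEAD`.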
There is no Binder config file. It's just looking for files that communities have already identified as useful ways to encode reproducible work, and building this really easy-to-use experience around them. So, two important consequences of that: first, there's nothing new to learn; Binder is just guiding people, scientists and researchers, toward existing best practices, and in a sense rewarding good behavior. And second, if Binder goes away, as many technologies do, or is replaced by something else, its legacy will be all of these GitHub repositories that have a self-contained story and everything you need to get started. So nothing is specific to this particular technology. And if you're interested, you can do any web-app-based application this way; it's not restricted to Jupyter notebooks. Here's a probably inscrutable list of all the different configuration files they support so far. Their ongoing effort is to go to scientific communities, understand what their reproducible best practices are, and try to enable them with Binder. So you can see R is mentioned here, Julia is mentioned here, Python is obviously a big part of what they do, and more are on the way. So, I mentioned mybinder.org as sort of the canonical Jupyter-supported deployment of this. They run it as a radically open service, so you can see their costs, how much it costs per month to run the Binder. I believe the cost boils down to one to two cents per user.
So it is pretty cost-effective to do, and they're working on federating it. They have a second contributor now and are working on creating more, so that when you go to mybinder.org and start a reproducible container, it might be on hardware that's paid for by the Jupyter team, or it might be on hardware paid for by some other member of the federation. By pooling resources in this way, organizations that want to support the work of Binder can get on board and support it financially by taking some fraction of the overall traffic. As I said earlier, it's built to be easily deployable and not tied to a specific cloud provider, so you can run your own Binder if you want, and the documentation on that is quite good. At Brookhaven National Lab we've remixed a deployment of the pieces of Binder for our own needs. We have large amounts of data that we don't want to pay to upload to the cloud, so we run the components of Binder locally, where they can access our data right there. Somebody else mentioned Singularity earlier; it's like a secure version of Docker that we use inside the DOE. And we want to give our users a gallery of Binder-compatible repositories of examples, so when they come in for that week and they're focused on their science and their data, they've got a nice gallery of notebooks that shows them some basic workflows for analyzing and acquiring their data, hopefully distracting them as little as possible with all the tools around that. So, last slide: you should check out mybinder.org. There's a GitHub organization full of example repositories that you can try. The chat is very friendly and responsive, so if you have any questions at all, you should come hang out and meet the people there. And if you're feeling ambitious, at the bottom, you can even try to deploy your own Binder. With that, I'm happy to take questions. Thanks.

Any questions? So, how does the data that the Jupyter notebook uses enter the picture?
Is it part of the GitHub repo that you link Binder to, or how does the data that you're using for the analysis get in? Right, that's a really good question. For really large data sets, you shouldn't keep that under version control in Git, as you may know, and the usual way to handle it is to have a step that pulls the data down, like at the top of the notebook. But because Binder is a free service, it doesn't have a ton of associated storage for data. So if you wanted access to really large data sets that were already preloaded, the solution is probably to run your own BinderHub. An example would be Pangeo, which does Binder for large climate data; they have their own Binder and their own very large data that it has fast access to. Thank you.

Yeah, actually I have a clarifying question on this. Do you have ways to limit how much data people upload onto the BinderHub? I don't know exactly what the storage limits are for the lifecycle of a Binder session, but there are limits, and you're guaranteed something like a gigabyte of RAM. Oh, actually, I just meant whether you have ways to restrict people from loading data. Yes, Kubernetes gives you ways to put limits on what a user session can do, and those limits are configurable if you're running your own Binder. Okay, thank you. Any more questions? All right. Thank you. Sure. Thank you.

Our next speaker is Greg Way. He's a postdoctoral associate in imaging-based profiling and genomics at the Broad Institute of MIT and Harvard, and he's going to talk to you about using GitHub as an open science lab notebook. Got it? Yep, I think I do. All right, thank you everyone. It's good to be here; it's been a fun day so far. Thanks for the invitation to speak, and so yeah, I'll just be going through the next 15 minutes. There was a problem... no... okay. There we go.
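The "pull the data down at the top of the notebook" pattern from the Q&A above can be sketched in a few lines. This is a hypothetical example, assuming nothing from the talk: the URL and file name are placeholders, not a real data set.

```python
# Sketch of a data-fetching step placed at the top of a notebook, so the
# repository stays small enough for Git and for Binder's free tier.
# DATA_URL and DATA_PATH are invented placeholders.
import os
import urllib.request

DATA_URL = "https://example.org/measurements.csv"  # hypothetical source
DATA_PATH = "measurements.csv"

def fetch_data(url: str = DATA_URL, path: str = DATA_PATH) -> str:
    """Download the data set once; later runs reuse the cached local copy."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path
```

Because the download is skipped when the file already exists, rerunning the notebook is cheap, and the data itself never needs to live under version control.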
Okay. So yeah, I'm just going to take the next 15 minutes to talk about how we use GitHub in our lab as a lab notebook to pursue open science. I don't need to define open science to this audience, but just so that we're on the same page, I'll say what we think of as open science. Of course, it's transparent: in the data that's input, the software that we use, the results that we output, and also the documentation about how the results are analyzed and interpreted. Open science is reproducible: our GitHub repositories contain analysis pipelines, computational environments, and their various documentation. And it's also empowering: open science will increase participation in and reuse of your code, increase the trust in your science, and also speed the pace of science.

You can think of open science in this framework as a spectrum between closed and open science, where software, documentation, reproducible compute environments, results, and data are all open in a version-controlled environment throughout the course of development. So from the initial GitHub commit, all of these things are made open. I'll be focusing in this talk on how we use GitHub as a solution for our lab notebooks, and I think it's also important to note that whether or not the project itself is completely open from the start, all of these procedures are exactly the same. Whether it's open from the initial commit, or made open by the end when the publication requires you to do so, it's the same exact procedure.

It's also important to note that when you think about GitHub, at least among the people I talk to, most of the time you're actually thinking of using GitHub for a software package, like an open source package. In this talk, though,
I'll describe GitHub from the perspective of using it as an analysis module. The distinction here is that the primary users of a software package will actually not interact with the code; they'll interact with some product that the code generates, whether it's a command-line product or some GUI. In an analysis module, the primary users will actually interact with the code, and most often you will be the primary user, or future you will be the primary user, but others can also interact with the code in the open. So we'll be talking about open science lab notebooks for analysis modules.

Here's just a practical guide to maintaining one of these things, and I thought a helpful way of structuring the presentation would be to use the definitions provided by the National Institutes of Health. The NIH put out a presentation in which they define a lab notebook as: a complete record of procedures, reagents, data, and thoughts to pass on to other researchers; an explanation of why experiments were initiated, how they were performed, and the results; a legal document to prove patents and defend your data against accusations of fraud; and a representation of your scientific legacy in the lab. An open science lab notebook that we develop on GitHub is all of these things, although I'm not going to make any statements about using it as a legal document. I'm not a lawyer, so I'm not going to touch that in this talk; I only have 15 minutes.

In the same presentation, the NIH also describes what a lab notebook is not. According to them, a lab notebook is not a journal.
It's not a record of communications, it's not a place to compile lab protocols or manuals, and it's not yours to take home. I understand where they're coming from in this presentation. I think what they're picturing is more like the molecular biology wet-lab notebook: an actual hard copy where you get your results, print out your western blot, tape it into the actual lab notebook, and sign off at the bottom of the page; maybe your supervisor has to sign off on the page too. In that sense it doesn't really make sense to have these elements in that specific place. But from an open science perspective, I argue that a lab notebook is actually all of these things.

So now I'll walk through each of these eight components and talk about specifically how we implement these solutions on GitHub. The first is that it represents a complete record of procedures, data, and everything else. This is just an example repository here, where, remembering that this is analysis-module style, all of these different elements represent analysis modules. It's almost impossible to see on the slide, but these are standalone, self-contained units. Each represents a specific step in the analysis, whether it's training a model, applying a model, processing the data, downloading the data, et cetera. Embedded within this same repository is an analysis pipeline, so that a user can go to that specific pipeline, whether it's a bash script or a Python script, and know exactly the steps required to reproduce your analysis. Also included is a compute environment.
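A hypothetical layout for a repository organized in this analysis-module style might look like the following. The module and file names are invented for illustration; the structure is the point: numbered, self-contained steps, each with its own README, plus one environment file at the top level.

```text
analysis-repo/
├── README.md            # landing page: what, why, how to reproduce
├── environment.yml      # the compute environment (conda)
├── 0.download-data/
│   ├── README.md
│   └── download.sh
├── 1.process-data/
│   ├── README.md
│   └── process.py
├── 2.train-model/
│   ├── README.md
│   └── train.ipynb
└── 3.apply-model/
    ├── README.md
    └── apply.ipynb
```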
So in our lab we use conda, and this is an environment file that specifies the package versions, so that someone can come to the landing page of your repository and know exactly the environment you're building.

An important consideration that has come up continuously in this symposium is where to put data, and we know that GitHub is not a universal data solution. You can use solutions like Git LFS to manage files of up to two gigabytes, but those are also bandwidth-limited, so it's not necessarily a one-stop shop for all data. For intermediate-sized data there are solutions like Figshare and Zenodo, which do a really nice job of versioning the data and assigning a DOI, and that hundred-gigabyte ceiling is kind of a fuzzy ceiling; you can request more storage. For data that's much bigger than that, you'll likely need a data-type-specific solution, like GEO or SRA for gene expression data or sequences, or IDR for images. But what's important from the perspective of a GitHub lab notebook is that the instructions need to be in the lab notebook itself: not only download and processing instructions, but also upload instructions, describing how the data was actually put onto that server, whether through an API or something similar. It's also really important to version your data.

Secondly, a lab notebook is a complete explanation of experiments and results. I'll motivate this with a concrete example. In this GitHub issue, we were interested in one specific question. In my lab we study cell morphology, and we had a question about how a CRISPR perturbation influences morphology, and how reproducible the morphological profile that results is. So we had this one question, we had this data set, we got an answer, and we documented it in the issue.
We linked it to the code that was associated with generating the results. What was really nice is that a week later, we had a collaborator who was interested in this exact question, and rather than having to rerun the analysis, we just sent them a link to the issue, and that was it. We didn't hear from them again, so either it was a terrible answer or they got it to work.

Another important consideration here is that you have to balance the documentation, and there are lots of different places to put documentation. Sorry that this is a little bit of a busy slide, but there are essentially four basic places to put documentation on GitHub: in issues, in pull requests, in notebooks, and in READMEs. We're still experimenting with where documentation should live, so that when someone comes to the repository, they expect to see certain information in certain places. Basically, in issues we put the experimental setup; we track decisions, whether it's defining thresholds or deciding how to normalize data; and we present the results and brief interpretations. We also use pull requests linked to the specific issue that is answering the specific hypothesis or describing the experiment.
On GitHub there are diff features where you can see how subtle changes to your code may have changed the results, or may have changed your interpretations. As we just heard from Dan, notebooks are really heavily used in this ecosystem; they very nicely link together code, data, and documentation in one place, and you can use Jupyter notebooks or R Markdown. READMEs are where you place the essential instructions to reproduce your analysis, and what's important is that each module should get one. You'll have a landing page when you first go to the GitHub repository, but each of those individual folders will have its own README, because they should be standalone.

This next part is unfortunate: scientific fraud is much more common than is ideal, and we use open science lab notebooks to also help spot bugs. I have a link here; I think Casey's in the back. This was one of the worst weeks of my PhD. I actually did my PhD with Casey, so he probably remembers this pretty well. We had a paper that was accepted to a conference, in Hawaii actually, and I was really excited to go. Then I found this bug through code review. It was an off-by-one error, and it turned out not to be too bad; it didn't change the results all that much. But it was a nice demonstration of spotting a bug this way, because it would have been far worse to find it after publication. And I won't talk about patents.

Certainly open science lab notebooks represent a scientific legacy in the lab, and also beyond. If you invest the time to write thorough documentation, it enables reproducibility, and it's a very common task for researchers, yourself, maybe rotation students, and others to get up to speed with the basic standards of the lab. It also empowers individuals outside the lab, speeds science, and fosters collaboration. Probably one of the major things is that it empowers junior researchers.
Well, I do think that open science increases the burden on junior researchers, or rather shifts the burden toward them, but I think it's what we should have been doing all along, and if we get started early, the junior researchers will pick up these skills quickly. So it empowers junior researchers not only as content creators, but also as content consumers.

Okay, so the next bit is what a lab notebook supposedly is not. I actually think that a lab notebook is all of these things, and I'll go through them. I understand what the NIH is saying here: you're not going to write a personal diary in the lab notebook that holds all your scientific experiments. But it's possible to do this on GitHub; just make sure that you have a consistent labeling scheme. GitHub issues give you a nice way to label each issue as a specific thing. That's really hard to see on the slide; yeah, they're colored, sorry. So if you read a paper and one specific result was really interesting to you, and you thought it could relate to some of the things you're doing, just jot it down as a GitHub issue and then add a specific label to it.

A place to record communications: it's certainly not a place to record all sorts of communications, like what we're doing for lunch, but it is a nice way to communicate with collaborators about something that's specifically relevant to that topic. In implementing these solutions, we've collaborated with a lot of people who don't necessarily have GitHub accounts, so we've had some pretty positive experiences with, well, kind of forcing them to create a GitHub account so they can interact with the science in real time, requesting hypotheses or providing alternative explanations. That's actually been pretty good.
Although there are some technical barriers. A place to compile lab protocols: I completely disagree with the NIH on this one. GitHub is a really nice place to do this, although there are more tailored solutions, like protocols.io, which we heard about today. You could create a dedicated repository for a specific assay development effort to track what works and what doesn't, and you can include negative results in this type of repository. And probably the most important thing is that, in the open science case, this lab notebook is certainly yours to take home. It not only belongs to you; it also belongs to everyone in the open, as long as they can find it. So you also have to make sure you add a permissive open source license, so that people can reuse it easily.

In conclusion, GitHub can be used as an open science lab notebook. Whether you're conducting the work in the open from the first commit, or maybe making it public at the end for a publication, the framework should be consistent throughout. But there are many careful considerations required for data, documentation, and communication. As a concluding remark, I also think there's a lot of space for new tech solutions that could be designed specifically to enable open science lab notebooks: solutions that don't have the technical barriers that deter non-computational scientists, but are still heavyweight enough to handle the reproducibility aspects, like the version control GitHub provides.

With that, I'd like to thank my current lab. Anne and Shantanu have been supportive in letting me take this open science idea on the road, and the lab specifically. Also other collaborators who have been involved with some of the projects I talked about, and Casey and Daniel, who got me started in open science when I began my PhD. So thank you all. Thank you. Any questions?
Thank you. I'm interested in this idea. I first heard about having this kind of setup for a lab notebook from OSF, because they're also promoting the idea of having something like a binder, not to be confused with the other Binder we just heard about. I saw Johns Hopkins has a really beautiful template for lab notebooks on OSF, and I was just wondering if, off the top of your head, you can give us some pros or cons for using one platform versus the other.

Yeah, so I don't have much experience with the others you mentioned, but there's certainly room for others, and I think it's worth experimenting to figure out what works best for you. So far GitHub has solved a lot of the problems of previous solutions. We were using Confluence a lot before, and it doesn't have autolink features as nice as GitHub's, and the notification system on GitHub is a little bit better. Some of those automatic template repositories are really nice, because some of the work we do is: you get a new data set in, and then you have to preprocess it in the same fashion as before. So it's nice that GitHub has these template features now, where you build the backbone, and when you spin up a new repository it fills in all the things you would normally have to fill in.

Any more questions? Yeah. So I'd say that I've trained quite a few people on this platform. I think the barrier to entry is really high, and it's steep, but after you get to that point it kind of levels off; it's internally consistent, but hard to learn. I'd say just continue to be supportive, and just tell them that it's hard. I'm still learning myself.
So, um, yeah, I don't know; a couple of weeks.

Hey, I've got two questions for Dan. The first one is: is the difference between using a Jupyter notebook and something like Google's Colab the story part of it? Is it just that there's so much more context to what's happening? Colab is using the same notebook document, but it's a fork of an old version of Jupyter that is not pushing changes back upstream. And I would say an important difference with Colab is that you can't run your own, and if Google abandons Colab, the way Google has a history of abandoning projects sometimes, it goes away. Excellent points. Okay. The second question is that you mentioned the galleries that you have. Is that something that you built, or could I go and get disciplinary galleries to share with my researchers? So, if you've seen nbviewer, or if you've ever viewed a Jupyter notebook on GitHub, a static view where it's not interactive but you can just read it: nbviewer.org is another Jupyter service, you can deploy it yourself, and that's what we're using as a gallery, with some local customizations to add links that hit internal services. It seems like that's something that, at least right now, you have to do yourself; everybody has their own approach on top of nbviewer.org.

So this is a question that hopefully each of you can answer briefly. I'm often asked, when I present protocols.io: won't people be afraid of sharing openly? Don't scientists want to sit on their secret methods? I think there's enough evidence that some do, but I don't think the majority. But you're at the extreme open end of that spectrum, where you're super open and go above and beyond, while most of us do less. Could you each say a few words on what drives you to be extra open,
with the extra time that it requires?

Those guys are jerks, and they're dying. No, I mean, I think, you know, impact is not just the science that you can do, but how the science that you do inspires other people to follow the lead or innovate on their own. So I think building a culture of innovative approaches is why we share as aggressively as we can.

Yeah, I totally agree with that. I think impact is the main thing for me. It's nice to share my code and then see people use it, and have people I've never met say thank you for sharing this code.

Yeah, I would echo that completely. I had a really formative experience in my PhD, about three years in. I was encoding this 1996 paper into Python, because no one had done it yet, and I had started at one end of the problem. And I found a random stranger on the internet who had, by luck, started at the other end, and we were just about in the middle. So I opened a GitHub issue and said, hey buddy, we're done. It was great, and now other people on the internet benefit too. More stories like that, I think, will teach people the broader lesson: don't underestimate the benefits; don't focus on the risks.

So I have a question for every one of you. The two of you talked about using these command-line software tools to do the data analysis, and Marcel, I know you have a lot of experience with, for example, working with data in the Brain Image Library, right?
So can each of you comment a little on what the frustrating moments are in dealing with code and data, trying to pull all the components together, and what best practices you recommend?

As a non-expert user, you know, I think the frustrating thing is pushing electrons. Not in the organic chemistry sense, but the idea that you need to move data sets from the server to a local machine where you can do some computation on them. So we've spent some time recently trying to reduce our data transfer latency and build local servers with onboard fast drive space. I'll just put in a shout-out for server-grade hardware recycling: it turns out you can buy lots of recycled servers that are being phased out of data centers, and they're super cheap, and they seem to work, at least from some of the refurbishers that we've found. So we've built our own JupyterHub now, and it's got 40 terabytes of space, so we can move big data sets onto it and build our own parallel processing workflow. Because I'm not expert enough to feel like I should use the supercomputing center's resources to piddle around with my data sets until I know what I'm doing and know that it's producing the right results. And then, when I want to do it on a hundred brains, okay,
well, now I'm moving out of that environment, and Binder is going to be fun for that, right?

For me, I guess the most frustrating part of pursuing science in this way is projects in which there's a bunch of collaborators, a bunch of different people with different backgrounds, and a complicated project that requires molecular work, some other computational parts, and then data science. In developing these assays there are a lot of moving pieces up front, so when something changes slightly in the molecular phase, it trickles downstream and causes large changes in the data science parts. When you're working with large data sets and doing these reproducible analyses, rerunning pipelines whenever things change takes a long time, and sometimes subtle things change the way the data is input. Your collaborators are expecting results quickly, and this is where open science can slow you down. Maybe they've collaborated with computational scientists in the past who haven't done these open science workflows and could get results much faster, but I think moving slower up front saves time later on.

Yeah, building on that point, I think there can be tension between "move slow and do things the quote-unquote right way" and "move fast and just get an answer, even if maybe we're not sure we could do it again the same way." Knowing where you are in that space, how many people are going to use this again and for how long, helps have those conversations more constructively, but it can be challenging to communicate about that.

So it seems like all of you have built, or are using parts of, communities that are building things that are really great and helpful for science. My question is: what could you use more of, or different, that would be helpful for this mission? Is it money? Is it people? Is it time?
It's probably all of the above, but is there something we can work on or create, as an institution or community, to come together and help support these endeavors?

Since I have the mic, I'll give a quick answer by stealing someone else's point on this: Tracy, whose last name I'm forgetting, but I think she's the executive director of the Carpentries, or was recently. Yep, Tracy Teal, thank you. Tracy Teal said that when they polled researchers asking them this, what they said was: the time to learn. So if the funding agencies can give cover for people to slow down and learn how to use these tools, I think the desire is there; the management cover and the funding cover need to be there.

I think a challenge for some of these is sustainability. You know, you might have funds; for example, the Brain Image Library that we've got is funded through the BRAIN Initiative, until the BRAIN Initiative goes away. And then we're going to have 40 petabytes of data, and what do we do with that? Right, so I think sustainability is a sort of perennial problem with these resources.

Yeah, I agree, just to echo that point. I think the main thing that would be helpful to me is having resources that can store the data I'm working with in a versioned way that is sustainable, and more of that. It's nice to see now that NIH is backing Figshare and EU funding supports Zenodo, so it looks like these types of solutions are here to stay. But data is only going to grow, and these problems are going to get bigger. And I think there was another interesting point brought up earlier in the session, and that was: how long is a data set useful? You know, maybe 30 years from now, if all of these solutions are figured out and we have permanent storage facilities for data, are we going to keep data that was generated 40 years ago? Maybe that's useful.
Maybe it's not, but I don't know.

I want to add to your answer there about how long data is useful. The data keeps evolving, right? The data keeps evolving and your software keeps evolving. How do you evolve your software to keep track of, to fit, your data, or is it the other way around?

Yeah, evolve the software to fit the data. Ideally, it's a feedback loop; at the end of the day, we're trying to do science, I guess. Are we talking about formats? Is that what you mean by evolving the data, or how it's stored? It's just that if you have an update of the data set, then your software, your tool, somehow has to take that in, right? Oh, yeah. The subject of reproducible workflow management in general makes me feel dizzy and want to lie down. I don't really know; it seems that that's been solved a million ways for every different field or every different lab. At a high level, notebooks are a way that people reach for, but I think people are looking for ways to add more structure to that. The project Cheetah that came up briefly earlier is one approach, and there are literally hundreds.
So I don't know how you'd choose among them. Perhaps with enough data sets, you can envision a scenario in which a data set changes slightly, and maybe there's a machine learning algorithm that can detect which subtle changes will actually propagate downstream. So maybe we'll figure out that problem by using similar computational infrastructures.

I think Tim mentioned, when he was talking about the functional MRI data, that one of the really powerful things in image data sets, at least, is the preprocessing. And you know, these days, with where things are going in biological microscopy, it may be that data sets we collect at 10x resolution today will, with advanced machine learning and deep learning algorithms, essentially produce molecular-scale data 20 years down the road, with all the computational resources brought to bear between now and then. So I think a lot of these data sets may have unexpected lifetimes, because what we get out of data sets at the limit of data collection today isn't information-limited just by the data collection.

Any more questions? All right. So let's give the panel a round of applause.