So my name is Kelsey Montgomery. I'm a senior bioinformatics engineer at Sage Bionetworks, which is a health research nonprofit in Seattle. I have worked with and manipulated large amounts of data for my entire career, from coral reef microbiome samples in the Florida Keys, to mitochondria and neurons, to the work I'm doing currently to support open science communities in the analysis of large genetic data. So scientific research generally follows this arc that you see on the screen: a testable idea, we plan for the experiment, we organize the data, we test our hypothesis, we retest our hypothesis, we synthesize these results and share our findings with the community. It's linear, it's clean, it feels good, right? However, research can start in one place and end in a totally different place. So how do we justify that journey? How do we communicate about that journey? In practice, science gets complicated fast. This is a super common problem in our technological world. The sheer magnitude of information can feel daunting and inaccessible, yet we need sufficient data to build robust claims. As these data sets grow, so does the complexity. There's so much movement and dynamic change in an analysis. For example, in my life, I might revise some code, I might share preliminary findings with my peers, or I might reprocess the entire data set and reproduce several thousand files. So tracking these transformations is important because I want my results to be credible and I want someone else to be able to independently verify my claims. All these transformations are really critical to that independent validation. So how do we share our journey, our findings, and communicate our learnings with the broader community? Where I work at Sage Bionetworks, we use a tool called Synapse that solves this problem of complexity and sharing very well.
Our engineers who designed and built Synapse worked hard to implement the FAIR principles: findable, accessible, interoperable, and reusable. So you will notice some of the ways that we applied FAIR in Synapse as I describe this tool to you. I'll walk you through some strategies for organizing your data for sharing, how and why they work, and how you can implement them in your own professional life. I'll mention this again at the end, but you see a link there to synapse.org slash CSV Conf 2021. This is a public project with this presentation, and you can check out some of these features and strategies in action. I definitely want to acknowledge that contributing to a public resource like this definitely requires the right incentive structure. Most scientific research is inherently incentivized to contribute to a public space, whereas other industries that you might be a part of might require you to devise some more creative ways to make your work open. So working with data and analyses in this way is a style of practice. And what do I mean by a style of practice, and what does a style of practice mean to you? So for me, one of my daily habits and practices involves yoga. There are concrete rules of yoga that, when implemented routinely, become integrated. Sitting without movement is hard but might gradually, day by day, become comfortable. Breathing in a certain way might not be accessible until slowly you're able to integrate it with your movements. So I would love to know how many of you have already established a style of practice. Earlier today, I really appreciated Ryan's idea on snippets and I totally plan to incorporate that into my daily or weekly life. I would love to see some other ideas in the chat. So, solving the problem of complicated science and sharing: our goal here is to develop a process to incorporate into your daily work.
In my experience, I find that there's a common professional insecurity about sharing our nuanced, sometimes messy, sometimes inaccessible process. We experience a lot of ups and downs, trial and error, that get us to an insight. I think it's very important for us to normalize this process of sharing work early and often, and in doing so we'll reinforce this kind of muscle memory. So Synapse is a tool, it's open source, it's available for anyone to use, and Synapse helps achieve these three goals: share your data responsibly, make your work accessible, and make your work pass independent validation. So the concepts we're about to explore all map back to these three goals. And why these goals? It's always important to answer the why. Responsible sharing: this seems like a given, maybe, but requires special considerations. I have a special relationship with responsible sharing because my work deals with human data, and it is critical to plan for responsible sharing in order to mitigate harm to research participants. When we talk about accessibility, accessibility empowers community, and empowering community is the best way to get to the best answers most quickly. Science is a community and I wanna share my learnings. So independent validation, first of all, what does this mean? It means that you can arrive at the same insight by using my process. So this is an incredibly daunting and difficult task. It's a challenge: data is complicated and big, and I need to organize my work for sharing as I go. These goals are not mutually exclusive; the more accessible my work is, the easier it is to validate independently. So let's deconstruct this. How can I share data responsibly? Just as with any information you make public on the internet, special care and consideration must be made when sharing your data and complete analysis. So ethical norms are critical. It is not ethical to put some human data in the public domain, and all genetic data is at risk of re-identification.
However, you can implement a permission structure in Synapse to keep your data restricted to protect a participant's identity. You can decide one file is visible while another is not. And there are some strategies for preserving accessibility that I'll mention later on in this presentation. For example, to keep the large omics data we host on Synapse accessible, interested users might need to state their intended use of the data. And beyond human data, any healthy community benefits from a code of conduct to communicate integrity and respect; anyone who interacts with Synapse must agree to this code of conduct. How can I make science accessible? So when you have data that is appropriate to be shared publicly, look for opportunities to not put your data or results behind a paywall. In our tech community, this might look like documenting your code and methodology in a public GitHub repository; as journalists, you might rely on ProPublica or The Intercept for public information. In genetics, it might be the biomedical data from the All of Us Research Hub. So in Synapse, we make sharing really easy. There's a button, "make public," and your entire analysis, files and supporting language, is available publicly. Permanent. So this might seem obvious. We all love and hate that if it's on the internet, it is there forever. But when I say permanent, I imply an immutable reference. So this strategy is potentially more specific to data management and project reference. In Synapse, every entity, which could be a file, is assigned a unique, immutable identifier. This entity can change over time, but the identifier does not. This works by appending a version number to the unique identifier. In my illustration, this unique identifier can get you a specific file within this linear history. So a single source of truth certainly reduces complexity. I can always find a version that I created before, and I can also find the exact version that led me to my insight.
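The versioning idea can be sketched in a few lines of code. This is a toy model of the concept, not Synapse's actual implementation; the only Synapse-specific detail it borrows is the shape of the identifiers, where an entity ID looks like "syn123456" and a specific version is referenced by appending a version number, like "syn123456.2".

```python
# Toy sketch of a versioned, immutable identifier scheme
# (illustrative only, not the real Synapse implementation).
class VersionedStore:
    def __init__(self):
        self._files = {}   # identifier -> list of contents, one entry per version
        self._next_id = 1

    def create(self, contents):
        """Store a new file and return its permanent identifier."""
        identifier = f"syn{self._next_id}"
        self._next_id += 1
        self._files[identifier] = [contents]
        return identifier

    def update(self, identifier, contents):
        """The entity changes, but the identifier does not:
        a new version is appended to its linear history."""
        self._files[identifier].append(contents)
        return f"{identifier}.{len(self._files[identifier])}"

    def get(self, ref):
        """Resolve 'syn1' to the latest version, or 'syn1.2' to exactly version 2."""
        if "." in ref:
            identifier, version = ref.split(".")
            return self._files[identifier][int(version) - 1]
        return self._files[ref][-1]

store = VersionedStore()
file_id = store.create("counts v1")          # file_id is "syn1", forever
ref = store.update(file_id, "counts v2")     # ref is "syn1.2"
print(store.get("syn1.1"))                   # the exact version behind an insight
```

The key design point is the single source of truth: an unversioned reference always gives you the latest state, while a versioned reference is immutable and safe to cite.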
To bridge the gap between data and publication, we can link these file transformations to digital object identifiers, or DOIs. Data set citation is super important. A DOI, for those of you that haven't interacted with DOIs before, is also an immutable identifier, and many journals require you to register your data by assigning them a DOI. So in Synapse, you can mint a DOI directly with the click of a button, which is fantastic and incredibly useful at times. Make your work discoverable. The world is full of information. How do I find your project? How do I find your research? The way in which I encounter your work is entirely up to you. So one strategy for making your work discoverable is to add metadata. Meta means higher order; metadata is the information that describes the data. So for example, here's the data of what I do: large amounts of genetic data from brain tissue of more than 2,000 individuals diagnosed with Alzheimer's disease, and this data is available in the AD Knowledge Portal. The metadata of what I do: human data, RNA, brain, Alzheimer's disease, open science. So we can distill our work into a few concrete terms and apply it as metadata. Machines require structure. So when I say machine readable, I'm referring to structured data, like a common vocabulary. So I discussed the importance of some human data not being appropriate for the public domain. One strategy for remaining discoverable is to keep the metadata describing the files public. This can enable some accessibility. So Synapse accepts a structured vocabulary to apply to your projects and files to make them discoverable. And we, the Sage Bionetworks scientists, have also developed a common vocabulary, or ontology, for our open source communities to make data discoverable. And where possible, we build on existing ontologies. We definitely rely heavily on other open source communities.
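The metadata strategy can be sketched as key-value annotations drawn from a shared vocabulary. The keys, values, and identifiers below are hypothetical, chosen only to echo the brain/RNA/Alzheimer's example above; the point is that the annotations stay queryable even when the underlying file is restricted.

```python
# Sketch of structured metadata: public annotations on restricted files.
# All identifiers and vocabulary terms here are made up for illustration.
files = [
    {"id": "syn101", "public": False,
     "annotations": {"species": "human", "assay": "rnaSeq",
                     "tissue": "brain", "diagnosis": "Alzheimer's disease"}},
    {"id": "syn102", "public": False,
     "annotations": {"species": "human", "assay": "wholeGenomeSeq",
                     "tissue": "blood", "diagnosis": "control"}},
]

def discover(files, **terms):
    """Return the files whose metadata matches every search term.
    The data itself may be restricted, but the description is discoverable."""
    return [f["id"] for f in files
            if all(f["annotations"].get(k) == v for k, v in terms.items())]

print(discover(files, tissue="brain", assay="rnaSeq"))  # ['syn101']
```

Because every project uses the same controlled terms, a machine (or a portal search box) can find relevant files without any human having seen the data.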
I also wanna throw out that we have built programmatic clients in Python, R, REST, and the command line to interact with data on Synapse, empowering broad accessibility. So you can build this pushing and pulling of data into your analyses and into your processing, build it into your workflows. Beyond Synapse, you can also add meta tags in Google to make your work discoverable. So this is being applied in other tools and domains elsewhere. Discoverability, public, permanence: all these help maintain the accessibility of your work. So let's recap. The bigger problem we're trying to solve is how to reduce and communicate complexity in our work. We've touched on making our work accessible and being responsible data sharers; let's explore how to create work that can be independently validated. We want robust claims and we need our work to hold up under public scrutiny, so let's set our critics up for success. Make your work have provenance. Provenance is the chronology of ownership. If you incorporated a raw dataset into your analysis (this could be an incident from the US Press Freedom Tracker API), cite the raw data such that someone interpreting your analysis can find the exact CSV with the line-item incident. So the only way someone can arrive at your same conclusion is to follow your process, just like leaving a trail of breadcrumbs back to the source. Document all the ways you transformed your data. I need to know how you arrived at every insight, every step of the way. So we have a feature in Synapse called provenance, where the data can be linked together by unique identifier, and additionally we can link code to output. Imagine something like the Tree of Life, or something like this illustration here: an interconnected web of transformed data. The brilliance of provenance is, it is a snapshot in time. A file might continue to change and evolve, but the provenance of your analysis preserves the connection to data at the time of your analysis.
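The shape of a provenance record can be sketched like this. Synapse's provenance model distinguishes inputs that were "used" from code that was "executed"; everything else below, including all the identifiers, is made up for illustration. Because each link points at a specific version, the record stays a snapshot in time even if the files later change.

```python
# Sketch of a provenance graph: each derived file remembers the exact
# versioned inputs it used and the code it executed.
# All identifiers below are hypothetical.
provenance = {}

def record(output_ref, used, executed):
    """Pin an output to version-specific inputs and code."""
    provenance[output_ref] = {"used": list(used), "executed": list(executed)}

record("syn300.1",
       used=["syn101.2", "syn102.1"],                    # exact input versions
       executed=["github.com/example/analysis@abc1234"]) # immutable code reference

def breadcrumbs(ref, depth=0):
    """Walk the trail of breadcrumbs from an insight back to its raw sources."""
    print("  " * depth + ref)
    for parent in provenance.get(ref, {}).get("used", []):
        breadcrumbs(parent, depth + 1)

breadcrumbs("syn300.1")
```

A critic who wants to validate "syn300.1" can follow the trail down to the exact raw files and the exact commit, which is the whole point of provenance as independent validation.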
Transparency. So this is where leveraging technologies adjacent to Synapse can be the missing ingredient to independent validation. This strategy, or goal, is rather specific to the computational world, where large quantities of code, and information about the working environment specifically, must be documented in order to reproduce an insight. To be transparent with large quantities of code, leverage GitHub, which I'm sure all you superstars already are. Additionally, I heavily rely on a technology called Docker to guarantee someone can rerun my analysis on their own. A list of instructions that includes information like the operating system can create a lightweight mini computer. This mini computer only does the things needed to rerun my analysis. So that's Docker. Synapse supports linking to external sites, including immutable GitHub references, to connect data to code. And Synapse also contains its own Docker registry. So having provenance and transparency helps ensure your analysis will pass independent validation. All these strategies are interrelated, and together we have a kind of mind map to reference and incorporate into our style of practice. So going back to the beginning: science gets complicated fast, and we wanna minimize that complexity in order to share our findings with the community. We established three goals in order to solve that problem: accessibility, independent validation, and responsible sharing. These six strategies will help you achieve those goals by providing a structure that you can incorporate into your daily work. Share your nuanced process early and often. Make your work discoverable, public, permanent, ethical, transparent, and have provenance. So just like sitting and breathing for one minute, then five minutes, then for however long you care to sit and breathe in one place without moving, this practice of staging, structuring and sharing your insights can find its way into your day to day.
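As a concrete footnote on the Docker idea mentioned above, the "list of instructions" is a Dockerfile. This is a hypothetical minimal example, with made-up file names, just to show the shape: the operating system is pinned, the dependencies are pinned, and the image does exactly one thing.

```dockerfile
# Hypothetical minimal Dockerfile: a recipe for the "lightweight mini
# computer" that can rerun one analysis and nothing else.
FROM ubuntu:20.04                       # the operating system, pinned
RUN apt-get update && \
    apt-get install -y python3 python3-pip
COPY requirements.txt .                 # exact package versions, recorded
RUN pip3 install -r requirements.txt
COPY run_analysis.py .                  # the analysis code itself
CMD ["python3", "run_analysis.py"]      # the one thing this image does
```

Anyone with Docker installed can build and run this image and get the same environment the original analysis ran in, regardless of what is on their own machine.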
Like all great analyses, the contents of this presentation originated somewhere, if not my brain. So I connect this presentation back to the open source art generated by photographers on Unsplash, the Synapse features I mentioned, the various examples I cited, and The Open Source Way 2.0 for their articulation of the style of practice. So thank you again, CSV Conf. This is such an amazing group of participants, and I've really enjoyed being here, and I look forward to your questions. Thank you very much, Kelsey. Let's take a look. If anyone has any questions, make sure to throw those in the chat or in the question box below and we'll get these to the presenter. Really a quick question. Actually, two questions came to mind. One about the collaborative nature of the work: is this only at that sort of final stage where you're sharing out, or is there pre-work? Yeah, so my goal is for someone to leverage Synapse to collaborate consistently through their entire process. So you can imagine that Synapse is kind of like a staging space where you can document not only the ways that you're transforming data but then invite people to come in and view it. So you can invite a Synapse user to have access to your project. You can invite a team. We have several different sharing structures available.