Brett Smith. And Brett will talk about Arvados: achieving computational reproducibility and data provenance in large-scale genomic analysis.

Well, that definitely covered the gist of it right there, so thank you for that. That probably saves me a minute off my speech. So like he said, I'm Brett Smith. I'm a developer at Curoverse, and we develop Arvados, which is an open-source platform for storing and computing on your bioinformatics big data.

Just to review some of the very basic information that you'd want to know about Arvados: definitely feel free to check out our website; obviously, that's where a lot of the development work is focused. Arvados is licensed under the GNU Affero GPL version 3. We do have SDKs under the Apache license, so if you need to develop scripts that run on the Arvados framework, you can do that regardless of their licensing, but all of our own stuff is copylefted. It's written primarily in Rails and Go: basically anything that's web-facing gets written in Rails, and most of the other stuff is making its way into Go. We have some tools in other languages as well, but that's the way things are headed right now.

So this is a session about reproducibility and about provenance, and I want to give you a brief overview of the architecture of Arvados to help you understand what we've done to support those goals in bioinformatics research.

The first piece of our infrastructure, and probably the most critical one we have, is our storage layer. Whenever you want to do any kind of analysis, you need to start with some data to run the analysis on, right? In the Arvados system, the first thing you do is upload that data to our storage system, called Keep. Keep is content-addressable storage. What this means is that when you want to get the data back after you've put it in, you refer to it by the content itself. Or rather, we cheat a little bit and use a cryptographic checksum, the MD5 sum in our specific case. So when you put the data into Keep, you tell it: here's the data, here's the checksum. Keep will double-check that you got that right, and then say, okay, I have your data. When you want that data again in the future, you tell Keep: give me the data with this checksum, and it returns it to you. A pretty simple concept, fundamentally, right? But there are a couple of interesting principles that shake out of this that I want to spend a little time dwelling on.

Two interesting facets of this. First, the data is immutable. Once you upload data to Keep, you can't go back and say, oh wait, I need to change this. The most you can do is upload a new version of the file; it will have a different checksum, and you'll address it differently. This way you don't have to worry about whether the version of the file foo.txt that you have is the version you actually should be working with: that will be clear from the checksum itself. Second, you get a very rough versioning system. It's not history; given two versions, you can't tell how they relate. But you can at least tell which version you're working with: the version with checksum A or the version with checksum B.
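To make that put-by-checksum, get-by-checksum contract concrete, here is a minimal sketch in Python. It's a toy in-memory model, not the actual Keep client or its API, but it shows why immutability and rough versioning fall out of content addressing.

```python
import hashlib

# Toy content-addressed store: blocks live in a dict keyed by the
# MD5 checksum of their contents.
store = {}

def put(data: bytes) -> str:
    """Store a block and return its content address (MD5 hex digest)."""
    checksum = hashlib.md5(data).hexdigest()
    store[checksum] = data
    return checksum

def get(checksum: str) -> bytes:
    """Retrieve a block by checksum, verifying it on the way out."""
    data = store[checksum]
    assert hashlib.md5(data).hexdigest() == checksum, "corrupt block"
    return data

addr = put(b"ACGTACGT\n")
assert get(addr) == b"ACGTACGT\n"

# "Changing" the data just creates a new block with a new address;
# the old address still names the old, immutable contents.
addr2 = put(b"ACGTACGA\n")
assert addr2 != addr
```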
Now, given these properties, as a practical matter it would obviously be difficult for us to ask people to upload gigabytes or even terabytes of data all in one go in a single HTTP request. Keep is all HTTP-based: when you upload data, you do a PUT request, and you get it back with a GET request. That doesn't always scale comfortably when you're dealing with very large amounts of data. So how do we tackle that problem? We have Keep manifests. These are a simple textual format that describes how you can reconstruct the original data structure out of a series of Keep blocks. You can upload at most 64 megabytes of data to Keep at a time, and our upload client will automatically take care of dividing your data into blocks for you. When you get the collection back out, you have a manifest that describes exactly which blocks you need to fetch and where the files are within those blocks, so that you can reconstruct the original data as it existed on disk.

As a very brief overview, here's a sample of how a Keep manifest works. I know there are a lot of random characters on there, because that's how checksums work, but it's very simple: there's a list of the checksums of the Keep blocks involved, and then a list of the files that are in those blocks and their locations within them. Relatively straightforward.

Where do we put these manifests? Well, we could put them in Keep itself; we can store whatever we want in Keep. But as a matter of fact, we store them on the Arvados API server, which is really the workhorse of the entire Arvados architecture. This is the one piece where everything comes together to make the entire system work. The API server does an excellent job of keeping track of where all the data being fed into it comes from: who provided it, when it arrived, and whether the data is the output of some process that was run through Arvados. By requiring that all the data running through the system go through the API server, we can keep hold of the kind of information we need for reproducibility and provenance later.
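As a rough sketch of the idea behind manifests (the real manifest syntax has more detail than this, and a real client PUTs each block to Keep over HTTP rather than just hashing it), splitting a file into blocks of at most 64 megabytes and describing it might look something like:

```python
import hashlib

BLOCK_SIZE = 64 * 1024 * 1024  # Keep's upper bound per block

def split_into_blocks(data: bytes):
    """Yield (locator, block) pairs for successive <= 64 MiB chunks,
    where a locator is the block's MD5 checksum plus its size."""
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        yield "%s+%d" % (hashlib.md5(block).hexdigest(), len(block)), block

def build_manifest(filename: str, data: bytes) -> str:
    """Build a manifest-shaped description of one file: the list of
    block locators, then the file's position, size, and name."""
    locators = []
    for locator, block in split_into_blocks(data):
        locators.append(locator)
        # A real upload client would PUT each block to Keep here.
    return ". %s 0:%d:%s\n" % (" ".join(locators), len(data), filename)

print(build_manifest("reads.fastq", b"@read1\nACGT\n+\n!!!!\n"))
```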
So let me show you a quick example of a job that runs through Arvados, a specific chunk of analysis, so that you can start to get an idea of what happens here. Much like Keep, the API server has an HTTP REST interface and works primarily with JSON: when you want to add objects to the API server, you upload some JSON, and when you request objects back, whether one or several, it gives you a JSON response.

This is an example of how one job definition might look in Arvados. It starts with a definition of the command we're actually going to run. All of the scripts that Arvados refers to are stored in Git, so it names the specific Git repository that we're using, the specific version within that repository that we want, and the actual name of the script. In this case the developer has specified a particular Git commit hash; these are also checksums, SHA-1 in Git's case. We could instead specify a branch or a tag, and Arvados would figure out the latest Git commit that corresponds to that branch or tag and do the resolution for us.

We then follow with the inputs to this specific command. The tool we're running here is a wrapper around an existing command-line tool, in this case the BWA aligner used in a GATK pipeline, so one of the inputs is actually the invocation that we want to run. We go on to specify the Keep data that we want to run on, or rather to specify that we need at least one reference collection from Keep, and we have a default specified in the middle there as well.

We also have runtime constraints at the bottom, including the name of a Docker image. Who here is familiar with Docker? Okay, so a reasonable number. For those of you who aren't, real quick: Docker is a containerization solution for Linux and other Unix-like operating systems. It lives roughly halfway between chroot jails and machine virtualization like Xen. A container doesn't get dedicated hardware the way a virtual machine does, but compared to a chroot jail, you can better isolate the process from other system resources on the host, such as networking or storage. So we have the symbolic name of a Docker image in this runtime constraint. We also have some information for the job scheduler: this specific job needs to run on systems that have at least eight CPU cores. Arvados supports a number of basic parameters for the compute nodes so that you can make sure your jobs run optimally.

So this is an example of a job that you would upload to the API server, and it might respond with something like this. It's slightly changed, because what the API server has done is take all of our symbolic names and resolve them to stable names that we know we can refer back to forever. Fortunately, the Git version was already specified as a commit hash, so that didn't have to change. We recorded the specific reference collection that the user requested to run on: previously we were just saying that we need a reference collection, but once we actually launch the job, we record the specific reference collection and the other input parameters provided by the user. We've also taken the symbolic Docker name, the repository plus tag that Docker uses, and converted it into a checksum as well: Docker also identifies images by checksums under the hood, much like Git, so we figured out which image the name corresponds to and recorded its checksum.

So now, with this information, if we want to go back and rerun this job, we can do it using exactly the same tools as the original run, all the way down to the C library. The script is pinned by its Git commit checksum, and everything below that is provided by the Docker image, which we've also resolved to a unique checksum. Between those two, that's the entire software stack right there, and it's not a lot of overhead for us to record just these few basic identifiers so that we know we can come back to this later.
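As a rough illustration of that resolution step (the field names here are simplified, not the API server's exact schema, and all the identifier values are made up), the job you submit and the job the API server records might differ like this:

```python
# What a user might submit: symbolic, mutable names.
submitted = {
    "script": "run-bwa.py",                    # hypothetical wrapper script
    "repository": "example/pipelines",
    "script_version": "master",                # a branch name...
    "runtime_constraints": {
        "docker_image": "example/bwa:latest",  # a mutable Docker tag...
        "min_cores": 8,
    },
}

# What gets recorded: everything pinned to stable identifiers.
recorded = dict(submitted)
recorded["script_version"] = (
    # ...resolved to the exact Git commit at submission time:
    "4f7e1d2c9ab35c0b8d6e21a4f3b9c8d7e6f5a4b3"
)
recorded["runtime_constraints"] = {
    # ...and the Docker tag resolved to the image's checksum:
    "docker_image": "a1b2c3d4e5f67890a1b2c3d4e5f67890a1b2c3d4",
    "min_cores": 8,
}

# Rerunning `recorded` later uses the same script version and the
# same software stack, down to the C library inside the Docker image.
```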
In addition to the reproducibility benefit, we also have, as I mentioned briefly, a provenance benefit: the capability of knowing where information came from, so that we can go back and provide that information on request. How we present this information is still under development, so I'm going to show you a couple of very basic visualizations just to give you an idea of what we're capable of.

This first graph is a pipeline graph that illustrates the flow of a specific pipeline. You read it top to bottom: the pipeline starts at the top and ends at the bottom. This is our tutorial pipeline, so if any of you went through the Arvados tutorial during the CodeFest, you might find some of this terminology familiar. The small rectangles are specific jobs that ran in Arvados, and the ellipses in between represent specific inputs to those jobs. So we can see in this specific pipeline which inputs were provided, which jobs they went to, and how everything flowed through the system. We can even see the intermediate inputs. This very basic pipeline just calculates the checksums of some files and then runs a grep on them; it's just a tutorial, like I mentioned. But you can see the intermediate md5sum.txt in the middle that came from the first job. We record that, so we know it came from the first job and was fed into the second job: we have the provenance of the entire pipeline.

But we can do better than just saying, you ran this pipeline, and here is specifically what you ran it with. Given any piece of data you're looking at, we can also go back, see where that data came from, and answer that question for you. So here is a provenance graph for a specific collection, i.e. a set of files in Keep. Reading it top to bottom, the thing at the top is the specific collection whose provenance we're looking at. This collection came from a job like the example I showed you earlier; a pipeline can just be a series of jobs like that, and that's what we see here. This collection was the output of the second job of a particular pipeline, a GATK pipeline. At the bottom we see a couple of different Keep collections that fed into one job, which yielded an output. We took that output plus another collection in Keep, ran them through another job, and that's how we got this specific collection. We can show all of that provenance to you on request. Like I said, the visualizations for revealing all this information are something we're still working on, but we definitely capture it, so we can show you the origin of any piece of information in the entire Arvados framework.

I know that's a lot of technical material to take in, so let me recap with a very high-level overview of the entire architecture, now that you've seen some of the specific pieces, just to better illustrate how they fit together. You start with your input files and put them into the Keep storage system, and the Keep manifests go into the API server. You send jobs to the API server as well, for computation on those inputs. The API server farms those out to compute nodes, and as the jobs run, their output files go right back into Keep. Once they finish, you might have more jobs to run in the pipeline; those also get instantiated on the API server, and we continue that loop through the entire analysis, capturing everything we need for reproducibility and provenance along the way. So that's the basic architectural core of Arvados in a nutshell.
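Since every output collection records the job that produced it, and every job records its input collections, a provenance graph like the ones above can be reconstructed by walking those links backwards. Here is a minimal sketch; `api.job_that_produced` and the `inputs` field are hypothetical stand-ins for the real API server queries, not the Arvados SDK:

```python
def print_provenance(collection_id, api, depth=0):
    """Recursively print where a collection came from, following the
    recorded output -> job -> inputs links back to uploaded data."""
    indent = "  " * depth
    job = api.job_that_produced(collection_id)  # hypothetical query
    if job is None:
        print(f"{indent}{collection_id}  (uploaded input)")
        return
    print(f"{indent}{collection_id}  <- job {job['script']} "
          f"@ {job['script_version']}")
    for input_id in job["inputs"]:
        print_provenance(input_id, api, depth + 1)
```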
If you'd like to learn more, like I said, there's the website. We also have an IRC channel; it's active, admittedly, usually more during the workday, but we're always around. It's the #arvados channel on OFTC; feel free to join us there. And if you're interested, we currently have an install on Amazon Web Services, so if you'd like to try out Arvados with a beta account, get in touch with me during one of the breaks or afterwards, and I'd be happy to get you hooked up. So that's the intro, and I'd be happy to take questions.

It's time for one question.

Well, I'll start with just one, then. You mentioned provenance, and I suppose that's data provenance, but I was wondering about the provenance of the executables and scripts as well, because you can have the same executable at the same version, but built for different architectures, and then they have different checksums. What are your thoughts on that?

On the tools, like the GATK jars themselves, or the Python scripts?

So we expect that most of the computation people will want to run in Arvados will start in a high-level language like Python or Ruby. We provide SDKs for a number of languages, and we expect those launchers to be checked into a Git repository, so those scripts will generally be architecture-agnostic, and Git will give us the versioning information over time so we can keep track of that. In the specific cases where you do need to launch architecture-dependent binaries, we generally expect that to be captured by Docker for us: either you'll have one Docker image that contains all the different versions you need, or you'll distinguish between different Docker images to know whether you're running the binary for architecture X or Y. So it may not be as fine-grained as you're looking for, but provided you build your Docker images appropriately, we should be able to distinguish which one you're working with.

Okay. Can I ask one more question?

Yeah, I think we need to move on to the next