Hello, I'm Lucia Carbone from Oregon Health & Science University in Portland. And I'm Adam Arkin from UC Berkeley and Berkeley Lab. Can I start? Well, we can start, right. So welcome to the next session. We have a great lineup of people who have studied comparative genomics from microbes to mammals. And we're going to start the session with Bonnie Hurwitz, who is a professor at the University of Arizona, a computational biologist, and someone who has been a leading light in something called iMicrobe, which is one of the more interesting ways of putting together data of diverse types.

Okay, great. They're still getting my slides up. So first of all, I just want to say that I'm really excited to be here. My first love and passion in the world was comparative genomics, in fact in the plant science world. So it's really fun to be here and to hear about all of the advances that have been moving the field. Today, my talk is actually going to be about an entirely different world. We're going to go on a deep, dark exploration of the ocean sciences, connecting data sets from large-scale oceanographic cruises with omics data, environmental data, and a whole mixture of different things together.

And as part of this journey, before you start tuning out and saying, oh my gosh, the lull of the sea is putting me to sleep, what I want to argue on behalf of metagenomics and the microbial community is that what we have experienced in the world of metagenomics is just an amazing amount of data, just as you have as well. So we're navigating those seas of data, trying to do all of this high-performance computing, and at the same time moving into a world where we need to bring together disparate data sets, joining the islands of data, as I like to call it.

So today I'm going to talk about a project called Planet Microbe, which has been funded by the National Science Foundation. It's about reintegrating data sets from the big, large-scale oceanographic cruises that have been funded by the National Science Foundation. But like any data science project, we often find ourselves starting off in a bit of an isolated place, on a journey where we can see only a subset of the data, part of a much larger iceberg, a huge data set that we're ultimately going to uncover and work towards discovering, but that we can't really see just yet. And one of the main problems is that as we start to analyze the data sets, we uncover all kinds of issues, and reintegrating those data sets involves a lot of cleaning, so get out your brooms, before we can actually start working with the data and doing the innovative science that we want to do.

So in my world, we have about 35,000 marine metagenomes, which is a fabulous, huge repository of metagenomic data. But the problem is, if you actually dig into these data sets, there are seven different formats for latitude and longitude, 27 different formats for collection dates and times, and over 830 other kinds of metadata fields, and none of it is consistent. So if you want to do a global analysis of information about the world's oceans, you can't possibly do it given the data that are there initially.
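Just to make that concrete, here's a minimal sketch, my own illustration rather than anything from Planet Microbe's actual code, of what harmonizing even one of those fields involves. The three input formats are hypothetical stand-ins for the kinds of variants you find:

```python
# Illustrative sketch: folding a few hypothetical latitude formats
# into one signed decimal-degrees convention before integration.
import re

def normalize_latitude(raw: str) -> float:
    """Convert assorted latitude strings to signed decimal degrees."""
    s = raw.strip()
    # Form 1: signed decimal degrees, e.g. "-23.5"
    if re.fullmatch(r"[+-]?\d+(\.\d+)?", s):
        return float(s)
    # Form 2: decimal degrees with hemisphere letter, e.g. "23.5 S"
    m = re.fullmatch(r"(\d+(\.\d+)?)\s*([NSns])", s)
    if m:
        value = float(m.group(1))
        return -value if m.group(3).upper() == "S" else value
    # Form 3: degrees and decimal minutes, e.g. "23 30.0 S"
    m = re.fullmatch(r"(\d+)\s+(\d+(\.\d+)?)\s*([NSns])", s)
    if m:
        value = int(m.group(1)) + float(m.group(2)) / 60.0
        return -value if m.group(4).upper() == "S" else value
    raise ValueError(f"Unrecognized latitude format: {raw!r}")

# All three spellings of the same position collapse to -23.5:
for raw in ["-23.5", "23.5 S", "23 30.0 S"]:
    print(raw, "->", normalize_latitude(raw))
```

Multiply that by seven latitude-longitude formats, 27 date-time formats, and 830-some metadata fields, and the scale of the cleaning problem becomes clear.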
So one of the things I like to think about is that once you have gone through this process of cleaning and screening the data sets, once we have this finely milled flour, this beautiful cane sugar, all of these pieces, we want to put them together into a beautiful cake, a data cake, where we're integrating all of those pieces. But I actually see it a little differently. In my world, my kids and I watch Cupcake Wars a lot, and I feel as a data scientist that that's actually the model we need to move towards. Rather than following someone else's recipe with a prescribed set of very specific ingredients to create a particular cake or product, the next generation of data science needs to be able to pull together ingredients from a list of things that you might not have originally anticipated. In particular, maple bacon cupcakes could be a great creation. And so as we start to think about this, we need to be able to pull together data sets from very different kinds of places that may be associated with one another in ways we don't yet know.

So there has been a huge push towards FAIR data principles. FAIR stands for findable, accessible, interoperable, and reusable, and these principles are fundamental to data integration. In particular, I've highlighted one field here, which is machine readability. Not only do the data sets need to be interpretable to humans, they must also be machine readable, so that we can bring those data sets together using the power of computing and run machine learning and other algorithms to find patterns in the data that we as humans might not recognize right away. So let's use the computers to help us out.

One of the great initiatives in the microbiome space is that the folks in the data science world have come together and held hands. We have a group from the Human Microbiome Project; we have the Joint Genome Institute, DOE, and all of the environmental data sets; and then us in the ocean sciences, all trying to pull these data sets together and look at them in a way that allows us to create standards and high-quality data sets that we can use and interoperate with one another.

So today I'm going to give you a quick perspective on what that actually looks like through a project that I direct called Planet Microbe. The main idea of that project is that we want to reintegrate the omics data sets, we want to use standardized semantics for increased data interoperability, and then we want to be able to analyze and drive analytics on top of those data sets. One of the unfortunate things about research cruises is that although it costs $40,000 a day to drive a boat and pick up samples, oftentimes once the data have left the boat they get separated from one another, for the simple fact that there are different disciplines involved: chemists, genomicists, and so on. And those data have not been actively reconnected. So that's what my project is working on. There are also many different representations of data from many different cruises. You may remember Craig Venter sailing around the world in his yacht collecting samples and doing some of that first Global Ocean Sampling effort.
But there have been many more associated projects since then, and everyone has their own way of annotating things that they think is the best. So what we're doing is what I call Gattaca: taking the best of everything from all of them and reconnecting them back together.

One of the things I like to think about is that when I think about data, I always think of them as characters, offering some perspective in a given story. We might not know everything about a character, and we uncover that as we go along in our scientific stories. But one thing to realize about characters is that there are main characters and characters that are secondary to the story, and oftentimes the perspective of the character is really important. As we start to analyze data sets, we might look at them from different perspectives. In the book Moby-Dick, there's the perspective of Ishmael, but there's also the perspective of the whale. In the world of omics and sequencing in the ocean, we might have the perspective of the chemist or the perspective of the microbial genomicist, and each of them is going to offer different contributions. By connecting those data sets together and letting people use them together, we can actually start to promote interoperable science.

So here's a quick example to show you some of the nuts and bolts of the problems we deal with. One of the things we have, for example, is the concentration of nitrate in water, and nitrate is represented in many different ways across the ocean science data sets. What we do is bring it back to a common ontology using ENVO, the Environment Ontology. Then we bring it back to the same units and semantics using a unit ontology through the OBO Foundry. And finally, because we want to know a lot about the equipment the data were sampled on, we also link it back to protocols that are open source within the domain, so that people can not only understand and integrate data sets using common ontologies and semantics, but at the same time understand how those data sets were collected and how they might be used in the future.

Another part of interoperable science is putting data into a format that can be reused. One of the things we're doing in our project is using frictionless data packages, which are essentially a sort of time capsule of goodness for data. We're trying to put the data sets into a package so that they can be reused by someone else without having to talk to me or the people that generated them. So we take information about the type of data and the units and put it in a structured JSON file, connect the metadata back to comma-separated values files describing what those metadata are, and then point to where the actual raw data sets exist. That means we can reconnect all of these pieces later and interoperate those data containers together as a community. A sketch of what such a package descriptor might look like follows below.
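Here is a minimal sketch, written in Python, of a descriptor in the general shape of Frictionless Data's Table Schema, saved as datapackage.json. The ontology identifiers and the unit annotations are illustrative placeholders of my own, not Planet Microbe's actual schema; the point is only that the data, its units, and its semantics travel together in one machine-readable package:

```python
# Minimal sketch of a Frictionless-style data package descriptor.
# Ontology IRIs and field names are placeholders for illustration.
import json

descriptor = {
    "name": "example-cruise-ctd",
    "title": "CTD bottle data from an example cruise (illustrative)",
    "resources": [
        {
            "name": "bottle-samples",
            "path": "data/bottle_samples.csv",  # the actual measurements
            "format": "csv",
            "schema": {
                "fields": [
                    {
                        "name": "nitrate",
                        "type": "number",
                        "unit": "micromolar",  # would point at a unit ontology term
                        "rdfType": "https://example.org/ENVO_nitrate_term",  # placeholder IRI
                    },
                    {"name": "depth", "type": "number", "unit": "meter"},
                    {"name": "latitude", "type": "number", "unit": "decimal degrees"},
                ]
            },
        }
    ],
}

# Write the time capsule: anyone who opens it later can interpret the CSV
# without talking to the people who generated it.
with open("datapackage.json", "w") as fh:
    json.dump(descriptor, fh, indent=2)
```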
What I wanted to show you next, which I can't because this doesn't have the video, is that one of the primary points of data integration is finding touch points between data sets, the pieces that are incredibly important for joining things together. In the world of ocean sciences, these touch points are depth, time, latitude, and longitude. By taking all of those data and making all of the metadata part of this common ontology, we can drive searches that let us look for ranges of latitude and longitude, ranges of depth, ranges of temperature, and find all of the samples that we might want to use and operate on for a given analysis.

So the next thing I wanted to tell you about... here we can go forward, maybe not, nope, back to you here. Yeah, it seems to have gotten jammed. This is what I was worried about. The funniest part is, I'm a data scientist and this just doesn't work. And the mouse isn't here. So let me just escape out and see if we can try again here.

In the interest of time, the next thing I want to tell you about is packaging code in a similar way to how we package data. One of the things my group uses is containers. Again, this is another time capsule of goodness, where essentially we're packaging the operating system, the code, and any dependencies or modules that are needed for that code, and putting it all into a container. What's useful about that container is that it can be run anywhere: on my laptop, on XSEDE resources, or in the cloud. So ultimately, what this project is really pushing and driving is creating modular components that are in and of themselves products. So even if, for example, I get hit by the Tucson streetcar, which I actually almost did recently, it was raining, what we're producing is all of these containers, data containers, Docker containers, pieces that can interoperate and be reused with one another in and of themselves.

An important piece of that is some code that we call the Appetizer. It creates a JSON file that describes all of the information that a particular piece of code needs to run: what are the inputs, what are the outputs, what are the parameters? What's fundamentally important about that, and about connecting it to a container, is that you now have a description of that tool that you can embed into a website, any website. We embed them into iMicrobe, but they could be embedded and reused by any source that is interested; a sketch of the shape of such a description follows below. The other piece is that we can connect not only the information about the inputs and outputs in the containers, but also real computing resources. We can take that architecture and make it run on Stampede2, which is an amazing cluster at the Texas Advanced Computing Center. That takes a lot of the know-how away from users who might not have those computational skills themselves; they can use this web-based format to run the tool and do all kinds of high-performance computing. In particular, my team works on a lot of algorithms that use really advanced architectures, such as Hadoop. Those are not really accessible to everyday people, but we can make them accessible through this format, by connecting up those containers with the necessary resources.
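As a rough illustration, and with the caveat that every field name here is my own stand-in rather than the Appetizer's actual output format, a tool description of this kind might look something like the following. A web front end can render a form from it and submit the job to an HPC system without the user ever touching a command line:

```python
# Illustrative sketch of a JSON app description for a containerized tool:
# which container image to run, and its inputs, parameters, and outputs.
# Field names and the image are hypothetical placeholders.
import json

app = {
    "name": "read-classifier",
    "version": "1.0.0",
    "container_image": "example/read-classifier:1.0.0",  # hypothetical image
    "inputs": [
        {"id": "reads", "description": "FASTQ file of metagenomic reads"}
    ],
    "parameters": [
        {"id": "min_quality", "type": "integer", "default": 20}
    ],
    "outputs": [
        {"id": "taxonomy_table", "description": "per-read taxonomic assignments"}
    ],
}

# A website can consume this description to build a submission form.
print(json.dumps(app, indent=2))
```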
All right, I'm going to jump, because I think the next slide has a video and I might jam up again, so we're going to start here. So let me sum up and tell you what I told you today. With data integration and tool integration, one of the key components of creating insight is really about moving towards the Cupcake Wars, towards a place where we create these objects, these containers, these data containers, that can interoperate with one another, can be reused across fields, and are also machine readable, so we can explore them using more advanced analytics.

I also told you about some important projects that are working towards making data FAIR. In the world of the microbiome, there's a lot of work being done. The DOE has put forth $10 million for a lot of those energy-based crops, to help us really start to put the microbiome data sets together in this organized way. My group was recently funded by the Gordon and Betty Moore Foundation for the ocean science data sets, and I know a number of other funding agencies are also stepping up to put money into this collective analysis of microbiomes in a way that meets FAIR data principles.

Then I told you about frictionless data packages. Frictionless data packages are amazing because we can package our data like we do our papers, really put some pride into them, and make them reusable for people who are interested in integrating data sets. And lastly, I told you about our adventures in the containerization of code. By containerizing code, we no longer have the problem that every postdoc has run into, where you can't access that operating system, you can't get the code to work, you download it from GitHub and it doesn't work. All of that frustration is gone once the code is containerized.

And with that, I just want to say thanks to my team, who do a lot of this work, and my collaborators at the University of Hawaii and also Lawrence Berkeley Lab. Thank you.