All right, so welcome. I'm Alicia. I'm on NMDC. Montana and Julia are here too; they're also on the NMDC project, so they can help field any questions. We're going to talk about exploring NEON data with NMDC tools. Can people use the Zoom reactions to say whether or not you know what the NMDC is? Montana, you know what the NMDC is? I had to find my button. Well, for those of you who don't, you hopefully will know more after. Has anyone here worked with NEON data before? Does your lab collect standardized metadata, if you do lab work? One of the main reasons NMDC exists is that reusing metadata is hard. So has anyone here ever tried to reuse someone else's data? Thumbs up or thumbs down? How did that go for people? We're getting a thumbs down. We're getting a surprised face. A cold face. So in general, it can be challenging to reuse someone else's data. That's part of what NMDC is hoping to make less painful.
So the agenda today: we're going to talk about what the NMDC is and about our data portal, which is one of our web UI applications. Then there'll be 30 minutes, probably 20 minutes to do it and 10 minutes to go over it, for a data portal activity where we'll look for data on the data portal. Then we will talk about retrieving metadata programmatically via the API. And then there's 30 minutes at the end for discussion, questions, and feedback. Since the group is pretty small, we could definitely adjust this if people have different interests, but this is the agenda that we put together. After the seminar, you'll be able to understand how NEON and NMDC are collaborating to improve FAIR data: what the NMDC offers, why FAIR metadata is important, and where to find NEON data on the NMDC data portal and the NMDC API. This is just kind of a reference slide. These are the two resources we'll be talking about today: the data portal, and then the API for programmatic access.
So now I'll talk a little bit about what NMDC is and what we're doing with metadata. The NMDC is a sustainable data discovery platform. We're really focused on FAIR data, so making data findable, accessible, interoperable, and reusable. We're three national labs that have come together to do this. We're in the fourth year of the project, and we're really focused on enabling people to find multi-omics microbiome data. The vision of NMDC is to connect data, people, and ideas to advance microbiome discovery, and we're doing that through infrastructure, standards, and community building. There are three product initiatives underpinned by an engagement strategy. Montana Smith is on the line, and she is the lead for the submission portal. That's a UI where you can enter information about your study and the biosamples that you've collected in a standardized format. Julia is online. She's the product owner of NMDC EDGE, which is a UI where you can run the standardized workflows. I'm the product owner for the data portal, and that's a place where you can access and discover microbiome information.
We have a couple of really key partnerships. We work really closely with the Joint Genome Institute and EMSL. Those are two Department of Energy facilities that generate a lot of the omics data: JGI generates a lot of the sequencing-based data, and EMSL generates a lot of the mass spec data. We have two engagement programs, the Champions program and the Ambassadors program, and we work with people to improve data standards. Sometimes these folks host workshops for NMDC tools. There can be funding to travel to conferences. So if you're interested in information about that, Julia can field any questions there. And then we have a number of key strategic partnerships. What we're going to focus on today is the partnership with NEON. Now I'll talk about our data portal and how we're using it for data discovery.
So metadata standards are important because it's difficult to reuse data from different labs if they're not using consistent field names, if measurements or units are unspecified, or if terms are not being used consistently. Part of what NMDC does is make terms, field names, units, etc. consistent. Data comes in and is formatted to be consistent via the submission portal, and then that data, which is now consistent, is discoverable within our data portal. This is just a screenshot of what the portal looks like, and we'll walk through some of the features to describe them in more detail so that you guys can do the scavenger hunt activity.
On the left panel is a faceted search option, so there are several pieces of information that you can search on. Some of the popular and very basic information that you could search on is information on the samples: things like the depth at which the sample was collected, when the sample was collected, and geographic location, either by a latitude/longitude search or by the location name. We also have a map feature where you can zoom in on different areas and then click "search this region," and that populates the latitude and longitude coordinates. We have a KEGG search for functional search, so you could enter the name of a KEGG term, or type in an enzyme name and then select from a dropdown. This returns metagenome, metatranscriptome, or metaproteomics records that have a hit to the KEGG term. There are two different systems that we use to support habitat or environmental information. The GOLD ecosystem path is a five-level system that's from JGI, and a subset of the samples do have this information populated. We also support a three-level system to provide environmental context terms. These terms are required by the Genomic Standards Consortium, which is a body that governs metadata requirements for genomics information.
And so we also require this, so it is available for every sample and you can search on it. This is really useful for searching across studies or subsetting data within a study. You can also search by what type of omics data you have, and there are a couple of different options here. On the left panel you can search by the omics type. This is going to be an OR query, so if you said you wanted metagenome and natural organic matter, it would be the superset of both of those types. On the right we have an UpSet plot, and this is going to be an AND query, so 134 samples have both metagenome data and natural organic matter data. Below the map and the collection date filters is information on the NMDC studies. You can click on this button here to filter to records that are part of this study only, or you can click on this arrow over here, and that leads you to a detailed study page, which has quite a bit of information.
So if we go there, this is an example of the study page. It has a title and a description. It has what type of omics data the study has and how many of each. There's information on who's in charge of the project, we have links out to some of the NEON resources, and who's on the team. There are persistent identifiers throughout NMDC, so we create identifiers for everything from studies, biosamples, and workflow runs to the files that get created as the output of workflow runs. There's information on funding, sample counts, related publications if there are any, and links to other resources. If you scroll down below the studies, you'll see the samples that are part of that study. Similarly, you could click on this and navigate to the sample details page, or you could click here, and that would expand and show you the output from the workflow results from the analyses of the metagenome raw data. So here's an example of a sample page. You can see basic information such as latitude and longitude.
You can also see when it was collected and what the environmental context terms are. Over here we have a field called biosample categories. We do have some studies within NMDC where the samples come from NEON sites but weren't collected by the NEON organization, so other PIs have come onto a NEON site and done some of their own sampling. You might be interested in all the samples that are from NEON sites, regardless of the omics type or of who collected the samples, and this would be one way to return those samples. NEON's soil sampling protocol pools samples at the plot level. There are 40 by 40 meter plots on the different sites, and they take one to three cores, divide them by soil horizon layer into mineral or organic, and then pool them. So what's listed here is this biosample, and then there are two other biosamples from that plot that got pooled together for the metagenomics analysis.
We can click to expand and see some of the workflow data. You can see what type of workflow this was, so here we're doing read quality filtering, metagenome assembly, and annotation, and we also do a read-based taxonomy analysis. This is supposed to make the data more accessible, since not everyone has access to the compute resources that are required to analyze metagenome data. And if you wanted to, you could put in whatever filtering you wanted and then download that data via the UI.
So I'll walk through an example of what some of the counts look like and what they mean on the data portal, so that things hopefully make a little more sense as we're doing the exercise. Here we've applied a latitude/longitude search for Hawaii, and 90 is the number of samples. The faceted search is always first going to return the number of samples. The number on the map up here is the number of metagenomics records that are associated with those samples, so the ratio of samples to records should be about three to one.
This is the number of samples that use that omics type or that combination of omics types, so in this example we only have metagenome results. Down below is the number of studies that have been done in this region. And then here, separated by each study and omics type, is how many metagenome records there are. This up here is how you clear your search filters: you just click on this icon, and that resets the data portal. This is going to be important in the data portal activity.
Now it's time for the activity. There's a QR code, or you can follow the link. This is the link to the data portal, and this is the link in the QR code for the questions that we're trying to answer. We're going to take about 20 minutes to do this, and then we'll go through the answers. Or we could go through it together; I know there's only a handful of people here. Does anyone have a preference for doing this together versus individually? It's a small group, so maybe it makes sense to do it together. Okay. So let's go. I'll pull up the questions. These are the questions, and I'll just toggle back and forth.
There are three NEON data products that have metagenomics: soil metagenomes, benthic metagenomes, and surface water metagenomes. NMDC is currently hosting the soil metagenomes and the benthic samples, and we're working on getting the surface water samples in. This exercise primarily focuses on the soil samples because that's what the science talk was on last month. So the first question is: how many samples are from the study entitled "National Ecological Observatory Network: soil metagenomes"? And this is the NEON data product ID. I'm just going to highlight this, go over here, and type it in to see if I can find the right study. So if I scroll down, it has selected this, and it's filtering for this study.
And so this is the number of samples: 4,443 samples are associated with this study. Some of the samples, I think, failed in library construction or whatever, so the number in the UpSet plot here is going to be a little bit different. There are 4,411 samples that are associated with a metagenomics sequencing record.
Can I ask a question?
Yeah.
What's that 1,560 number that appears with the study? Because it looks like that's not the number of samples.
This is the number of metagenome records. If we look back at what happens with the pooling, one to three samples, typically three samples per plot, get combined together for the sequencing.
Okay, so that's what this count is. So the system retains the fact that those came from multiple samples?
That's correct, yeah. So if you look at this record and you click on the sample, you can see this is the sample and this is the other sample that got combined with it. The way NEON names the soil samples is: the site, the plot, whether it's the mineral or organic horizon from the core, then the XY coordinates within the plot, and then the collection date. So you can see here that the site's the same, the plot's the same, and the soil horizon's the same. They have different coordinates within the plot, and they have the same collection date, so these two samples were pooled together.
Yep, and the DNA was extracted, a library was made, and then sequencing was done. Cool, thank you.
So here you can also see that this is the analyzed data. We picked up NEON's raw sequencing files, ran quality filtering, assembly, etc. So this is the sample that this page is for, and this is the other sample that's in there. It's listed by its NMDC identifier, but if you click on it, you get to the other sample's page.
This is the other sample page, and the two are linked. So it's kind of nice to be able to see how the records tie together. Any questions on that? Okay, so we have 4,443. That's the answer to that one.
Let's see. What type of omics data does NMDC have for this study? There are a couple of different ways we could look at this. We can just look here and see there are only metagenomes, or over here, metagenomes, or up here in the bar chart, where it filters and there's no match for the other omics types that we host. Or you could go down here to omics type and also see that there's only metagenome data.
The next question is: what instrument types were used to generate the data for that study? So we're going to go back here. One of the things that you can filter on is the instrument name. If we click on instrument name, it shows the different instrument types that were used. So it's been sequenced on three different Illumina platforms: NextSeq, NovaSeq, and HiSeq. These are the samples that were most recently sequenced at JGI. They are of much higher depth, so they have larger assemblies and are going to have more metagenome bins.
Next we're going to look up what the local environmental context term is for one of the NEON samples. So we scroll down and go back to one of the samples we were looking at. These are terms specified by the Genomic Standards Consortium, and it primarily uses the ENVO ontology, which isn't what NEON uses natively. So what we've done is create a mapping table between the National Land Cover Database terms that NEON uses and the ENVO terms. That makes the habitat descriptive terms interoperable with all the other samples that we have. So here the local environmental context is "area of evergreen forest." Everything's labeled as a terrestrial biome.
All of the soil samples are labeled as soil, and then their local environmental context term is going to be different depending on where the sample was collected. And then we learned about the related samples, so which samples are related to that sample. In this case there's one other related sample. It's the other one from the same plot.
The next question is to get the NMDC study identifier for this study, as well as the funding source, so we go over to the study page. There are several different ways you could do this. We could go back to the main landing page, or the study identifiers are listed here, and this is a link out to the study details page, so I'll click here. This is the NMDC identifier for this study, and this is the NSF grant number.
So here's an example of using some of the map and geographic location features: how many samples are from Puerto Rico? I'm going to go back and clear all of my search by clicking up here. I like to use the map feature, so I just zoom in on Puerto Rico. There are a couple of different places where we have samples on Puerto Rico, but I just say "search this region." And you can see that there are 364 samples from Puerto Rico, from two different omics types: we have metagenome data, and we also have natural organic matter data. If we scroll down, we can find out that these are from four different studies. GROW was a project to catalog watersheds globally, and there we have two metagenomes and 10 natural organic matter projects. This is a project to study carbon redox in a forest in Puerto Rico. And then these are samples from both the NEON benthic samples and the NEON soil metagenomes.
The next question was: how many of the samples from Puerto Rico are NEON soil samples? To do that, we click back over and then just click here to apply this study as a filter.
You can see it adds it as a filter up here, and our results drop down to 194 samples, so 194 soil samples from Puerto Rico.
All right, next we're going to look at some other studies, so we're going to look at this study, which looks at permafrost in Alaska. I'm going to go back and clear all the search terms, and then I'm just going to search for this study title. So this is our study selected. What type of omics data do we have? This is natural organic matter. We have 241. It's one to one here, the samples to the instrument data. And if we click on one of these samples, we'll be able to figure out whether or not it's from NEON. This is not a NEON data product. This is an individual PI, but they are doing sampling from a NEON site, and you can see that from the biosample categories.
And then the last question was to figure out how many samples have a local environmental context of "area of evergreen forest." So we go back to the main page and clear all of our searches. Then we go to the ENVO section, click on local environmental context, and put in the term. There are 1,266 samples that have this local environmental context, from a couple of different studies, which is just kind of fun. So that's the end of the training on the data portal. Were there any questions before we talk a little bit about the API?
This is great. I didn't realize that the samples that were collected at NEON sites, but not by NEON, were labeled that way. So that's awesome. Thank you.
Yeah, this is something that we did for essay. And actually, I'll show you: there's currently no way to get all of these via the UI, but I will show you how to do it with the API. All right, so I'll go through the API slides, and then we can play around a little bit with that. So back to the slides. API stands for application programming interface.
It's just an intermediate layer, so that you don't have to know what the technology stack is on the back end. A pretty common analogy online, from what I found, is that this is comparable to going to a restaurant and giving your order to the waiter: the waiter communicates with the kitchen, and then the waiter brings back your order. So it's just an intermediate layer that is consistent and that you can access programmatically. This is how you can get to the NMDC API; there's a web link and the QR code. The Swagger UI does have some auto-generated documentation that can be pretty minimal, so we've also put together some more user-friendly, written-by-a-human documentation with some examples that might help if you have questions. Depending on what kind of endpoint it is, the endpoints are color coded, and that's just automatic. GET endpoints are used to retrieve records, so that's mostly what we're interested in today. Internally, we also have some POST endpoints where we can upload information when we're adding records: biosample records, workflow records, etc. But we're going to focus on some of the GET endpoints today.
A couple of the ones that we might be interested in are the endpoints about studies and biosamples. These two endpoints here just take an identifier, and for these other two endpoints you have to know a little bit more about how you want to query. For the study endpoint, we can take the study ID that we got from the data portal exercise, put it in here, and get all the metadata back about the study. So you come in here, there's a button over here that says "Try it out," you put in the study ID, click "Execute," and it returns a JSON document. You could download that, or you could use curl, or the Python requests library, or R, or whatever your preferred choice is.
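As a rough illustration of the study lookup just described, here is a minimal Python sketch. The base URL and the `/studies/{id}` path are assumptions that should be checked against the Swagger UI, and the identifier shown in the usage comment is a hypothetical placeholder, not a real study. The standard library is used only to keep the sketch free of dependencies; the `requests` library mentioned in the talk would work just as well.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: public base URL of the NMDC API; confirm it in the Swagger UI.
BASE_URL = "https://api.microbiomedata.org"


def study_url(study_id: str) -> str:
    """Build the URL for the endpoint that takes a single study identifier."""
    # Keep the colon in identifiers such as "nmdc:..." unescaped.
    return f"{BASE_URL}/studies/{quote(study_id, safe=':')}"


def get_study(study_id: str) -> dict:
    """Fetch one study record and return the parsed JSON document."""
    with urlopen(study_url(study_id), timeout=30) as response:
        return json.load(response)


# Usage (performs a network request; the identifier is a placeholder):
#   study = get_study("nmdc:sty-00-000000")
#   print(json.dumps(study, indent=2))
```

Separating URL construction from the request makes it easy to paste the same URL into curl or a browser to compare with what the Swagger UI returns.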
So here's an example that's a little bit more complicated. part_of is how we say that biosamples come from a certain study. So if we were looking for all of the samples that came from the NEON soil study, this is what the query would look like for this endpoint: you would say part_of, colon, and then the NMDC study identifier. You get back a JSON document that has some metadata and then some results. This says what the filter is. This is how many samples there are, and this matches what we found in the data portal. By default, it's going to return 25 records; you can change this if you want. So you might have to do some paging, and you'll see an example of that in the Jupyter notebook that we'll go over. The results section is where the first 25 records are, and then you would have to keep fetching the rest.
So this is what I was talking about. This is if we wanted to look at a different study, or just in general look at any samples that come from NEON sites. On our end, you would use the field name biosample_categories and a value of NEON to do that. Here I've intersected it with a search on another study, so I got the identifier for the 1000 Soils research campaign. This was a project out of EMSL to catalog some soil samples across the US. It was a pilot study for a new call that they have called MONet, which is to catalog soil samples across the US. You could do this as a combined query or just use the second part of it, and the way to do a more complicated query here is to connect the parts with a comma. So here we're looking for samples that are part of this 1000 Soils study and that have a biosample category of NEON. This would return the records just from this study. If we just put in the second half, it would retrieve any samples from any studies that are from NEON sites.
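The comma-joined filter and the paging described above could be sketched in Python roughly as follows. This is a sketch under stated assumptions, not the notebook's actual code: the `/biosamples` path, the `filter`, `page`, and `per_page` parameter names, the underscored field names, and the base URL are assumptions to verify in the Swagger UI, and the study identifier in the usage comment is a hypothetical placeholder.

```python
import json
from typing import Callable, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: public base URL of the NMDC API; confirm it in the Swagger UI.
BASE_URL = "https://api.microbiomedata.org"


def build_filter(**conditions: str) -> str:
    """Join field:value pairs with commas to form a combined query."""
    return ",".join(f"{field}:{value}" for field, value in conditions.items())


def make_fetcher(filter_text: str, per_page: int = 25) -> Callable[[int], dict]:
    """Return a function that requests one page of biosample records."""
    def fetch_page(page: int) -> dict:
        # Assumed parameter names: filter, page, per_page.
        query = urlencode({"filter": filter_text, "page": page, "per_page": per_page})
        with urlopen(f"{BASE_URL}/biosamples?{query}", timeout=30) as response:
            return json.load(response)
    return fetch_page


def iter_records(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every record by requesting pages 1, 2, ... until one is empty."""
    page = 1
    while True:
        results = fetch_page(page).get("results", [])
        if not results:
            return
        yield from results
        page += 1


# Usage (performs network requests; the study identifier is a placeholder):
#   combined = build_filter(part_of="nmdc:sty-00-000000", biosample_categories="NEON")
#   samples = list(iter_records(make_fetcher(combined)))
```

Passing the page-fetching function into `iter_records` keeps the paging loop independent of the HTTP details, so it can be tried against canned pages before pointing it at the live API.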
The NMDC schema is written in a framework called LinkML, and the terminology is a little bit different, so I'll talk about that a little. There is some really nice documentation, so if you were using the API and you get your JSON back and you're not sure what some of the field names mean, you could come in here to look them up. You come to the repository's documentation page and type in what the field was, and then you can learn more about it. The LinkML nomenclature is "slot," and that maps to what you would call a field in a JSON document. Here I pulled up the documentation about pH so we can see what this looks like. This is a JSON record for a biosample from 1000 Soils that's from a NEON site. It has a pH of 6.5. So let's say we're interested in learning more about how NMDC defines pH. This is a pretty basic field, but it's illustrative. Here is the slot name, pH, and below that is the description for the slot. "Applicable classes" tells you what type of records the slot applies to, so here it applies to biosamples. And you could look up whether there are any restrictions on the value under "Properties," so maybe this needs to be a Boolean, or an integer, or a string, or whatever.
So now we'll hop over and look at putting together the API with Python and a Jupyter notebook. The question that we're going to try to answer is: do the mineral or the organic soil horizon samples have a higher pH at the Wind River Experimental Forest, which is in Washington? I put a tiny link here. These Jupyter notebooks were developed internally by NMDC staff. There's not enough time today for us to write this together, so this is mocked up, but it gives you some idea of how you could use the API and Jupyter notebooks.
I just wanted to note that you could also do this with the NEON R packages, but we wanted to illustrate how you do this using NMDC directly. There are going to be some things, like the workflow results, that are hosted in NMDC and have been pushed over to JGI, so there are some things where the NEON R packages currently wouldn't be able to provide the information. But for this you could definitely use the R packages, so this is just illustrative.
So we import a bunch of packages, including the Python library for making API requests. This is what your API request would look like. This is the base API URL. This is what you're filtering on: instead of the study name, you're filtering on the NEON data product ID. And then here we're looping through. Remember I was saying that there's a maximum number of records that get returned, so we're doing this in a while loop to fetch all the records. Then we're parsing through and getting the geographic location name. If we scroll down a little bit, we get to the map. This is pre-populated, but it's rather complicated. If we zoom in on Washington, there are two sites, so we'll have to click and see which one is Wind River. This one is Abby Road; that's not the site we want. This one is it. So this is the site that we were interested in, Wind River Experimental Forest. We've plotted the pH over time, so by the collection date, and colored it by the soil horizon. Here you can see that in general the mineral horizon samples have a higher pH. Any questions?
Okay, let me switch back over. So here I was able to export that plot and copy it in. But you could look at other sites. You can do whatever you want in the Jupyter notebook. This is a link to the NMDC repository that has some of these notebooks.
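The horizon comparison from the notebook walkthrough can be illustrated with a small Python sketch that groups records by soil horizon and averages their pH. It is a sketch only: the field names `ph` and `soil_horizon` are assumptions about the record layout, and the example records are made-up values for illustration, not NEON measurements.

```python
from collections import defaultdict
from statistics import mean


def mean_ph_by_horizon(records: list) -> dict:
    """Average pH per soil horizon, skipping records missing either field."""
    groups = defaultdict(list)
    for record in records:
        ph = record.get("ph")                 # assumed field name
        horizon = record.get("soil_horizon")  # assumed field name
        if ph is None or horizon is None:
            continue
        groups[horizon].append(float(ph))
    return {horizon: mean(values) for horizon, values in groups.items()}


# Made-up records for illustration only; not real measurements.
example_records = [
    {"soil_horizon": "M horizon", "ph": 5.0},
    {"soil_horizon": "M horizon", "ph": 6.0},
    {"soil_horizon": "O horizon", "ph": 4.0},
    {"soil_horizon": "O horizon", "ph": 5.0},
    {"soil_horizon": "O horizon"},  # no pH recorded, so it is skipped
]

print(mean_ph_by_horizon(example_records))
```

A per-horizon mean is only a quick summary; plotting pH against collection date, as in the talk, shows whether the difference holds across sampling dates.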
So if you wanted to look at the Jupyter notebooks that are pre-run, you could do that here, or you could open them in a Google Colab notebook yourself. If you wanted to make edits, you would have to clone the repo and go from there. So that's all that I had. Now we have time for questions and discussion. We do have a survey that we hope you'll take, just a little bit of feedback to let us know what you learned, what you thought was valuable, and what else you'd like to see from the NMDC-NEON collaboration. So let's see, there are a couple of people here from the NMDC team. Montana's here; I mentioned she's in charge of the submission portal. Julia is in charge of NMDC EDGE, and Catherine and Bryn did most of the Jupyter notebooks. So any of us could field questions.