All right, good afternoon. Let's go ahead and get started. This workshop will be presented by Marcel Ramos. He's a senior data scientist at CUNY, the City University of New York, and he's been a member of the Bioconductor core team for about five or more years. So Marcel, please take your time. Thank you, Mikhail. I'm having some technical issues with my Mac. Thanks everyone for coming, and I hope you're having a good time at BioC and that you had a good lunch. Let's get started with our workshop. I'm going to go to a tiny URL website, so that's tinyurl.com slash multi-assay workshop. So I've been working with Bioconductor for about five years now, and I've developed a number of packages, including MultiAssayExperiment, curatedTCGAData, and TCGAutils, among others that are in the works and are not part of this workshop. This was my first project when I came into the Waldron lab, and it was a challenge to integrate these different data types that I knew nothing about. Thanks to the support from the working group, Vince Carey, Martin Morgan, Levi Waldron, and others who provided input, we were able to get a data structure that would coordinate different data types into a single container class. So if you want to access these slides, you can go to this URL, tinyurl.com slash multi-assay workshop, and we'll be using the Orchestra instance, which is linked in the third slide. I now have a pkgdown website built from this workshop, and you can see the instance here under the orchestra.cancerdatasci.org website. So I'll spin that up myself just to show you how you could get started. I don't know if I can log in with email; I'll try my Google account, I guess. Hopefully you all have access to Orchestra by now, but this is one of the perils of working with someone else's computer. Sorry about that. While that's loading, I'll direct you to our webpage for the workshop.
So this is the main page where I have a description of what we're gonna cover, as soon as we can get to it. Otherwise, I can just show you the built page. Okay, so what I'm gonna cover are some of the packages that use the MultiAssayExperiment framework to ship mainly TCGA data, and also data from the cBioPortal for Cancer Genomics. And then I will cover the main data classes that are used throughout those packages, mainly SummarizedExperiment, MultiAssayExperiment, and RaggedExperiment. So as you may have gathered from the conference, Bioconductor uses a framework that's integrated across packages to represent and analyze data. SummarizedExperiment is one of those major containers that we use to represent and analyze data, and I'll show you how to integrate all of those with MultiAssayExperiment. So some of the packages that we'll use are listed here, and I will go over what these packages do in summary, go over some of the classes that are involved, and then we'll try a tutorial and build a MultiAssayExperiment from scratch. And I'll introduce some of the data-centric packages like curatedTCGAData, cBioPortalData, and terraTCGAdata, which is a more recent addition for providing datasets on the Terra infrastructure. And then we'll practice a bit on how to work with the MultiAssayExperiment data class, how to manipulate it, and how to carry out some subsetting operations and some reduction operations towards the end. So I'm going to go into this tutorial button at the top of the webpage, and actually, sorry, it's the reference button. Oops, it's the main workshop button, sorry. This main workshop button is the meat of our workshop here, and let's get right to it. I also have the link to the Google Slides in case you missed that, and we have some options for running this workshop. Mainly, we're using the Orchestra webpage. So yeah, this URL? Yeah, tinyurl.com slash multi-assay workshop.
Okay, so now I have access to Orchestra, so I'll show you how to launch it from there, if you haven't already. I'll just search for "multi", and then on the bottom, obviously we have some people that have beat me to it, so click the launch button and wait for that to load. Hello. So, an overview of the packages: here I have a table. MultiAssayExperiment, like I said, is the main container class for handling different types of omics data, like mutation data, copy number data, methylation data. It's really flexible in what you can put into this container, and the MultiAssayExperiment package is the main infrastructure package. curatedTCGAData is the product of a pipeline that we've worked on at the Waldron lab that takes the TCGA datasets, which are mainly hg19 data, processes that data, provides some curation to it, and re-exports the data through ExperimentHub. So it gives you access to about 33 different cancer data types, and they're all in MultiAssayExperiment format, so it makes it really easy to take TCGA data and import it quickly onto your laptop for analysis. terraTCGAdata is our new addition that uses the Terra cloud service, the AnVIL Terra cloud. The Terra platform has some pre-packaged TCGA data; I think it's also hg19. There are some datasets that are GRCh38, but that one we're still working on; it's not as easy to import as the other hg19 data. But it allows you to work within that platform and import the data as a MultiAssayExperiment relatively easily. And then cBioPortalData will allow you to pull over 300 datasets from the cBioPortal for Cancer Genomics. We did this in collaboration with Memorial Sloan Kettering Cancer Center. They revamped their API, and we are able to download their data based on what the user requests, what data types they want or what studies they're interested in. And we'll go over how to do that within this workshop.
TCGAutils, as you may have seen in the sticker, is more like the tools that will help you work with the MultiAssayExperiment container. So I like my stickers, I think they're pretty cool, and from the TCGAutils sticker you can get the idea of what we're trying to do here: MultiAssayExperiment is the integrative burger, curatedTCGAData dishes out all of the sliders, and TCGAutils helps you cook things up. So that's TCGAutils in a nutshell. And SingleCellMultiModal allows anyone who is interested in analyzing their data as a MultiAssayExperiment to pull single-cell data from ExperimentHub. This is a project where we've collaborated with other labs to have their datasets published and made easily available through ExperimentHub. How many people know what ExperimentHub is? Okay, so ExperimentHub is a way to publish your data in Bioconductor. So if you worked on some data and you'd like other researchers to make use of it, a good idea is to put your data on ExperimentHub so that other people have access to it. It works currently on the AWS cloud, but we're making that more distributed within Bioconductor. But yeah, the gist of it is to publish your data on the cloud and have other researchers download it via the ExperimentHub package. So now I'll describe what the MultiAssayExperiment data class is. It's modeled after SummarizedExperiment, and SummarizedExperiment takes gene expression data as a matrix and integrates that with the different aspects of the data. We do the same thing, but with different data types. So you could have mutation data and expression data in the same container, and they could be of any shape or size. As you can see on the bottom left here, we have what we call the experiments, and these are different shapes.
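As a minimal sketch of how ExperimentHub works in practice, you can query the hub for resources by a metadata term; the query string here is just an example:

```r
## Browsing ExperimentHub; the query term is just an example
library(ExperimentHub)
eh <- ExperimentHub()
query(eh, "SingleCellMultiModal")  # resources whose metadata match the term
## a single resource downloads (and caches) with eh[["EH...."]] using its ID
```

The first call builds a local cache of the hub's metadata, so subsequent queries are fast and downloads are only fetched once.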
So you can see that some of them may have more observations or more rows, and some other datasets may have more columns, and that works perfectly fine with MultiAssayExperiment. In the experiments, the representation is that each row is a feature and each column corresponds to a sample. Then the colData aspect of the MultiAssayExperiment allows you to link the sample data with any patient data that you may have. And the sampleMap, lastly, coordinates everything in the container so that things are traceable and it makes it easier to subset the data in one shot. So cBioPortalData, as I mentioned earlier, has this API interface with the cBioPortal website. It downloads data; it doesn't use ExperimentHub, because it's mainly downloading data from cBioPortal, but it does use caching, so that helps with re-downloading and overuse of their service. We're grateful to be able to use and download their data. And for the interface, we take a lot of care in how we develop our software; we want it to be accessible and user-friendly. So we have about two main functions for cBioPortalData, depending on what kind of data you want to download. But if you wanna see what studies are available via cBioPortal, you should look into the getStudies() function to list all of those. So curatedTCGAData comes from the Cancer Genome Atlas and our pipeline. It makes use of RTCGAToolbox, which is a package that downloads data, I think, from GDAC directly. And then GenomicDataCommons is a package that we also support to some extent. These are alternative tools for getting TCGA data, but they're not as integrative as curatedTCGAData. So there are many ways to get TCGA data, but our package provides an integrative representation that makes it easy to get started with your analysis pretty quickly. So we have some reference vignettes in this website.
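The study listing mentioned above can be sketched like this; it needs a network connection to the public cBioPortal API:

```r
## Listing available cBioPortal studies with cBioPortalData
library(cBioPortalData)
cbio <- cBioPortal()          # connection object for the public API
studies <- getStudies(cbio)   # one row of metadata per study
head(studies[["studyId"]])    # the IDs used to request downloads
```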
If you scroll to the top and go to the other vignettes, we have the listed studies that are available via the curatedTCGAData package, and we also have an explanation of what omics types are provided in the package. And here I added some notes about terraTCGAdata. The main website for the Terra platform is app.terra.bio. You do need to sign in with your Google account, and you do need a credit card because it's a paid service. But you can select the latest RStudio and Bioconductor image, thanks to Nitesh, who's been working really hard to keep those up to date. So you can use that image to get started, and then make sure you authenticate to Google Cloud; you can check that with the AnVIL gcloud_exists() function. And that should get you 90% of the way there. And then from there, you can use findTCGAworkspaces() to list what workspaces have TCGA data. A workspace in Terra is sort of like an instance that has pre-populated data, so these workspaces have the different TCGA datasets preloaded. You can explore which workspaces you want to use with this findTCGAworkspaces() function, and then start to download and explore the data with the terraTCGAdata() function. And then we have SingleCellMultiModal. So single cell has exploded in popularity quite recently, and we have some collaborators that provided scNMT datasets and 10x multiome data for analysis. It includes seqFISH, CITE-seq, and SCoPE2, among others. So if you're interested in those data, you can check out the SingleCellMultiModal package. We do provide them in HDF5 format and Matrix Market format. And last but not least, we have the TCGAutils package, which allows you to manipulate these MultiAssayExperiment containers. So here we have a schematic of how all of these packages, more or less, are working together.
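The Terra steps above can be sketched roughly as follows; this only runs inside a Terra RStudio session that has been authenticated to Google Cloud, and the import-step details are in the package vignette:

```r
## Sketch for a Terra RStudio session; requires Google Cloud authentication
library(terraTCGAdata)
AnVIL::gcloud_exists()    # TRUE once the session is authenticated
findTCGAworkspaces()      # table of Terra workspaces holding TCGA data
## terraTCGAdata() then imports a workspace's clinical and assay tables
## as a MultiAssayExperiment (see the terraTCGAdata vignette for arguments)
```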
You have the Cancer Genome Atlas here, which went through the Broad Institute's GDAC Firehose pipeline, then through RTCGAToolbox, and then through our pipeline to process that, and then we use MultiAssayExperiment to redistribute the data. And you can see that cBioPortalData takes data from cBioPortal at MSKCC and others. So I'll continue here and cover the major data classes that are involved. How many people are familiar with SummarizedExperiment? Okay, a few of us. So SummarizedExperiment has a similar structure to MultiAssayExperiment, right? This one came first, and it allows you to have some row annotations that correspond to genomic coordinates. So you have the features, which can be genes, and you can have a GRanges object to represent those genes. Same thing with the colData: if you have patient data, that's where that would go, and it has some facilities for metadata as well. So this is a major Bioconductor class, and I recommend everyone get familiar with it. It's used a lot in several packages, and I think the reason Bioconductor is so integrative is because of these data classes, like SummarizedExperiment, that make it easy to move your data across packages in a cohesive way. And then more recently we've had the SingleCellExperiment data structure, which is similar to SummarizedExperiment and works mainly for single-cell experiment data.
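The pieces just described (assay matrix, row annotations as ranges, colData) fit together in a small constructed example; the gene names and coordinates are made up for illustration:

```r
## A minimal SummarizedExperiment: assay matrix + rowRanges + colData
library(SummarizedExperiment)
counts <- matrix(1:8, nrow = 2,
                 dimnames = list(c("geneA", "geneB"), paste0("sample", 1:4)))
rr <- GRanges(c("chr1", "chr2"), IRanges(c(100, 200), width = 50))
cd <- DataFrame(condition = c("tumor", "tumor", "normal", "normal"),
                row.names = colnames(counts))
se <- SummarizedExperiment(assays = list(counts = counts),
                           rowRanges = rr, colData = cd)
se   # a RangedSummarizedExperiment
```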
And then we have RaggedExperiment, which is our main data representation for copy number data and mutation data, such as the data that you would find in VCF files, or other ragged array schemas for genomic location data. So it's a similar representation to GRangesList, but it gives you a matrix view of the data so that it fits nicely with the requirements for MultiAssayExperiment. And I didn't mention the requirements for MultiAssayExperiment, but the basic requirements for putting a data class into a MultiAssayExperiment are that it needs to have dimension names, so you need rownames and colnames, and the class needs to have a bracket method so that you can subset the data within your rectangular data structure. So they're pretty simple requirements, and that's why MultiAssayExperiment is so flexible, even with newer data classes such as SpatialExperiment and SingleCellExperiment. So, more about RaggedExperiment. We have these main methods for working with the data that's represented as a GRangesList. You can take the data and make it sparse with sparseAssay, or find any ranges that are the same across the different samples and make those a little bit more compact with compactAssay. Or you can have a window of interest, a region of interest that you can reduce your data to with qreduceAssay; say, if you have a set of genes that you want to reduce your data to, you can use that to make it more compact. And then lastly we have the disjoinAssay method, which splits your data by all of the ranges found in the data class. So the secondary class that we created, called MatchedAssayExperiment, is similar to MultiAssayExperiment, but the only catch is that all patients have to have a sample in each assay. So MultiAssayExperiment is a little bit more relaxed, in that you can have any number of samples per patient, but MatchedAssayExperiment requires everyone to have at least one sample in each assay.
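A tiny constructed example shows the matrix view; the ranges and scores here are invented for illustration:

```r
## RaggedExperiment: per-sample GRanges with a matrix-like interface
library(RaggedExperiment)
s1 <- GRanges(c(a = "chr1:1-10:-", b = "chr1:6-14:+"), score = c(3L, 4L))
s2 <- GRanges(c(c = "chr1:1-10:-", d = "chr2:11-18:+"), score = c(1L, 2L))
ra <- RaggedExperiment(sample1 = s1, sample2 = s2)
dim(ra)                    # ranges by samples
sparseAssay(ra, "score")   # one row per original range, NA where absent
compactAssay(ra, "score")  # identical ranges collapsed into a single row
```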
So it's good if you're only interested in people who have an observation in each assay. And what I mean by assays are these squares, the data types: mutations, methylation. So if your requirement is to have all of your patients have data in all of those data types, then you could use a MatchedAssayExperiment to represent that. You can create that by coercion, using the as() method and the class name, or you can use the constructor function on an existing MultiAssayExperiment. So now let's build a MultiAssayExperiment from scratch, and one way you can do that is by using our interactive demo that we have up in the tutorial section of the website. If you scroll up on the tutorial top page, we have "build your first MultiAssayExperiment", and this embeds a Shiny app. I'm not sure why it's grayed out, but it looks like it's working. So this Shiny app is sort of like a tutorial, which is really neat for working through workflows like this. I'll try to refresh and see if that goes away. Okay, so as you go through this tutorial, you can learn more about how to work with MultiAssayExperiment, and it has some interactive coding that you can do in this Shiny app. So here it says we want to use the experiments() function to extract, or take the experiment data out of, our miniACC dataset. So the first thing we would do is load that dataset by typing data("miniACC") and then maybe print it. So that should run the code. And if not, we can go to our Orchestra session and try that there. So this app connects to our in-house Shiny server, and we've never really tested it for a workshop, so we'll see if it works or not. Okay, it looks like it's taking a bit, so I'll just launch the workshop on Orchestra. The password is "rstudio", although that might also have to change with their new name. Okay, so now I'm on Orchestra.
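The two routes to a MatchedAssayExperiment mentioned above look like this:

```r
library(MultiAssayExperiment)
data("miniACC")
## coercion: keeps only patients with a sample in every assay
matched <- as(miniACC, "MatchedAssayExperiment")
## equivalently, via the constructor on an existing MultiAssayExperiment
matched2 <- MatchedAssayExperiment(miniACC)
```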
I have access to all the files in the workshop. I'm gonna go into inst, tutorials, cBioPortalData, and you can see the path here on the right-hand side: home, inst, tutorials, cBioPortalData. So I'll try to spin this up; I think this has Shiny installed properly. Yeah, so this does create a pop-up, so just click Always Allow Pop-Ups and then try again. So now we have our little app here, directly connected to our Orchestra instance. So I picked the wrong one, sorry. Yeah, you guys should have told me; this is just a comedy of errors today. Okay, so now we're in the right one, and hopefully things will go smoothly. Okay, so we can run the code to pull the miniACC from the package. This is an example dataset included in MultiAssayExperiment, from the adrenocortical carcinoma dataset from TCGA, and it has a limited number of observations in the data. And as you can see, or now you should be able to see, we have several different assays stored in this miniACC: we have RNA-seq data, GISTIC copy number, reverse-phase protein array data, mutations, and a miRNA-seq dataset. And you get a description of how many rows are in each dataset and how many columns, or samples, are in each of these. We say rows and columns because MultiAssayExperiment is a pretty flexible container, so it doesn't have to be genes and samples; it could be really anything that meets those basic rectangular requirements for the data. So now let's run experiments() on the miniACC dataset. We type in experiments, and this has some nice auto-completion here that we could use, and then we rerun this code. So now you can see that when you do experiments() on the miniACC, you extract the actual data within the MultiAssayExperiment. That kind of removes the colData, because the colData is in another part of the object, so this only gives you the data that was entered for each of those assays. So if you click solution, you'll see the code.
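In script form, those first two steps are:

```r
library(MultiAssayExperiment)
data("miniACC")       # small adrenocortical carcinoma example dataset
miniACC               # prints the assay names with rows x columns
experiments(miniACC)  # the ExperimentList only, without colData/sampleMap
```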
So this is pretty nice and interactive for displaying our tutorial here. The next part of the MultiAssayExperiment container is the colData, which includes phenotype information like survival time and all of those clinically relevant variables. So for the interface, we've tried to make it as simple as possible: if you do colData on miniACC, that should also work; I guess it doesn't keep that miniACC in the frame. So this should work. So then, with the colData extractor function, you'll get to see all of the clinical variables like years to birth, vital status, days to death. So it's really easy to work with the phenotypic data, but you can also use it to divide your data. Say you're interested in a particular variable, say you want T4 pathology stage: you can do miniACC$pathology_T_stage == "t4", and this will access that column in the colData and find all of the patients that had T4 stage. And you'll see at the bottom here that you get something you can work with to subset your data, and down the line we'll go over how to subset your data with these logical vectors. So I think we need miniACC here as well. Say you want to extract the race variable: you can do miniACC$race, and that will allow you to pull that variable out, and you could even do a table on that to get a tally of all of the categories in there. And then, similar to what SummarizedExperiment provides, we have this assays() function, which allows you to extract, in a slightly different format, all of the data within the MultiAssayExperiment. So if we run that, we'll see what that looks like. So this is a SimpleList, and you can see that it's a list of matrices, and maybe we can take the first one just to have a look; I can try head on this, maybe, and the first four columns.
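Put together, the colData and assays accessors from this part look like:

```r
library(MultiAssayExperiment)
data("miniACC")
colData(miniACC)[1:4, 1:4]           # clinical variables, first few
table(miniACC$race)                  # $ reaches into the colData
miniACC$pathology_T_stage == "t4"    # logical vector, usable for subsetting
head(assays(miniACC)[[1]][, 1:4])    # first assay as a plain matrix
```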
So you can see that for that first assay we have the actual data as a matrix, and we can do class() on that just to see what type of data we're working with. So this is a standard matrix. This is helpful in case you want to take this data and analyze it with some tool that accepts only matrices as input. And then we have the sampleMap() function, which is useful for keeping track of all of our samples and patients in the MultiAssayExperiment. So we're gonna start off with data("miniACC") and then do sampleMap(miniACC) to have a look at the structure. You can see that this is a DataFrame representation of all of the data that's included in the current MultiAssayExperiment. You have the assay name, and this is important: every assay in your MultiAssayExperiment has to have a name so that we can track it and make it easy to subset by observation or by columns. The second column here is "primary", which corresponds to the patient ID in TCGA, and the "colname" is the sample ID here. Most of the time you won't need to work with the sampleMap; all of the operations that you do directly on the MultiAssayExperiment will modify the sampleMap in the background, but it's good to know how to access it in case you do need to work with it. So let's try some trivia now. What function do you use to extract experiments from miniACC? You can shout either one, two, or three. Two, right? Yeah, good, okay, yeah. And one common mistake is that you may confuse the class name with the extractor function. If you do ExperimentList with capitals, that's the constructor function for the ExperimentList class. So make sure you know the difference between the two: one is to construct an ExperimentList, the other is to extract the actual data from a MultiAssayExperiment. And lastly, we'll go over the general constructor function. So there are three components for this MultiAssayExperiment.
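The sampleMap accessor shown above, with its three-column layout:

```r
library(MultiAssayExperiment)
data("miniACC")
sampleMap(miniACC)
## DataFrame with three columns:
##   assay   - name of the experiment the sample belongs to
##   primary - patient ID (matches rownames of the colData)
##   colname - sample ID (matches colnames within that assay)
```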
The list of experiments, the colData, and the sampleMap. So we can sort of combine the extractors and the constructor function into one call to make sure that we know how to use the MultiAssayExperiment constructor. These are the component pieces that make up the MultiAssayExperiment, as outlined up here. So we simply say experiments =, just so that we're clear on what the argument names are, colData =, and then sampleMap = sampleMap. If you don't have a sampleMap, we do provide facilities to create one; or, if your patient IDs and your sample IDs are the same, then the sampleMap will get generated automatically, so you don't have to worry about it. But if they are different, then we have some helper functions to allow you to create a sampleMap so that you can pass it to MultiAssayExperiment and get started. And I think the only thing we need here is the data, the miniACC. Okay, so what we've done here is we deconstructed and reconstructed our MultiAssayExperiment, sort of like avocado toast at the restaurant: you get it deconstructed. So we've shown that it's quite simple to create a MultiAssayExperiment from these component pieces. Any questions so far? So we have worked on a sort of cheat sheet, a spreadsheet where we list all of the functions that are helpful when working with MultiAssayExperiment containers. So feel free to reference this cheat sheet. It's also available on our landing page for the MultiAssayExperiment repository on GitHub, and it will give you an outline of what all these functions do. They're broken down by categories like constructors, accessors, subsetting, and then management: working with missing data, finding the complete cases across all of the experiments, finding technical replicates and removing them or merging them, intersecting rows across experiments.
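The deconstruct-and-rebuild exercise in one call:

```r
library(MultiAssayExperiment)
data("miniACC")
## rebuild the object from its three extracted components
mae <- MultiAssayExperiment(
    experiments = experiments(miniACC),
    colData = colData(miniACC),
    sampleMap = sampleMap(miniACC)
)
mae
```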
So if you have the same set of genes of interest that you want to find across all of your experiments, you can do that. And then we have a helper function, prepMultiAssay(), which allows you to get set up and troubleshoot any issues that you may come across when constructing the main representation. And then we also have functions to reshape the data. These are good for when it comes time to plot your results or the data that you have; converting it into long format is usually the way to go if you're plotting with tools that require that shape. And we also have some combining functions, like the c() function, to add experiments to an existing MultiAssayExperiment. We have an example here of how you would concatenate: if you wanted to take the log of a particular assay already in the miniACC, you could do that and then add it back to the miniACC with the c() function. And this mapFrom argument tells the software that this dataset has the same dimension names, the rows and columns, as the first assay in miniACC. So it saves you time: you don't have to rebuild the sampleMap yourself; it'll do it for you. And now, if you take that miniACC and look at the experiments, you'll see that the new assay has been added to the bottom of the ExperimentList. So we looked at the ExperimentList and the metadata. One note about metadata: it's hard to pin down because it can be varied, so what we use is a simple list for the metadata. So, for what we have here, you might have a PMID or a source URL; you can add things to the metadata that are unstructured and have that move along as you go along with your analysis. Okay, so now let's talk a little bit about curatedTCGAData, and I want to run these interactively.
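The concatenation example described above, written out (the +1 inside the log is just to avoid log of zero in this sketch):

```r
library(MultiAssayExperiment)
data("miniACC")
## add a log-transformed copy of the first assay; mapFrom = 1L reuses
## assay 1's sampleMap entries, so no new map has to be built
mae <- c(miniACC,
         log2rnaseq = log2(assays(miniACC)[["RNASeq2GeneNorm"]] + 1),
         mapFrom = 1L)
names(experiments(mae))   # the new assay appears at the end
```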
So I'll go to the RStudio server on the Orchestra platform, and what we can do is type help(package = "MultiAssayExperiment"), and that will pull up the documentation for the package. Then you can click on these user guides, package vignettes, and I think you click on the first link, and this will, this might not be it, it should be this one. Okay, well, we can, I guess, copy and paste here, or go over the actual source code. Not sure why that's not there, but maybe I'm not looking in the right location. Okay, so let's go to the, I'll pull out the outline here and go to where we left off, sorry for the scrolling. So here we are with curatedTCGAData. This is an experiment data package on Bioconductor, and it allows you to pull the TCGA data that we've uploaded to ExperimentHub. So with one line you can get all of the data that you want, and list it. First we'll run library(curatedTCGAData) and then run that line that says "show me what's available for this cancer type". That's ACC here, and what you'll get is a table of all of the data that's available for this cancer type, along with a file size overview of how big the data is; for example, methylation is a bit big. And then you'll get some more information about what the data types are for each of these. So it allows you to pick the things that you want and the things that you don't want, right? Say we wanted this set of assays: you can include them as such and then run this code. What it does is it goes into ExperimentHub, and if it doesn't have this already cached, then it will download it from the ExperimentHub service, and what you get back is a MultiAssayExperiment. So this is the actual data; we were working with miniACC, which is a toy dataset, but this one has the actual data from TCGA.
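A sketch of that two-step pattern; the version string and assay names are examples, and the real download requires a network connection to ExperimentHub:

```r
## curatedTCGAData: list first, then download
library(curatedTCGAData)
curatedTCGAData(diseaseCode = "ACC", version = "2.0.1")  # table of available assays
ACC <- curatedTCGAData(
    diseaseCode = "ACC",
    assays = c("RNASeq2GeneNorm", "CNVSNP", "Mutation"),
    version = "2.0.1",
    dry.run = FALSE    # actually fetch from ExperimentHub (cached afterwards)
)
ACC
```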
So you can see that the number of rows is significantly increased, and we got what we requested, and it took maybe, I don't know, 20 seconds to get that from ExperimentHub. So it's really quick, and you can get started with exploring what's in that container. You can do colData(ACC) and maybe get the first one to four columns, and you'll see that we have the clinical data as well, alongside the datasets that you requested. So I think we've come a long way, but there's still more ground to cover, right? When we first started working with cBioPortal, at a talk that we gave, someone was really grateful, and they were like, "you should have had this when I was doing my PhD; you don't know how much work I did to clean my data and integrate it." So we're saving researchers a lot of steps by already having everything integrated and connected, so that you don't have to spend 80% of the time cleaning the data. So I'd say that MultiAssayExperiment is pretty useful in that way, and it's in the top 10% of most-downloaded Bioconductor packages, so it is useful to some extent. So now let's talk about cBioPortalData. There are two main ways to access data through cBioPortal: one is to get the pre-packaged data using cBioDataPack(), and the other way is to use the API interface to download data. We make this really easy so that you don't have to wrestle with the API; we do that for you. What you need to do is just know what dataset you want and download it. And what you can do to look at what data are available is, first we'll load the package and then we'll do cBioPortal(). Yeah, so you may see this warning message. The MSKCC team is constantly working on their API, so things change, and I put that warning in there so that I'm aware of things that are changing; if I need to modify things, I will do that in the package. So if you see that, don't worry about it; the package should still work fine.
And when we do getStudies(), you'll see that you get a neat little table of all the studies that are provided by the cBioPortal service. You get their name, any publication, the journal where these studies were published, some description, whether the study is public or not, the import date. So you have a lot of metadata to work with here, but one of the most important columns to look out for is the studyId column, which is what you use to download a dataset. So we can try this with the data pack: cBioDataPack(), and we just include the study ID as the first argument, the cancer study ID, and then run that. It asks you if you wanna reserve a spot to cache your data; we say yes. And then it pulls the data from their AWS bucket; I believe they have these data packs in a bucket. So you just pull that in, and then the software does the rest for you: it coordinates all of the data that are available and creates a MultiAssayExperiment for you. So you can see that you have the CNA data types, methylation, mutation, RPPA. These are coming from the cBioPortal service. And then we have the cBioPortalData() function, which is the one that accesses the API. What we can do is download this study, uRCC; I forget what that stands for, but it should be in our study column. But yeah, you can download that data with a simple one-line command. You may include a gene panel of interest that is published by MSKCC, or you can provide your own genes that you want to get the data for. So cBioDataPack() gives you everything; this cBioPortalData() function is a little bit more fine-grained and allows you to pull only the data that you're interested in, so it minimizes the use of the service. If we run that, you'll see that we get two assays, and they're represented as SummarizedExperiment and RangedSummarizedExperiment objects, depending on whether the data has some ranged attributes included, as for this mutation data type.
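The two routes can be sketched as follows; the study ID and gene-panel ID are examples, and both calls need a network connection:

```r
## Two ways to fetch from cBioPortal; IDs here are illustrative examples
library(cBioPortalData)
cbio <- cBioPortal()
## route 1: the pre-packaged study bundle, everything included
acc <- cBioDataPack("acc_tcga")
## route 2: fine-grained API query, restricted to a published gene panel
urcc <- cBioPortalData(cbio, studyId = "urcc_mskcc_2016",
                       genePanelId = "IMPACT341")
```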
So now that we have TCGA data to work with, what do we do with it? This is where TCGAutils comes in. It provides a number of functions for working with the data you've downloaded. Some of them convert the row annotations: if you have microRNA identifiers and want to convert them to ranges, you can do that with mirToRanges, or if you have gene symbols, you can convert those to ranges with symbolsToRanges. Or if you have regions of interest, you can use qreduceTCGA to narrow your data down to only those windows. As I said earlier, RangedSummarizedExperiment is sort of an evolution of SummarizedExperiment whose rows are annotated with genomic ranges, and these functions work for copy-number and mutation datasets. We have examples of how you would do that with the ACC dataset we downloaded. There are some quirks working with these data: you may have to convert the annotation style. If we check genome() on the data, and I don't know if the genome is loaded here, you can see that the genome is annotated in, I think, NCBI style, and we jump through some hoops to make this all work, because when you run qreduceTCGA the genome annotations have to match. These are just steps we take to make that work, and we have to load the package to use this function. If you run qreduceTCGA, it takes your mutation data, uses annotation packages like org.Hs.eg.db to convert the row annotations, and compresses them down to, I think, gene windows, so you end up with rows corresponding to those genes. symbolsToRanges converts in the same way. simplifyTCGA sort of does everything in one go. And then we have some functions to help you learn more about what's in the data, especially if you're using TCGA. We have the sampleTables function, which tells you what sample types are included in the data.
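A minimal sketch of those annotation helpers, assuming accmae is a MultiAssayExperiment of ACC assays from curatedTCGAData (the object and assay names here are assumptions; your assay names will differ by data release):

```r
library(TCGAutils)

## check the genome annotation first; qreduceTCGA needs these to match
genome(rowRanges(accmae[["ACC_Mutation-20160128"]]))

## compress ragged mutation / copy-number rows down to gene regions
accmae <- qreduceTCGA(accmae)

## or convert row annotations one assay type at a time
accmae <- symbolsToRanges(accmae)  ## gene symbols -> GRanges rows
accmae <- mirToRanges(accmae)      ## microRNA ids -> GRanges rows

## simplifyTCGA applies all of the conversions in one go
accmae <- simplifyTCGA(accmae)
```

Each step returns a MultiAssayExperiment whose converted assays are now RangedSummarizedExperiment objects.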
So these are the TCGA codes. We have a sampleTypes table included with the package that gives you the definitions for those codes: 01 for Primary Solid Tumor, 02 for Recurrent Solid Tumor, and so on. You can see that in this CNV SNP dataset we have tumors and normals; code 10 samples are normals (blood derived), and code 11 samples are also normals (solid tissue). You may only want to look at tumors, or you may want to compare tumors and normals. Depending on what you want to do, there are functions to help you either split things or remove things. If you want to compare tumors to normals, you can use splitAssays with the codes you want to split by. You'll get some warnings if things are not consistent, but the point is to separate those samples easily so that you can run your analysis comparing tumors to normals, or whatever you need to do. You can see that the assays are now annotated with those codes at the beginning of their names where available: for the first assay there were about five normal samples, but for the others there weren't any, so you only get the tumor samples. We also have TCGAprimaryTumors, which, depending on what kind of cancer you're working with, pulls out only the tumor samples from the data. If you run that, it removes everything else and gives you only the tumor samples, so that's one function to keep in mind. We also did some curation of molecular subtypes in TCGA and included that data with curatedTCGAData. Not every cancer type has these annotations, but some do, and you can look at which cancer disease codes have subtype information by calling getSubtypeMap.
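Those sample-handling steps can be sketched as follows, again assuming accmae is a MultiAssayExperiment of TCGA assays (the splitAssays argument name is an assumption; newer TCGAutils releases expose the same operation as TCGAsplitAssays):

```r
library(TCGAutils)

## tabulate the TCGA sample-type codes present in each assay
sampleTables(accmae)

## the code definitions ship with the package
data("sampleTypes", package = "TCGAutils")
head(sampleTypes)  ## e.g. 01 Primary Solid Tumor, 11 Solid Tissue Normal

## separate primary tumors (01) from solid tissue normals (11), per assay
tn <- splitAssays(accmae, sampleCodes = c("01", "11"))

## or simply keep only the primary tumor samples
tumors <- TCGAprimaryTumors(accmae)
```

After splitting, each assay name is prefixed with its sample-type code, which makes tumor-versus-normal comparisons straightforward.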
So getSubtypeMap gives you a map of all of the columns associated with these subtypes, so that you can use that data and look at what kind of histology is present in those samples. We also have facilities for working with barcodes. If you've used the Genomic Data Commons before, you know that they've switched mostly to universal identifiers, or UUIDs. If you have one of those and want to convert it to a TCGA barcode, you can do that, and you'll get a data frame, a table representation, of the UUID and the submitter ID, which is the patient ID in TCGA. It's useful if you come across a file, or are downloading files, and don't know the TCGA ID: you can look it up using either the file name or the universal ID and translate it to something specific to the TCGA project. To answer the question: yes, it's a built-in function, in TCGAutils. It's designed to work with TCGA, so we don't include it in MultiAssayExperiment; it's in TCGAutils. If you have an ID you can translate it to a TCGA barcode, or vice versa. Some of these UUIDs are case IDs and others are file IDs; you can also translate file IDs, and you'll see that the resulting barcode corresponds to a sample. And you can go the other way around, from a barcode.
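A short sketch of the identifier translations; the UUID strings below are placeholders (the functions query the live GDC API, so you would substitute real case or file UUIDs):

```r
library(TCGAutils)

## translate a GDC case UUID to its TCGA patient barcode;
## returns a data frame pairing the UUID with the submitter ID
UUIDtoBarcode("<case-uuid-here>", from_type = "case_id")

## file UUIDs translate to the barcode of the sample they belong to
UUIDtoBarcode("<file-uuid-here>", from_type = "file_id")

## file names can be looked up directly as well
filenameToBarcode("<gdc-file-name-here>")

## and the reverse direction, barcode to UUID
barcodeToUUID("TCGA-CK-4948-01A")
```

These lookups hit the GDC service, so they need a network connection and a valid identifier to return results.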