Well, it's 4:30, and I think I will call this session to order. It's a pleasure to welcome Professor Martin Morgan of the Roswell Park Comprehensive Cancer Center and the State University of New York at Buffalo to BioC 2022. Martin led the Bioconductor project for quite some time, and stepped down from leading the core team just a couple of years ago. At that time, Wolfgang Huber made a nice slide detailing how the project had changed under his leadership, and there are a great many changes: a vast increase in the number of packages, the number of users, the quantity of data being shipped out, the size of the bill to Amazon Web Services, and really important changes to governance and community engagement. You also heard from Hervé Pagès today, when he received his Bioconductor Award for 2022, just how effective a leader Martin has been. So I welcome you, I thank you, and take it away, Martin. He's going to speak about the Human Cell Atlas and the many interesting functions he has written to help us explore that resource.

All right, thanks, Vince. It's, of course, strange talking to people over the ether, knowing that some of you are actually present in the same room. So I'm going to share my screen; very confusing here, Google Chrome. Here we go. I hope that you can see my screen. Is that looking good for people? Maybe a thumbs up or something?

We can see it.

Great, perfect. Thanks, Mark.

Right. So I wanted to talk about accessing Human Cell Atlas data locally and on the AnVIL cloud, and I'm going to use this amazing Orchestra technology, as well as pkgdown. The package that I'm using is documented here, and I'm going to use Orchestra. For some reason the workshop shows up for me pretty much at the top, with a ton of launches already: "Accessing Human Cell Atlas Data Locally and on the AnVIL Cloud" (you could search for "human" or something like that). I hope you've clicked on that and followed along, and gotten yourself an RStudio session that opens directly in the package. Totally amazing technology; I'm so impressed with that.

Yes, so I'm going to talk about the Human Cell Atlas and CELLxGENE, two different sources of single-cell data. The boots-on-the-ground work has been done by Kayla and Yubo, and I just get to claim the glory here. I'm going to follow these vignettes, and I'm going to start with this first vignette, A_hca.Rmd. And again, I hope you're following along; I'll go slowly. I was listening to the tidy transcriptomics presentation, and the presenter was totally awesome, a way better job of presenting than I've ever done. So I hope that you'll follow along; we'll spend about 20 minutes on this vignette, maybe 15 minutes on the B vignette, and a few more minutes on the C vignette.

I wanted to start by thinking about Bioconductor and single cell. I'll just go to the bioconductor.org website, visit the 2,140 software packages, type in "SingleCell", and find that there are actually 213 packages tagged as single cell. I'm sure you know this, but if you drill down on any one of these packages, you get information about how frequently it's been downloaded and the number of times it's been asked about on the support site, and of course these amazing vignettes, which can totally blow you away with the richness of these resources.
So the first point is that there's a ton of resources, individual packages that allow us to work with single-cell data. And then an amazing secondary resource for single-cell data is this book called OSCA, Orchestrating Single-Cell Analysis with Bioconductor, by several authors. I was just going to jump down to the workflows; there are 14 different workflows that walk through particular analyses. This last one grabs some Human Cell Atlas human bone marrow data, imports it with a few lines of code, and then walks through a typical analysis, with a rich vignette, amazing integrated graphics, and so on. So the second point is that OSCA is an amazing resource for learning about single-cell analysis.

What I wanted to do, though, was actually explore some of the single-cell resources that come from the Human Cell Atlas. I'm a big fan of the tidyverse type of approach to things, so I wanted to start with a quick refresher: load up the dplyr package and start with a familiar example, mtcars, the Motor Trend cars data set, which has, I don't know, some number of rows describing cars of different sorts. I just wanted to represent it as a tibble and then display it, just to see what we've got: cars, miles per gallon, and so on. And then this pipe, the native pipe from base R, takes whatever's on the left-hand side and uses it as an argument on the right-hand side. So this says: show us only the six-cylinder cars, select only the car, miles per gallon, displacement, and horsepower columns, and then update things a little bit with a metric unit, liters per 100 kilometers. You can see just how expressive this kind of tidy framework is. I'm totally enthusiastic about that, and it's going to form the basis of how we're going to access Human Cell Atlas data. (There's a sketch of this refresher just below.)

So what about the Human Cell Atlas? I wanted to go to the data portal; searching for "human cell atlas portal" gets us there. The Human Cell Atlas is sponsored by the Chan Zuckerberg Initiative, and it's a number of individual projects that collect cell-level-resolution expression data. There are currently 267 projects, and you can navigate these, find a particular experiment, learn about it, and then download matrices, processed expression values that you could then use in your own analysis.

So that's the website. But perversely enough, I'm not so interested in the website, because this sort of graphical navigating through and clicking on things is really error prone. I actually clicked on some arbitrary thing two minutes ago, and I don't even know which one I clicked on. And there might be many reasons why I'd want to do something more reproducible and rigorous, maybe even systematically search for studies that investigated, say, the brain. So I've written a package with Maya called hca, and I'm going through this A_hca vignette. I'm going to load hca and another package, LoomExperiment, which Daniel Van Twisk and I worked on together, as well as SingleCellExperiment. So I'm just going to load those into my R session.
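Here is a minimal sketch of that tidy refresher and the session setup; the liters-per-100-km conversion factor is my own illustration, not from the talk:

```r
library(dplyr)

## mtcars as a tibble, piped through dplyr verbs with the base R pipe
mtcars |>
    as_tibble(rownames = "car") |>
    filter(cyl == 6) |>               # six-cylinder cars only
    select(car, mpg, disp, hp) |>     # a few columns of interest
    mutate(lp100km = 235.215 / mpg)   # miles per gallon to liters per 100 km

## packages used for the rest of the vignette
library(hca)
library(LoomExperiment)
library(SingleCellExperiment)
```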
And instead of going to the website and exploring things that way, I'm going to enter this command, projects(), and that'll do the comparable thing: it'll visit the API, the application programming interface, that underlies the website, page through the projects, collect all of the information, and report the data back to my R session. So now I have in my R session a tibble with 267 projects and 14 different columns' worth of information, which I can then process using standard dplyr types of techniques. These are the projects we saw on the website, just presented in a different way.

You'll notice that many of these columns are list columns. For instance, this genus-species column: some of these projects have just a single genus, Homo sapiens, I guess, but this particular project actually has two species, mouse and human. So the representation is a list column, where each element of the list is a character vector of zero, one, or more elements. That can be a little bit tricky to work with. It can help to pull out individual columns, or to use this tidy function called glimpse(), which provides a little overview of what the table looks like; or to pull out the head of a single column and see that the specimen organ part was cortex for the first experiment, the second experiment had two specimen organ parts, and the third and fourth experiments had zero specimen organ parts. So we're discovering how to navigate these data resources.

There's also a neat function called hca_view(), which, when you invoke it, produces a little tibble that you can navigate through and search, for instance, for all of the studies that involve brain, and choose those studies. And then, when you're done, the result is the tibble containing the two studies you selected. So it's very easy to navigate and interact with these large data sets. (A sketch of this exploration follows below.)

Let's see. When we were on the website, we could count the number of columns, like 10 or 12 columns, but actually these experiments are incredibly rich. They're represented as JSON lists, and you can request not just the 10 or 15 columns displayed in the project table on the website, but all of the data. So we end up with the 267 hits, and each of the hits contains a certain number of elements, and each of those elements contains subsets that expand into different components, describing the experiment in very rich detail. You can see how fun it is to navigate these things. So it's just useful to know that there's really rich data out there. And instead of pulling down all of the data and navigating through it as we've done here, there's this amazing package called listviewer, with a function jsonedit(); I'm going off script, so I've lost track of the command, but something like that, I'm pretty sure, allows us to navigate these lists comprehensively. But instead of retrieving all of the data, it often makes sense to query the projects for the things you're particularly interested in.
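A rough sketch of that exploration, assuming the tibble returned by projects() has columns such as specimenOrganPart, and that the interactive helper is hca_view(), as in the package vignette:

```r
library(hca)
library(dplyr)

p <- projects()   # pages through the HCA API; returns a tibble
p                 # ~267 projects by 14 columns, several of them list columns

p |> glimpse()    # compact overview of every column

## peek at a list column: each element is a character vector of 0, 1, or more values
p |> pull(specimenOrganPart) |> head(4)

## interactive table; returns the rows you select, e.g. two brain studies
# brain_projects <- hca_view(p)
```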
For instance, maybe you have an interest in liver, and you'd like to find all of the projects that study liver; maybe you're going to start a meta-analysis or something along those lines. So you can specify a filter that says the specimen organ is liver, and it could be a more complicated filter. Then, when you query projects() with the liver filter, you don't end up with 267 projects; you end up with 23 projects where liver has been mentioned as one of the specimen organs. So it's a very useful way of finding data that you're particularly interested in.

When you look through the hca package, you'll see that projects(), bundles(), samples(), and files() are the main entry points. The queries generally return a finite number of resources, and frequently you want to page through the results: you get a thousand files, and then you want to see the next thousand files, and so on. You can use hca_next() or hca_prev() to page through these larger data sets.

So that's where I'm at so far. I don't see anyone's faces or a chat box, but if you do have a question, feel free to let me know.

One of the great things about the Human Cell Atlas data portal (unfortunately, they're not going to continue with it, so it was a once-great thing) was that they applied a standard analysis protocol. Contributors would provide FASTQ files, and the HCA developed a standard analysis workflow, so that all of the FASTQ files were processed in a standardized way to a count matrix. The count matrix is then made available as a loom file, which can easily be incorporated into Bioconductor at, say, the start of one of those workflows we saw in the OSCA book. So that was totally amazing. They've stopped doing that, apparently, but there are about 77 loom files currently available.

So here's an example of a little more complicated filter. I'm saying: hey, I'm interested in all of the loom files that have been processed using the standard pipeline and that have integrated the samples within a single experiment. I'll evaluate this and query for those files, and I find that there are 79 of these standard processed loom files, which represent a wonderful starting point for tutorials, for student training, or for comparative studies.

What I'm going to do is look for the loom files that are also of liver. So I'm going to take my liver projects and do an inner join with the loom files, using the project ID and project title to do the matching. It turns out that there are 11 projects that study liver and that had their samples processed in this standardized way. Some of the projects may have had multiple samples, different types of samples, involved. But what I'm going to do is grab one of these projects, this one here, and grab its loom file and take a look at it.

So let's see. I'm just going to focus on the second one, the human liver cellular landscape by so-and-so: Homo sapiens, caudate lobe of the liver, 10x 3' v2 single-cell data, processed in this standard way, adult human cells, and a loom file; and this is when it was generated. So what I'm going to do is identify that particular file, and then, as you can see here, actually retrieve the data from the HCA. (These steps are sketched below.)
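A sketch of the liver filter, the standard-pipeline loom query, the join, and the download, assuming the hca filters()/files()/files_download() interface; the loom facet names (fileSource, isIntermediate) and the join columns are from memory of the vignette and may differ:

```r
library(hca)
library(dplyr)

## projects mentioning liver as a specimen organ: 23 rather than 267
liver_filter <- filters(specimenOrgan = list(is = "liver"))
liver_projects <- projects(liver_filter)

## loom files produced by the standard (DCP/2) analysis pipeline
loom_filter <- filters(
    fileFormat = list(is = "loom"),
    fileSource = list(is = "DCP/2 Analysis"),
    isIntermediate = list(is = FALSE)
)
loom_files <- files(loom_filter)   # 79 standard processed loom files

## the 11 liver projects that also have a standardized loom file
liver_looms <- inner_join(
    liver_projects |> select(projectId, projectTitle),
    loom_files,
    by = c("projectId", "projectTitle")
)

## retrieve one loom file into the local cache; returns the path on disk
loom_path <- liver_looms |> slice(2) |> files_download()
```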
I guess this is going pretty quickly, which is great, right? Because everything's in the cloud: Orchestra's in the cloud, the HCA is in the cloud. So we've downloaded this loom file containing the single-cell data from that particular experiment, and we've saved it locally on disk in a cache. If we were to evaluate this command again, it would be cheap; we don't have to re-download the file.

What I'd like to do is take that file. It's great that it's on disk, but we'd like to be able to import it into Bioconductor. So I use the LoomExperiment package, and I'll import the loom file from the location where I saved it on disk. This is in real time. Now I've got a SingleCellLoomExperiment, which is like a SingleCellExperiment: 58,000 genes, which is the set of genes used in the HCA standard processing pipeline, and 332,000 cells. So actually quite a bit of data here. You also get quite a bit of useful information from the metadata on the loom: when the file was created, the inputs, the samples that were used, the pipeline version, and so on. Tons of useful data.

The thing about this loom file that is a little bit disappointing, but turns out to be easily remedied: it has tons of information on the individual cells, this colData, these 43 columns. But these 43 columns are all QC metrics, like reads unmapped and spliced reads and so on, and nothing about the biology. So we don't know whether there were males or females in this study, what their ages were, anything like that. But it turns out that that information is available in the HCA, and you can retrieve it with this optimus_loom_annotation() function. It takes the loom file, figures out from the metadata where the samples came from, collects the biologically interesting information about the samples, and adds it. So now we've got 98 columns instead of 43, including 55 new, biologically interesting columns. For instance, we can take the colData of the annotated loom and figure out that in these 330,000 cells there are actually five separate samples: four of them are male and one is female, they range in age from 21 to 65 years, and this is the number of cells in each of the samples. So there's actually a very rich set of data available in the HCA, and it's extremely accessible from within Bioconductor. There are all kinds of interesting things one can do to get going, as a pedagogical tool, fitting directly into the Orchestrating Single-Cell Analysis book that we saw at the very beginning. (These steps are sketched below.)

So I'm going to switch gears, but maybe I'll just pause and take any questions.
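A sketch of the import and annotation just described; I pass the imported object to optimus_loom_annotation(), though the function may instead expect the file path (check the hca documentation), and the donor column names in the final summary are my approximation of the HCA metadata fields:

```r
library(LoomExperiment)
library(hca)
library(dplyr)

## HDF5-backed (delayed) assays, so the import is fast and the object lightweight
scle <- import(loom_path)   # a SingleCellLoomExperiment: ~58k genes x ~332k cells
metadata(scle)              # creation date, inputs, pipeline version, ...

## colData holds only QC metrics; pull sample-level biology from the HCA
annotated <- optimus_loom_annotation(scle)   # 98 rather than 43 colData columns

## e.g., summarize donors: sex, age, and cells per sample
as.data.frame(colData(annotated)) |>
    count(donor_organism.sex, donor_organism.organism_age)
```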
Anyone have any thoughts?

(Is this thing on? It takes a second, doesn't it. Anybody here have any questions for Martin?)

Why was there a need for LoomExperiment? Sorry, this is Leonardo. Hi, Martin. From what I see, you end up casting into SingleCellExperiment, so I don't understand what's different in LoomExperiment.

Yeah. So the loom format supports, in particular, these row and column graphs, a sort of hierarchical structuring of the rows and columns. That's part of the loom format, and it didn't fit in well with the SingleCellExperiment class. So it's a kind of lightweight extension; that's the main reason. I'll also mention that these assays are delayed. The loom format is an HDF5 file format, an on-disk format, and when I loaded this file, it loaded really quickly because it didn't actually load most of the data at all; most of the data is still on disk. So this object is quite lightweight, and that also makes it easy to convert to SingleCellExperiment.

Thank you. Thank you, Martin. So, this particular object that you loaded had no row and column graphs, right? It was zero.

Yeah.

Is that the case for all of the Human Cell Atlas?

I think so, yes. I think that none of the Human Cell Atlas data actually uses them; they produce loom files, but the loom files don't contain the row and column graphs.

Awesome. Thank you. Pretty interesting.

All right, maybe I'll switch to CELLxGENE.

So, I have a question. The LoomExperiment class is basically a SingleCellExperiment subclass with a few extra fields?

That's correct, yeah.

Okay. Thank you.

I kind of think that, if we were to drill down and go a little bit off-piste, we'd find that when we imported this file as a loom file, I bet we could have imported it as a SingleCellExperiment directly. Because, you know, it advertises itself as loom, but loom is a SingleCellExperiment plus extras, so we can just import the relevant parts.

All right, let me talk about CELLxGENE. CELLxGENE is really a pretty neat little endeavor. This is by the CZI, and in many ways it's similar to the Human Cell Atlas. There are a number of different experiments: collections are like individual experiments, and datasets are data sets within an experiment. So maybe one lab does an experiment that has several data sets, and so there's a collection that has several data sets. You can find the study that you're interested in, Alzheimer's disease from Seattle, appropriately enough, and find out a little bit of information about it. And then, when you click on this, it actually does something that I wasn't quite expecting. Oh, I didn't want to click on it; I'll go back here. So we drilled down to a bunch of additional information. I wanted to click on this explore icon here, which opens the data set in the interactive explorer. Investigators submitting a data set also provide standard metadata, so I know that there are 21 data sets in CELLxGENE that contain African American females who were studied with this particular protocol. I actually find that interesting in and of itself: it provides us with a quantitative way of assessing diversity-and-inclusion types of metrics within a contemporary project.

So we're going to grab the African American female data sets, and I'm going to find the files associated with those studies. Again, these are files: there are 63 files associated with these 21 studies, files provided by the project. You'll notice that 63 divided by 21 equals three, right? There are three types of files for each data set: H5AD, the AnnData format; RDS, which sounds totally promising but is a bit of a boondoggle, because those are Seurat objects, and in general RDS files are terrible for maintenance, just because of versioning types of issues; and CXG, the internal CELLxGENE data format. (These steps, through the download, are sketched below.)
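A sketch of the selection, file listing, visualization, and download with the cellxgenedp package; db(), datasets(), files(), datasets_visualize(), and files_download() are its entry points, but the ethnicity and sex column names, and their list-column structure, are my guess at the schema version in use:

```r
library(cellxgenedp)
library(dplyr)

cxg <- db()   # snapshot of the CELLxGENE data portal
ds <- datasets(cxg)

## the 21 datasets containing African American female donors
afam_female <- ds |>
    filter(
        vapply(ethnicity, \(x) "African American" %in% unlist(x), logical(1)),
        vapply(sex, \(x) "female" %in% unlist(x), logical(1))
    )

## their files: an H5AD, an RDS (Seurat), and a CXG for each dataset
fls <- files(cxg) |>
    inner_join(afam_female |> select(dataset_id), by = "dataset_id")

## open one dataset in the CELLxGENE browser, from R
# afam_female |> slice(1) |> datasets_visualize()

## download its H5AD file to the local cache; returns the path on disk
h5ad_path <- fls |> filter(filetype == "H5AD") |> slice(1) |>
    files_download(dry.run = FALSE)
```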
So I'm going to grab this file, and instead of visualizing it through the browser, I'm going to invoke the browser from R and visualize that particular data set from R. So I evaluated this set of commands, chose that file, and sent it to the browser. And here's my window; here's that data set, colored by cell type. But it would be way cooler to download that file to my local disk and then do all kinds of fun things independent of CELLxGENE. So again, I'm downloading this file from CELLxGENE to wherever Orchestra is running; pretty quick, because it's running in the cloud. And now I'm going to read it in. This is using zellkonverter: I'm going to read the H5AD file from disk into R and then have a quick peek at it. You can see that it's a SingleCellExperiment: we've got 33,000 genes and 31,000 cells. And then we can just integrate that into a standard workflow. Trivially, I'm just going to create a ggplot that is the same as what we generated online; but of course we've done it interactively, so now we have complete control over this object, and it fits directly into the standard Bioconductor workflows. So the cellxgenedp package turns out to be a wonderful way of reproducibly retrieving the data and then introducing it into whatever workflow we might be interested in. (The import and plot are sketched below, after this exchange.)

I'm going to stop there for another second or so. Happy to take any questions.

Thanks, Martin. Leonardo again. I'm going to ask you some questions that I get for recount. Two of them. My impression from looking at the object is that you can actually combine the data across different loom objects or SingleCellExperiment objects; I saw that in the colData you're using character columns instead of factor columns. So am I correct in guessing that you can combine all of them together into a single very large object?

Yes. In principle, you could. I mean, there are issues of batch correction and so on, but yeah.

No, no, just the cbind part of it, before you analyze it.

Yeah, yeah.

And then the second one: if I have my own data set, can I process it in a comparable way, to compare against the HCA? Or is that not easy to do? I don't know.

Yeah, that's the next part of my talk.

Great. Nice.

Yeah, absolutely. Shall I go there?

Please.

Okay.
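Returning to the CELLxGENE import just described, a sketch using zellkonverter's readH5AD(), with h5ad_path from the earlier sketch; the "X_umap" embedding name and the cell_type column are assumptions about what the investigators submitted:

```r
library(zellkonverter)
library(SingleCellExperiment)
library(ggplot2)

sce <- readH5AD(h5ad_path)   # H5AD file -> SingleCellExperiment, ~33k genes x ~31k cells
sce

## plot the stored embedding, colored by cell type, as on the portal
umap <- reducedDim(sce, "X_umap")
df <- data.frame(
    UMAP1 = umap[, 1],
    UMAP2 = umap[, 2],
    cell_type = colData(sce)$cell_type
)
ggplot(df, aes(UMAP1, UMAP2, color = cell_type)) +
    geom_point(size = 0.2)
```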
So Leonardo asked a great question, and I'll just go back to the HCA data portal. You'll notice that we were under this Explore tab, where we had all of the data sets, but there are also these Pipelines: the HCA came up with some standard analysis workflows, for instance this Optimus workflow for 10x v2 and v3 gene expression assays, and they formalized these workflows. This is the heavy lifting, right, of going from FASTQ files to a count matrix; part of the heavy lifting, anyway. It is a heavy lift: big data to moderate data. They standardized that transformation, the steps from the FASTQ files to the expression assays, and they've done, all things considered, a great job documenting what they've done. The process has been written as a series of formal workflows in WDL, the Workflow Description Language, and you can parse through here: it takes a couple of FASTQ files, it's going to use the STAR aligner, it's going to use a particular reference, and it's got a version on the pipeline. So that's really great: a formal description of the steps involved, including parallel computation, to transform the FASTQ files to a count matrix.

So Leonardo, you could take your FASTQ files from your own experiment and apply this pipeline, if it's the appropriate data format, and come up with a starting point for your study that's equivalent to any of the other studies. I think that's totally amazing; really great, reproducible. And of course you and I can disagree, or have conversations, about the relative merits of different steps in the workflow; maybe we just disagree with one choice of parameters, or they should have done something else. But at least we know what we're talking about, and that's super useful and super exciting.

So: the Workflow Description Language, the Optimus standard analysis pipeline, that's all great. The only thing is that you actually need compute resources to run the workflow; all well and good to have a workflow, but if you can't run it, that's unfortunate. And I just wanted to mention the AnVIL project, which we've been involved with, and which provides that type of resource. So I've actually opened AnVIL, at anvil.terra.bio (see also anvilproject.org; the link is in the third vignette), and I've opened a particular workspace. The workspace is cloned from something produced by the HCA that illustrates how to use the pipeline we were just looking at to analyze data. So you have data, 10 samples with FASTQ files, mice and humans, so there are two subsets of data. The idea is that you choose the workflow you're interested in and provide inputs, including cloud-based references and the FASTQ files. These FASTQ files are actually in the Google cloud, they're not local, and therefore access to them is fast in the Google cloud. Then, when you're ready, you click the run-analysis button, and it will marshal the compute resources required to run the workflow, run the workflow, and produce outputs that are available within the workspace, including, for instance, the loom file. And then you could incorporate the loom file into an interactive analysis using RStudio.

So that's a very quick tour of AnVIL and of the connection between the HCA, these well-documented, well-defined workflows, and the compute resources provided by AnVIL, which allow you to do the types of analyses where, like Leonardo said with recount, you'd really like to run the same analysis on a whole bunch of different data; this gives you a formal way of doing that. (A sketch of the R side follows below.)
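On the R side, a hedged sketch with the Bioconductor AnVIL package. To the best of my knowledge, avworkspace(), avtable(), and avfiles_ls() are real entry points, but the workspace name below is made up, and in the talk the workflow itself is launched from the Terra web interface rather than from R:

```r
library(AnVIL)

## point the session at a workspace cloned from the HCA's Optimus example
## (the workspace name below is hypothetical)
avworkspace("my-billing-project/Optimus_HCA_example")

## the sample table: 10 samples with cloud paths to their FASTQ files
samples <- avtable("sample")
samples

## after clicking 'Run Analysis' in Terra and waiting for the workflow,
## outputs such as the loom file land in the workspace bucket
avfiles_ls()
```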
So I think I'll stop there. I think I'm at the end of my 45 minutes of fame, but I'm happy to take any questions or delve into more detail on anything I've been talking about. Any questions? (Takes a little time.)

Hi. You mentioned something about a discontinuation of this process. Can you say anything more about that?

Yeah. The Human Cell Atlas has a Slack channel where they announce new data sets as they're added, and I noticed just last week that they were adding new data sets but no more loom files. So I asked about that, and I guess they decided not to perform the standard analyses anymore. The public explanation was, partly, a lack of consensus about what the right analysis is for a particular workflow. Pretty interesting. Personally, I think it'd be great to do the wrong analysis consistently, you know: tell the same lie.

Yeah, that seems like it would be a little more useful to the world than not doing anything.

Yeah. Or each of us doing our own thing and telling our own lies.

This is one of those situations where even doing the wrong thing might be better than doing nothing.

Maybe, yeah. At least you learn something.

Yeah. I guess it opens the door for a project for one of us to take up.

Well, one other thing that came up a little while ago with your metadata was listviewer, and somehow I think that package was not installed. And I know your list-of-lists structures also have this reflection in JMESPath. Is that something that you think more people should understand and take advantage of?

Yeah. So, is it listviewer? It's listviewer; I think you just need to install it. Listviewer is totally amazing, a way of navigating these complicated list-of-lists structures. What did I have here? I think I called jsonedit(), listviewer::jsonedit(), which seems like an unlikely name for a function. So this was the list of lists of lists of lists that I downloaded. It's taking just a second to open up, but here's the viewer, and you can see that it's really straightforward to navigate: these are the 267 different projects, each of the projects has a title, and you can imagine navigating down through that.

This little widget is great; it allows you to explore these things in a way that's way better than the alternatives. But there's also a language for making these queries, called JMESPath, J-M-E-S path, which is probably totally familiar to the JavaScript devs out there. You can say: hey, for all of the objects, I want to find the hits. Let's see what we've got here; again, I'm going off-piste, so let's start this again. Here's our hits; for all of these hits we want to find the project titles. There's a projects element in each hit, in each hit there are a bunch of projects, and each project has a projectTitle. These are queries into the JSON object, kind of like XPath queries into XML. So this is the title of the first project, and the second project, and so on. That's JMESPath. And actually, in the cellxgenedp package there's a very useful function, which should be exposed more broadly, called jmespath(), that will allow you to query a list object, or rather these JSON objects, for things like project title. So it provides a very convenient way of extracting data from these complicated structures. (A sketch follows below.)

Thanks for that quick tutorial. That's something we're running into a lot more often, the JSON metadata, so it's good to have.

Yeah. I think you introduced me, Vince, to this listviewer; it's a really great little tool.

All right, I've overstayed my welcome, so I'll sign off.
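A sketch of both tools. I assume projects() can return its raw response as nested lists via the 'as' argument (the package also has a "lol", list-of-lists, representation), and that cellxgenedp's jmespath() accepts a JSON string, hence the round-trip through jsonlite:

```r
library(hca)
library(listviewer)

p_list <- projects(as = "list")   # full JSON response as nested R lists

## interactive widget for exploring deeply nested lists
jsonedit(p_list)

## JMESPath: hits[].projects[].projectTitle reads as
## "every projectTitle, in every project, under every hit"
library(cellxgenedp)
titles <- jmespath(
    jsonlite::toJSON(p_list, auto_unbox = TRUE),
    "hits[].projects[].projectTitle"
)
head(titles)
```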
Yeah, Leo again, sorry. Looking at the metadata of any of your objects, I see that there's a lot of information there. But the Optimus documentation lists, for example, what version of GENCODE was used and what version of the genome reference was used, and I saw that that information is missing from the metadata. So maybe that's an area for improvement there. Based on what my lab has worked on, sometimes finding the annotation of a file is really complicated. Another thing: my guess is that the individual studies are only available in one version of Optimus. Or are there duplicated studies? And, I mean, you just mentioned that maybe they'll reprocess some of this with a new version of the genome. Are you going to have to redo all of this with a new version? Because that's going to change all the coordinates and all of that.

Yeah, absolutely. So, you're right that these were run once; there's at most one loom file per data set. Actually, the loom files are generated on individual samples, and then the samples are aggregated into a loom file that represents the collection of samples within a data set. But there's only one loom file per data set, and there's a particular version of the Optimus pipeline that was used to generate it, which also implies a particular reference genome. I'm not sure that the reference genome is actually referenced here; it could be that we've lost some of that in the provenance of the object.

Well, Martin, thank you for sharing with us. This was great. It's my pleasure to learn something when you teach; still going on, years and years later.

That's good. I'm glad I haven't graduated to the negative-learning camp.

So, I think this is the last session of the day, and I'm going to wrap it up. I want to remind everybody that tomorrow we're starting up here, rather than in the Building Cure, so be sure not to go over there tomorrow; come here. And thanks again, Martin. That was terrific.

Thank you. And of course, digital applause too.