Hi everybody. It's ontology time. Are we sure that we are rolling for the global cloud? Looks like WebEx or whatever it is. Okay, wonderful. Well, thank you for coming. It's a small group, intimate. I see most people do not want to sit in the front, but I can come to you.
All right. Well, ontology and ontologies for information science are getting more and more familiar to just about everybody. So probably you don't need the basic introduction, but maybe some of our listeners do. Controlled and versioned vocabularies with fixed-length tags. And you'll see quite a few of these as we go on.
Controlled vocabulary. It means that there is a collection of individuals engaged in choosing terms that should be used conventionally to label things and processes. If you think about the Gene Ontology, for example, that's been around an awfully long time. There are three major subontologies: cellular component, molecular function, and biological process. What terms should we use to characterize sub-concepts of those three major concepts? Well, the Gene Ontology Consortium comes up with the conventions. And then there are GO:NNNNNNN tags, that is, GO, a colon, and seven digits, for each one of the terms. And then there is some kind of graph relating all of these terms to one another. And it's supposed to be a directed acyclic graph.
So, systematic controlled and versioned vocabularies with fixed-length tags. And it turns out that the versioning is coincident with growth. In the case of the Cell Ontology, for example, I'm going to be showing you something which is somewhat outdated, and I'm concerned that the tools that I use to import it into Bioconductor or R aren't going to work anymore, because the number of categories and relationships that are thought to be relevant to the Cell Ontology has grown so much that the simple tool that exists doesn't seem to be capable of parsing it. But we're not going to belabor that.
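To make the two structural ideas above concrete, fixed-length tags and a directed acyclic graph of terms, here is a minimal sketch. It is written in plain Python so that it is self-contained, and it is not the API of any of the R packages discussed in this talk. The three GO root tags are real; the TOY tags and the parent links are made up for illustration.

```python
import re

# A fixed-length tag as described in the talk: prefix, colon, seven digits.
TAG_PATTERN = re.compile(r"^[A-Za-z]+:\d{7}$")


def is_valid_tag(tag):
    """Return True when `tag` looks like GO:0008150 (prefix, colon, 7 digits)."""
    return bool(TAG_PATTERN.match(tag))


def is_acyclic(parents):
    """Check that a {term: [parent terms]} map contains no directed cycle."""
    IN_PROGRESS, DONE = 1, 2
    state = {}

    def visit(term):
        state[term] = IN_PROGRESS
        for parent in parents.get(term, []):
            if state.get(parent) == IN_PROGRESS:
                return False  # we walked back onto our own path: a cycle
            if parent not in state and not visit(parent):
                return False
        state[term] = DONE
        return True

    return all(visit(term) for term in parents if term not in state)


# The three GO root terms plus two hypothetical children (TOY tags are made up).
toy_parents = {
    "GO:0008150": [],  # biological_process
    "GO:0003674": [],  # molecular_function
    "GO:0005575": [],  # cellular_component
    "TOY:0000001": ["GO:0008150"],
    "TOY:0000002": ["TOY:0000001", "GO:0008150"],
}

print(all(is_valid_tag(tag) for tag in toy_parents))
print(is_acyclic(toy_parents))
```

A term with two parents, like the second toy term, is allowed in a directed acyclic graph; what is ruled out is a path that leads from a term back to itself.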
Controlled and versioned vocabularies, relationships among terms and term reference, formally defined. And the relationships are often of the form is_a, but they can be more elaborate. And we will look into that. And then the question will arise, when we think about the Cell Ontology, have we characterized in the Cell Ontology all of the features that we would like to be able to characterize cell types with? Do we do it in a consistent way? And I would say the answers to those questions are unclear.
Now, in a more elaborate treatment of ontology, we would get into some more, I don't really know how to put it, this RDF, Resource Description Framework, I think is what that stands for. And OWL, in RDF/OWL, is the Web Ontology Language. Those languages or markups, or however you want to put it, are the canonical tools at the moment for representing ontologies in their most complete form. And one of the things you will hear about in connection with OWL is the idea of inferencing. So there is a field of logic called description logic, which has certain limitations relative to more general logic, so that you can do computations about descriptions. And tools for doing that should help us to learn about the conceptual relationships among the things we are trying to ontologize. I will not be getting into that, but it is a very worthy subject of study for bioinformatic applications. Perhaps next year we will do something more elaborate about that, but this is a very introductory treatment of the topic. By the way, stop me at any time if something I've said is wrong or needs clarification.
So RDF exists, OWL exists, and OWL is the way of using RDF, for the most part, to express the entirety of an ontology. And Open Biological Ontologies is OBO, and the OBO format is another form of describing the information in ontologies.
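For listeners who have not seen the OBO flat-file format, here is a minimal sketch of what reading it involves. It is plain Python, not the parser used by any R package mentioned here, and it only handles the `id`, `name`, and `is_a` lines of `[Term]` stanzas. The sample text and its TOY tags are hypothetical.

```python
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: TOY:0000001
name: cell

[Term]
id: TOY:0000002
name: blood cell
is_a: TOY:0000001 ! cell

[Typedef]
id: part_of
name: part of
"""


def parse_obo_terms(text):
    """Return {id: {"name": str, "is_a": [parent ids]}} for [Term] stanzas."""
    terms = {}
    current = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            in_term = line == "[Term]"
            current = None
            continue
        if not in_term or ":" not in line:
            continue
        key, value = line.split(":", 1)
        value = value.strip()
        if key == "id":
            current = terms.setdefault(value, {"name": "", "is_a": []})
        elif current is not None and key == "name":
            current["name"] = value
        elif current is not None and key == "is_a":
            # Drop the trailing "! label" comment that OBO files often carry.
            current["is_a"].append(value.split("!", 1)[0].strip())
    return terms


for tag, record in parse_obo_terms(SAMPLE_OBO).items():
    print(tag, record["name"], record["is_a"])
```

Real OBO files carry many more line types (definitions, synonyms, cross references, typed relationships), which this sketch silently ignores.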
And I think the idea among ontologists is that OBO is probably not an ideal representation, although that is what we use to bring ontologies into Bioconductor in the material I'm going to discuss.
Now, there's a very nice package built by Daniel Greene, who's in the UK, called ontologyIndex, and he has an ontologyPlot package also. In Bioconductor, Laurent Gatto has built a package called rols, which is the R interface to the OLS, the Ontology Lookup Service of the EBI. And you can use the Ontology Lookup Service at the EBI to enter a term, maybe the name of a cell type that you're interested in. And then you'll get back a list of all the ontological resources relating to that cell type. And they turn out to be many, oftentimes. And rols helps you to program over that process of getting ontological lookups.
glitter is an interesting package that helps you to query RDF stores. I didn't get a chance to do much with that. Only learned about it recently. It is in the R-universe. It is not on CRAN. And it's easy to install, and it helps you to form these things called SPARQL queries without getting too involved in the complexity of that.
Now, Protégé is a famous tool that's been around for a while for working with ontologies. And here's a view of Protégé in dealing with the Cell Ontology. It's very easy to use this WebProtégé. And let me just see. I need to get my mouse, one second. Any Protégé users out there? It's for people who are building ontologies. Oh, yes. I threw it in here. And marking them up, putting additional information in there, and so forth. Let's turn this on. Yeah. So you can get a sense of the complexity and richness of the Cell Ontology by using Protégé to import it. And you can import it either in the OWL form or in the OBO form. And what we have here are references to other ontologies. So I know that this is Protein Ontology. For some reason, a Protein Ontology term is up high in the graph of the Cell Ontology.
These are Sequence Ontology terms. Well, I guess I could click on it. What is it? Sign in. Okay. So my system went stale there for a little while. And now I'm just clicking on this Sequence Ontology item. And this is such a big system that it looks like it's taking a little while to wake up. But in any case, you can get an account on WebProtégé very easily and import any ontology you're interested in and have a nice interactive interface to learn about the terms and the overall structure of the ontology. We'll come back to that.
There are some other things that help you do this inference over ontologies. I don't think WebProtégé does it. I'm not even sure desktop Protégé does. But ROBOT is one system that does. And there was another wonderful Apache system called Jena that did that. And I don't know whether that's still functioning.
All right. So now let's talk about some applications in our world of biomedical research, computational biology, and so on. The first one has to do with GWAS, genome-wide association studies. These are studies where individuals are taken and genotyped, and they are then characterized in terms of certain phenotypes. And then we test for the association between the genetic content at a SNP and the variation in the phenotype, over millions of SNPs. So GWAS are very famous scientific endeavors. And there are thousands of them. And one of the things we have at the EBI is a GWAS catalog.
By the way, I didn't really talk about it. I guess this is sort of a workshop. You can get the code out of the onto2022 GitHub repo in my GitHub, which is vjcitn. And this vjcitn.github.io/onto2022 is a pkgdown site. So all of these calculations can be done with this package, which you could install using devtools, install_github. Yeah. Yeah. vjcitn. Why don't I just spell this out here? Let me blow this up a little bit. Here we go. https://github.com/vjcitn/onto2022. That's the source repo, I hope.
And the github.io would be vjcitn.github.io/onto2022. That should be good enough to get to what I'm showing you here.
So EBI has a catalog, and the GWAS catalog from EBI is something that I have put in the data folder of this onto2022 package, and it's called NuCat. And it has 379,589 GWAS associations. And the associations here are significant, statistically significant. Usually that's something like 5 times 10 to the minus 7th or 8th. And replicated. And the information on both the source result and its replication is provided in the GWAS catalog. And this NuCat just gives you a little slice of the information that's in there. And you can use operations on the GRanges to get more information on a region you might be interested in. Let's say there's a chromosomal segment that you wanted to see if there are any GWAS hits in. You do a findOverlaps with the genomic ranges, and that's what you can find.
Well, there's a piece of the catalog that the EBI makes available. They call it MAPPED_TRAIT_URI. And that turns out to be these fixed-length tags that I mentioned when I said what an ontology is for information science. These fixed-length tags are references to terms in the Experimental Factor Ontology, EFO. And there is a version of the Experimental Factor Ontology inside the ontoProc package. And you use getEFOOnto to get an instance of it. And it's just an ontologyIndex instance. And then you can get the name and find out that the phenotype of this particular GWAS is, well, BMI-adjusted waist circumference. Notice that this EFO tag has a very precise term here. And this is just waist circumference. Maybe it also says BMI-adjusted in this DISEASE/TRAIT field. So this is a way of making the phenotype very precise, giving it a formal term, locating it in an ontology. And there may be relationships between this term and other terms related to body habitus. Okay? So when EBI gives you this, ontoProc can give you this.
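The step from a mapped trait URI to a term name can be sketched as follows. This is plain Python, not ontoProc code. The URI pattern and the comma-separated form of the field are assumptions about how the catalog is laid out, the name dictionary stands in for the named vector of term names in the EFO object, and the second tag is a made-up placeholder.

```python
def tags_from_mapped_trait_uri(field):
    """Split a MAPPED_TRAIT_URI-style field into colon-form tags.

    Assumes one or more URIs separated by commas, each ending in a tag
    written with an underscore, e.g. ".../EFO_0007789".
    """
    tags = []
    for uri in field.split(","):
        uri = uri.strip()
        if not uri:
            continue
        last_segment = uri.rstrip("/").rsplit("/", 1)[-1]
        # The lookup table below is keyed by the colon form of the tag.
        tags.append(last_segment.replace("_", ":", 1))
    return tags


# Stand-in for the named vector of term names held by an ontology object.
# Only the first entry comes from the talk; the second is hypothetical.
efo_names = {
    "EFO:0007789": "BMI-adjusted waist circumference",
    "EFO:9999999": "hypothetical second trait",
}

field = "http://www.ebi.ac.uk/efo/EFO_0007789, http://www.ebi.ac.uk/efo/EFO_9999999"
for tag in tags_from_mapped_trait_uri(field):
    print(tag, "->", efo_names.get(tag, "not found"))
```

Using `efo_names.get` rather than direct indexing means a tag that is missing from the local copy of the ontology is reported instead of stopping the loop, which matters when the catalog is newer than the ontology snapshot.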
You don't have to use rols or anything else. You can just get the terms by knowing that you need to work with the named vector of names in the EFO object. Any questions about this? I get the EFO. I have it. And this is just a quick illustration of how you can get some information about a tag when you want to know what it means.
Now, that's application number one: using the GWAS catalog and getting the precise definitions of the phenotypes according to the MAPPED_TRAIT_URI. Notice that the URI also gives us a way of looking this up. Well, why don't we do that? Why don't we see what this is on the web? There you go. So this is the OLS, the Ontology Lookup Service. And they've given you a URI for smoking behavior. Now, let's see. Is that? Yeah. So let's check 7789 also. Because maybe this was a study of both waist-hip ratio and smoking behavior. Already, I've forgotten what the term is. What is it? 7789. Remember that. Here we go. Type it in. There you go. BMI-adjusted waist circumference. And here again, you see these trees or directed acyclic graphs that take you from, oh, this is an information entity. Nice to know. Anthropometric measurement, body weights and measures. And finally that, part of EFO. Now, database cross references: for example, there's a PubMed ID here, and that is telling us about the GWAS that this is being derived from. So it's a very nice way of working through a network of information resources anchored by the ontology.
Another approach to dealing with GWAS, which I would say is somewhat deeper than that which is practiced in the EBI GWAS catalog, is from the group with Gib Hemani and Tom Gaunt out in the Integrative Epidemiology Unit at Bristol. And George Davey Smith is also there. So they have something called the OpenGWAS ecosystem. And there's a package, which I think is only on GitHub at this point, ieugwasr, which has a very nice interface to a vast collection of results.
Essentially, all the associations, not just the significant ones, in a great number of GWAS. And so if you run the gwasinfo function with ieugwasr, you get 42,000 records. And each one of those is about a GWAS. The number is a little exaggerated because this includes eQTL studies. So each gene in an eQTL study is a phenotype, and that leads to its own set of summary statistics, which are collected in this archive. So it's more on the order of a few thousand GWAS rather than 42,000 GWAS, but it's still a lot. And there's a lot of depth.
And you'll notice that this is a tibble, and there's a bunch of columns available. And one of them is ontology. And so if we look at the frequencies of ontology, we sort the table. A lot of these studies aren't given an ontology tag, but some are. And you now know how to look up the EFO tags at least. This might be the Human Phenotype Ontology. I don't know that I have that in ontoProc, but EFO we do. We might have Mondo also. And so you know now how to look these up. But one little gotcha is that they don't consistently annotate the fixed-length tags. We have a colon here. We have an underscore here. If you use this in the wrong place, you will get into a bit of trouble.
Yeah. Yeah. You might have what? Some consistency. Okay. The question is about this problem. And I don't have a definitive answer about this, but I think what's going on is that in the GWAS catalog, EBI has used the underscore. And that's because they want to build a URL, and they avoid putting a colon in the URL. Okay. So you'll get underscores where people have derived the tag from what EBI has done in the GWAS catalog. Other people just use the colon because they're not concerned about making a URI. That's my hypothesis. But it's just another one of those wrinkles in the world of bioinformatic data that you have to cope with.
All right. So that's the GWAS issue: phenotypes of GWAS characterized in a very precise way.
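One way to cope with the colon-versus-underscore wrinkle before tabulating or looking anything up is to normalize every tag to a single form. This is a minimal sketch in plain Python, not part of ieugwasr or ontoProc; the input list is made up, and treating anything that does not look like a tag as "unannotated" is an assumption for illustration.

```python
import re
from collections import Counter

# Prefix, then either separator, then digits: matches EFO:0007789 and EFO_0007789.
TAG_PATTERN = re.compile(r"^([A-Za-z]+)[:_](\d+)$")


def normalize_tag(tag, sep=":"):
    """Rewrite a tag with the chosen separator, or return None if it is not a tag."""
    match = TAG_PATTERN.match(tag.strip())
    if match is None:
        return None
    return f"{match.group(1)}{sep}{match.group(2)}"


def tabulate(tags):
    """Count normalized tags, most frequent first; missing values are pooled."""
    counts = Counter()
    for tag in tags:
        normalized = normalize_tag(tag) if tag else None
        counts[normalized if normalized is not None else "unannotated"] += 1
    return counts.most_common()


# Hypothetical contents of an ontology column with mixed conventions.
ontology_column = ["EFO:0007789", "EFO_0007789", None, "EFO_0007789"]
print(tabulate(ontology_column))
```

Passing `sep="_"` gives the underscore form instead, which is the one to use when building a lookup URL.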
We can see things like how to derive a term about a phenotype. And you know, if you wanted to get a sense of all the anthropometric measurement descendants, you can jump up there and then see that there are a great many terms that have been built in the ontology, and there are tags for all of them. Okay. And if you use rols, you'll be able to program into this, but I haven't gotten into that.
Now, the more pressing issue, I think, for many people in this Bioconductor meeting is single-cell biology. And there's a wonderful book built by Aaron Lun of Genentech, the SingleR book. And this is just a snapshot from it. You should be able to find it. Just say SingleR book, Bioconductor, in Google, and you'll get this. And one of these chapters is "Exploiting the cell ontology." And SingleR is all about annotating cells that you are assaying in your single-cell RNA-seq. And celldex is a package that maps the labels in its references to the Cell Ontology. And that gives us a standardized vocabulary with which to describe cell types, facilitating integrated analyses with multiple references. However, another useful feature of the Cell Ontology is its hierarchical organization of terms, allowing us to adjust cell type annotations to the desired resolution. This represents a more dynamic alternative to the static label.main and label.fine options in each reference. So label.main would be a relatively crude classification, maybe lymphocyte, and label.fine could be something like activated T lymphocyte, something like that. Well, we can get more precise than that, and we will.
And this is, yeah, this is a reference to the place that I had been in WebProtégé. And what we're looking at here is a cell type, CL:0002057. And it has, you'll see here, an rdfs:label. So there's something from the RDF Schema, which is called label, and the label is CD14-positive, CD16-negative classical monocyte. And it has the tag CL:0002057. And then a definition and then a comment.
This cell type is compatible with the HIPC. So HIPC is human immunology, I forget, it's some consortium related to immunology. Lyoplate markers for CD16-negative monocyte. So you can put all kinds of interesting metadata on these ontological terms and get it out of here. For example, it was created by Terry Meehan in 2010.
Now what was interesting to me about this particular ontology instance is that it's not just telling me about the derivation of this concept through is_a relationships, but it also has these other relationships. Has plasma membrane part, CD14 molecule, and also a high-affinity immunoglobulin gamma Fc receptor I, and so on. And it was these additional features that I wondered about. How can we take advantage of that information? Is it uniformly available to that level of resolution for many of the classes of cells that we're interested in? And how do those terms play out in connection with the things we're going to do in RNA-seq analysis of these single cells? So that's a lot of questions, and I don't have a ton of answers. I haven't had time. But let's go on a little bit and see what we can do with ontoProc to go a little bit further. Questions or comments at this point?
So here we go. We call getCellOnto in ontoProc. We get an object we'll call cl. And if we get the first few tags there, you see that these are extremely general terms that they want to be able to make use of in characterizing cells. But if we have a bunch of these fixed-length tags, we have a function called onto_plot2 that will show us the relationships among cells, ultimately leading to the CD4-positive helper T cell or the CD8-positive alpha-beta T cell, which are derived from lymphocyte in this simple path. So this is a way of giving you some orientation. If you have a bunch of cell types and you put them into the onto_plot2 function, where the first argument is your Cell Ontology object and the second argument is just a vector of these tags, you can visualize this.
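One way to think about what such a plot displays is as the subgraph made of the selected terms plus everything above them. The sketch below is plain Python and does not reproduce how onto_plot2 is implemented. The hierarchy is a deliberately simplified toy keyed by labels rather than CL tags, and its parent links are assumptions, not the real Cell Ontology paths.

```python
def ancestors(parents, term):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def induced_subgraph(parents, terms):
    """Nodes and (parent, child) edges for the terms and all of their ancestors."""
    keep = set(terms)
    for term in terms:
        keep |= ancestors(parents, term)
    edges = sorted(
        (parent, child)
        for child in keep
        for parent in parents.get(child, [])
        if parent in keep
    )
    return keep, edges


# Simplified toy hierarchy (labels only; links are illustrative assumptions).
toy_parents = {
    "cell": [],
    "lymphocyte": ["cell"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "CD4-positive helper T cell": ["T cell"],
    "CD8-positive alpha-beta T cell": ["T cell"],
}

nodes, edges = induced_subgraph(
    toy_parents, ["CD4-positive helper T cell", "CD8-positive alpha-beta T cell"]
)
for parent, child in edges:
    print(parent, "->", child)
```

Terms that are not on a path to one of the selected cell types, such as the toy B cell here, drop out, which is what keeps the picture readable for a large ontology.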
Let's use the ctmarks function now, interactively, to do this. I'll put the mic down for a minute. A function called ctmarks. And this thing is going to actually work with an instance of the Protein Ontology and the Gene Ontology as well. And it will produce a Shiny app. And just so you can see, the tree here is for a certain cell type, which is the CD4-positive alpha-beta T cell. It's a little different from the other one. What I'm going to show you here is a Gr1-high classical monocyte. So the graph here has a lot of components that maybe we're not that interested in, but eventually we start getting down to this classical monocyte and see that there are different paths down to that cell type, 2395. So you can do that with any of the cells that are in this list, but this is a very special list. And the reason that it exists is related to those extra features that I showed you just a moment ago.
And the piece of the app that helps us think about this, maybe it's down at the bottom here. It blew it up a little too large. There. This tags component of the app is what was more interesting to me. The table here consists of information obtained about the query cell type. In this case, well, it's not the one I wanted. It's the Gr1-high classical monocyte. The other one may be interesting too. Traversing the intersection of Cell Ontology elements associated with it. When multiple distinct entries are present in the tag and name field, the properties of the query cell type are asserted to be the intersection of the properties of the named additional cell types.
Let's see what we're talking about here. This Gr1-high classical monocyte lacks a plasma membrane part. So this is a relationship in the Cell Ontology from CL 2395 to Protein Ontology 1014. This 1014 is PTPRC, CD45R. And the term for that is receptor-type tyrosine-protein phosphatase C isoform CD45R. So what we're saying here is that this type of monocyte lacks this Protein Ontology element. You can keep going.
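The kind of cross-ontology statement just described can be pictured as a subject, a relation, and an object. Here is a minimal sketch in plain Python, not the data model of ontoProc or of the app. The single triple is the one read off the screen above, with the full-length tags reconstructed from the spoken short forms (an assumption); the label table is a stand-in for name lookups in the two ontologies.

```python
from collections import defaultdict

# (subject tag, relation, object tag). Full tag forms are reconstructed from
# the short forms "2395" and "1014" given in the talk, so treat them as assumed.
triples = [
    ("CL:0002395", "lacks_plasma_membrane_part", "PR:000001014"),
]

# Stand-in for the name lookups in the Cell Ontology and Protein Ontology.
labels = {
    "CL:0002395": "Gr1-high classical monocyte",
    "PR:000001014": "receptor-type tyrosine-protein phosphatase C isoform CD45R",
}


def properties_of(triples, subject):
    """Group the objects asserted for `subject` by relation name."""
    grouped = defaultdict(list)
    for s, relation, obj in triples:
        if s == subject:
            grouped[relation].append(obj)
    return dict(grouped)


for relation, objects in properties_of(triples, "CL:0002395").items():
    for obj in objects:
        print(labels["CL:0002395"], relation, labels.get(obj, obj))
```

Keeping the relation name as data, rather than one table per relation, means further relations can be added as extra rows without changing the lookup function.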
Another thing that it lacks is CX3CR1. It also lacks SPN. But it has a high plasma membrane amount of lymphocyte antigen 6C2. So all of this stuff is encoded in the Cell Ontology by stating specific relationships to the Protein Ontology. There are also relationships to the Gene Ontology, which tell us that this cell has a nucleus. And that's less interesting, I would say.
Vince, Wes asked online if there are any disease-state cell ontologies in there. Disease-state cell ontology. Like myeloma cell from plasma and B cell lineages.
Yeah. Well, let's see. We could certainly see whether there's anything like myeloma in this particular collection. And then if there isn't, we can go to a different place to check. So we will work on that. So if we say MYELOM, we don't. Now, that doesn't mean that it's not in the Cell Ontology instance that we have here. Let's take a quick look using R. I'll stop the app. Any questions about the app? Okay. So let's just stop that. And now, to answer your question, we will just query the names in the Cell Ontology here. So no luck there. If you have other, I don't know, disease-specific cell type names that should be in the Cell Ontology, then we would need to petition the Cell Ontology consortium to introduce those and give them tags and relate them to the other cell types.
There is some literature on what to do when you want to think about new cell types. So let me just jump into a little paper here before we move back, just one second. So this paper is getting a little old, 2017: "Cell type discovery and representation in the era of high-content single cell phenotyping." I'm not sure this is exactly the one that I wanted to look at, but yeah, here you go. So let's blow this up a bit.
What they're doing here is single-cell RNA-seq, and this heat map is showing you that there's a bunch of genes here and cells, and you start to see this pattern where there may be cell types that are very specifically identifiable through having high expression of certain little groups of genes. Their point is to turn these into cell types in the ontology that refer to very specific subtypes of brain cells. So looking up these papers from Scheuermann, this was related to the Human Cell Atlas ontology project, gets you there.
The question was, if you had this information but the Cell Ontology didn't represent it, what would you do in order to introduce that and make use of it in your own work? And that's one of the things I wanted to do with ontoProc: to have facilities to enlarge the ontology you're working with to deal with new types and features and so forth. And if you read the vignette, we do get into that a little bit towards the end, specifically referencing this. So that's a long-winded answer to a good question, and hopefully we can make progress in thinking about those disease-specific cell types that you mentioned.
I think we're winding this thing down. We have nine minutes to go, and the last piece is a wonderful little bit of work by Aaron, which has to do with this concept. He calls it rolling up. So you may have one reference where there's been very fine-grained definition of the reference cell types, and another where they're rougher. And you're not going to be able to do anything with the rougher characterization, but you want to be able to conveniently roll up the fine-grained classifications to the same level of resolution as the other reference. That would allow you to work with a larger family of single cells to do your annotation. And so this concept of the latest common ancestor is something he wrote a little bit of code to do, and that was incorporated into ontoProc. And here's how it works.
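The rolling-up idea can be sketched in a few lines: take the ancestors of each fine-grained term, intersect them, and keep the common ancestors that sit lowest in the graph. This is plain Python written for illustration, not the code that went into ontoProc, and the toy hierarchy below uses labels with assumed parent links rather than the real Cell Ontology.

```python
def proper_ancestors(parents, term):
    """All terms strictly above `term` in a {term: [parent terms]} map."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def latest_common_ancestors(parents, terms):
    """Common ancestors of `terms` that are not above another common ancestor."""
    if not terms:
        return set()
    # A term counts as its own ancestor, so one term can roll up to another.
    ancestor_sets = [proper_ancestors(parents, t) | {t} for t in terms]
    common = set.intersection(*ancestor_sets)
    return {
        candidate
        for candidate in common
        if not any(
            candidate in proper_ancestors(parents, other)
            for other in common
            if other != candidate
        )
    }


# Toy hierarchy keyed by labels; the parent links are illustrative assumptions.
toy_parents = {
    "cell": [],
    "leukocyte": ["cell"],
    "monocyte": ["leukocyte"],
    "classical monocyte": ["monocyte"],
    "CD115-positive monocyte": ["monocyte"],
}

print(latest_common_ancestors(
    toy_parents, ["classical monocyte", "CD115-positive monocyte"]
))
```

Because a term can have several parents in a directed acyclic graph, the result is a set: two terms may have more than one latest common ancestor.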
This is just building a graph from cl$parents. So this Cell Ontology, I didn't talk about the structure of this thing, but basically cl is a list of things like parents, children, terms, and so on. And basically, if you take the parents and do this little job on it, you get a graph with, in this case, 9,800 nodes and 15,000 edges, and that represents the Cell Ontology. And then we have a nice function which says, find the common ancestors of two sets, potentially sets, which would just be vectors of CL tags, Cell Ontology term tags. Give me the graph, give me these sets, and I'll tell you about the common ancestors.
And what happens here is, if these are the two terms related to, yes, over here, is that right? No, that's a different one. Just give me a second here. Yeah, there we go. So let's say I had one system that annotated things down to either classical monocyte or CD115-positive monocyte, and I had to roll it up. Well, I could put this in and ask, what's the most recent, or at least latest, common ancestor? And let's see if that calculation worked. Should be 576. And there it is. So by putting in those more fine-grained terms and asking for the common ancestor, we get back the common ancestor that we saw on the plot. And this is much more general than that. It can deal with the latest common ancestors for collections of cell types. And that's why you get this more complex representation of the descendants of this thing. But I thought that was a pretty nice little demonstration of that, which we can see is correct.
So I think, yes, go ahead. Does the Human Cell Atlas use the Cell Ontology? I think they would like to, and I think they would like to contribute to the Cell Ontology to make sure that cell types that are identified there are present. Whether it is used systematically to annotate all of the cell atlas resources, I would not be sure. The dean. Okay. So the answer here is CellTypist. What lab is this? Sarah Teichmann's lab. Okay.
And is CellTypist an ontology? Okay. Okay. Be careful, because the Cell Line Ontology is not the Cell Ontology. Yeah. Yeah. Good. Well, CellTypist is something we want to bring into ontoProc if it's ready for that. So we'll take a look at that.
Anything else? Yes. Yes. Yeah. So the question is about anatomical ontologies. We have Uberon, which I think is the main one. And a few other options. If you would file an issue, I would try to get it going. Thank you. Yeah.
Now I'll ask Erika, are we going into closing at 5:30 or at 5:15? 5:15. So we're done. Thank you very much for coming, and it's time for the closing ceremony. Well, then ask some more questions. Not at all. Okay. All right. I'm going to shut down. Thank you very much.