So thank you very much for inviting me. I have the pleasure of discussing where we go from here, and of talking a little bit about experiences in data integration in the neurosciences. This topic, databases and ontologies, could obviously be covered from many different angles, and we could easily spend an entire course on any one of them. I just want to let you know right from the beginning that I'm not going to delve too much into the technical details, first because I'm a neuroanatomist, and my experience is not that of an information scientist or a computer scientist; but secondly because I believe the issues we're going to raise here are more along the lines of where databases and ontologies play a role in really starting to knit together all of the different types of activities that you've seen here. There's a lot of excellent material on the web. Many of you in the audience probably know more about the technical specifications of the different representation languages than I do, and I have references to those so that you can look them up. But in my experience there have been few areas in biomedical science more misunderstood, abused, and maligned than ontology. Yet if we're going to do what everybody in this room claims they would like to do, which is to knit together all of the information coming from all of the different techniques and sources of data that we have in neuroscience, ontologies, I do believe, play a key role. So we've said for a long time that neuroscience really is a data federation problem. David kicked off the meeting with a slide that says, here are all the different levels of the nervous system that we study, and here are all the different ways that we study them. We have seen that slide repeated over and over again. And it's important to recognize that at every single one of those levels, people produce certain types of data.
They produce images, or they produce plots, or they produce traces. And we really don't have a unifying data type in neuroscience. We don't all rally around the sequence; we don't all rally around the x-ray coordinates of a protein structure. We rally around everything. And our entire job is to try to figure out how these different scales, these different ways of studying the nervous system, come together, knitting that information together across scales and techniques, because we don't have the über-technique. There is nothing that will take the entire complexity of the brain and reveal it in all its glory. We have to rely on information integration. Up until now, the unit of information integration was the person. That's why we spent 20 years getting our PhD, and then another 40 years, if we were lucky, studying the brain, and we became huge fonts of knowledge and huge engines of data integration. But we all know that there is an information explosion. We've always felt that there was too much information for us to handle. I read some papers about the first scientific journals, back in the 1600s, and people started to complain even then about information overload: there were too many articles to handle. When the telegraph was invented, people said, ah, too much information, we can't handle it. We haven't yet reached our capacity, but I think we're getting close, if only because we now, through the internet, know all the stuff that is out there. So how are we going to handle this? Well, one of the first things I want to do is introduce a few terms. We have two ways, really, that we can integrate information in terms of information systems. With the emphasis these days on big data, you will hear a lot about data warehouses. A data warehouse essentially means that I take all the information that's available on any given topic and I create one large schema, one large data model, that describes it all.
And if I get a new source of information, I align it, I clean it, I fit it to that large data model. So large consumer entities like cable television companies are assembling huge consumer databases where they take everything about you and put it together so that they can tailor their products to you, or for whatever other reason they have. In neuroscience, obviously, that would be a huge undertaking, because we are dealing with so many diverse data types, and, as we saw, our techniques don't really line up all that well. We saw that in JB's talk, where we have a functional parcellation of the brain and an anatomical parcellation of the brain, and they kind of relate to each other, but we really don't know how they relate in any sort of perfect way. In a data federation, the goal is much the same, only it is achieved virtually. The data are kept in separate databases, physically distributed; where doesn't really matter. Each has its own way of modeling the data, its own way of dealing with the data, that may be optimized for its particular community, but we can still draw on them, pulling from them through access services or other mechanisms, so that we can knit them together. But again, the goal of these two approaches is exactly the same: we want to align our data and make it consistent so that we can ask powerful questions of it. So if we look at the state of biology (I made this slide for a text mining conference I was at last week), we can ask, what sort of knowledge are we talking about? These again are terms that I'll be using during the course of this talk. At the top (and I couldn't make PowerPoint fade it out into infinity) is basically what is potentially knowable inside of biological systems. That is, of course, a question we can't really quantify, but we suspect that we actually know very little of what's going on out there.
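The warehouse-versus-federation distinction above can be made concrete with a toy sketch. This is not NIF's actual code; the two "sources", their field names, and the mediator function are all invented for illustration. The federated approach translates one query into each source's own schema on the fly and returns results in a single common model, rather than copying everything into one warehouse first.

```python
# Two hypothetical sources that model the same kind of facts
# under different schemas (all names invented for illustration).
source_a = [
    {"region": "hippocampus", "technique": "fMRI", "subjects": 12},
]
source_b = [
    {"area_name": "hippocampus", "method": "EEG", "n": 8},
]

def query_federated(region):
    """Mediator: map each source's schema into a common model at
    query time, then union the results."""
    results = []
    for row in source_a:
        if row["region"] == region:
            results.append({"region": row["region"],
                            "technique": row["technique"],
                            "n_subjects": row["subjects"]})
    for row in source_b:
        if row["area_name"] == region:
            results.append({"region": row["area_name"],
                            "technique": row["method"],
                            "n_subjects": row["n"]})
    return results

print(query_federated("hippocampus"))
```

A warehouse would instead run those mapping steps once, at load time, and store everything in the common model; the query-time behavior is what differs.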
I think everything we saw today suggested that. But there's a lot of information that's potentially available, and the bulk of it is what we call unstructured information. That's information contained inside text. We publish a lot of our information inside journal articles, but there are also images, and again, human beings; human beings are fonts of knowledge. So that information is potentially available but not readily available, because it's very difficult to extract. The type of information that we typically talk about in databases and ontologies is really structured information: information that is structured so that a machine can read it and do something reasonable with it. And if we look at the amount of such data that's available, even though it is very impressive, it is probably a teeny tiny portion of the actual information that we have. So we're really accessing only a very little of the information that's out there, and one of our questions is how we can make more of our information machine processable. You already heard that one of the definitions of neuroinformatics is databasing the brain. Why do we want to database the brain? We want to database the brain so we can make the information that we have more computable. So again, what I'm mostly going to be talking about today is structured data, and I'm sure many of you have heard the terms structured and unstructured. Almost any term that I use today has a multiplicity of definitions; we spent a lot of time on the definition of neuroinformatics, and I'm not going to spend a lot of time going through them all here. I'm just going to give you reasonable ones according to my understanding of the way this works. Essentially, when we talk about structured data, we're typically talking about data that has been organized with respect to a data model, so we know something about the structure of the data. We've specified the data types: this is a string, a word, or this is an integer, a number.
And there's a formal query language, something by which we can select information according to rules. In the case of the most common type of structuring that we think about, a relational database, we organize our data into tables with columns and rows, we can specify something about the data types, and we have SQL, or "sequel", as the query language. That allows us to say: I would like to select subjects where the species is mouse and the mice are over 40 days of age, and the database can answer this because it knows that 50 is an integer, and therefore that 50 is greater than 40. So I can do these types of basic operations on top of my data. If I try to ask the same type of question of text, there are plenty of articles out there that could in principle answer it, but the computer doesn't know anything at all about integers or data types or anything else, because we use numbers symbolically and not just as numbers. It doesn't know that 50 is actually an age; it could be something else. And again, we have to use techniques like entity recognition and natural language processing to pull this out. If anybody's ever tried to do this, or if you've read a scientific paper looking very closely at its rhetorical structure, you see that this is very, very difficult, because humans write papers to communicate with other humans. They do not write papers to communicate with machines. So we tend to express things in particular ways, or not express things at all and leave things out, because we're very good at inferring them. So for example, if I'm looking for immunolabeling studies and I know anything at all about protocols, I know that "labeled using antibodies" means that this is an immunolabeling experiment, but the computer has no way of knowing that, or of knowing that that's a protocol. So one thing we know is that our ability to access data has changed dramatically over the years.
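The mouse-over-40-days query described above can be run end to end with Python's built-in sqlite3 module. The table and its rows are made up for illustration; the point is that because the column is declared as an INTEGER, the database can evaluate "age > 40" numerically, whereas in free text "50" is just a string of characters.

```python
import sqlite3

# An in-memory relational database with a typed schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subject (id INTEGER, species TEXT, age INTEGER)")
conn.executemany("INSERT INTO subject VALUES (?, ?, ?)",
                 [(1, "mouse", 50), (2, "mouse", 30), (3, "rat", 60)])

# Select mice over 40 days of age; the comparison works because
# the database knows age is an integer, not a symbol.
rows = conn.execute(
    "SELECT id, age FROM subject WHERE species = 'mouse' AND age > 40"
).fetchall()
print(rows)  # [(1, 50)]
```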
And I think the effects of the internet and networked computers have been felt most strongly in the area of data. As many people will point out, the PDF, or portable document format, is pretty much just a replication of what we do on paper, except that we deliver it electronically. But if we think of something like the human genome, and what I have here, the encyclopedia of life: if we had to publish that in a textbook, we probably never would have bothered to obtain it in the first place, because flipping through thousands and thousands of base pairs wouldn't be very useful. That is, if we don't make this information accessible to some sort of algorithmic analysis, what is the point? And we have various tools that we use. We have spreadsheets, the biologist's favorite database, if you want to call it a database, the Excel spreadsheet. We have some large databases, like the Protein Data Bank and GenBank, that people access. But many of you, I'm sure, are familiar with this idea of the semantic web, or the web of data. How many people have heard that term but never really understood what it was? Okay. So the idea here is that the current web was really built around sharing documents: you have web pages and websites that you can go and read, and we have some ways of searching for things, for strings, inside of those. But the web, by and large, does not function like a large database. It's not a large structured data repository. In order to access most of it, you have to understand unstructured information, and Google has done a remarkable job of using statistics and other methods to bring us what we want. But you as scientists know that trying to search for something on Google can often be very, very frustrating, because, again, it doesn't know what we mean. It only knows something about the statistics of how things are used, and a lot of what we do is statistically not all that prominent, because it's so esoteric and arcane.
So Tim Berners-Lee and others had the idea that, since we have all these networked computers, we don't just need a web of documents; we also need a web of data. The idea of the semantic web was that we would be able to express data in a way that makes its meaning explicit, somewhat independent of any one data model or schema, and that we'd be able to mesh it all together, so that I could access data just as I can grab a document with an HTTP call. The idea of the web of data, again, is that it's distributed all over the place, but I can bring it together as I need to in order to answer questions. And usually when we talk about the semantic web, we talk about things like RDF and URIs, and I'm going to get to later what those all mean, okay? The first thing I want to talk about is the context for my presentation and why I even got interested in all this, interested in ontologies. So I lead a project called the Neuroscience Information Framework (NIF) project. This is a project funded by the U.S. National Institutes of Health, specifically by the 16 institutes that deal with some aspect of the nervous system. It's kind of interesting that if you go to the NIH, you see there's one institute for heart, lung, and blood, one for digestive disorders and kidneys, one for imaging, and 16 that have something to do with the nervous system, including heart, lung, and blood and digestive diseases, okay? So obviously we're distributed, there were very few unifying programs, and the institutes were realizing they were funding the same things over and over again, and they were getting a little tired of it. So they wanted a big cataloging effort. You heard about the Human Brain Project already in Mark's talk; they said, we've paid millions of dollars to create databases for the brain. Where are they? Who's using them?
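The web-of-data idea just described, with RDF and URIs, can be pictured with a toy sketch: facts expressed as subject-predicate-object triples whose identifiers are URI-like strings, so that independently published data sets can be merged by simple set union and queried without any shared schema. The URIs and the tiny anatomy fragment below are made up; real RDF tooling adds much more, but the underlying data model is just this.

```python
# Facts as (subject, predicate, object) triples with URI-style names
# (all identifiers here are invented for illustration).
triples = {
    ("http://example.org/region/CA1", "partOf",
     "http://example.org/region/hippocampus"),
    ("http://example.org/region/hippocampus", "partOf",
     "http://example.org/region/cerebrum"),
}

# A second, independently published data set merges by set union;
# no schema alignment step is needed.
triples |= {
    ("http://example.org/region/CA1", "contains",
     "http://example.org/cell/pyramidal"),
}

def objects(subject, predicate):
    """All objects related to `subject` by `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("http://example.org/region/CA1", "partOf"))
```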
What are they doing with them, okay? So we were tasked with this job of saying: what sorts of things are out there? How many are there? What domains do they cover, and what domains don't they cover? Where are they? Who uses them? Who creates them? But more importantly, can we find them? Do we even know they're there? And how can we make them better in the future? This has really led me to think of neuroscience quite differently from the way I thought about it before, because whereas before, as a neuroanatomist, I thought about my work and about communicating my work to maybe a few of my colleagues, here you had to step back and look at neuroscience globally and say: you're doing really fabulous work over here, but so are 50 other people, and you don't even know the others exist; and even if you did, there'd be no way to put that information together, because we just don't have any unifying framework that would allow us to pull it together. So when I talk to students in particular, I say that because of the opportunity of the web, we have to start thinking beyond the things that we create for ourselves and think a little bit about how they're going to plug into this larger vision. And that's not entirely intuitive. I will not say that we know exactly how to do it, but I don't think it's too early to start, so we can avoid some of the mistakes that we keep making over and over again. So basically, when NIF started, we were given a budget and a certain amount of staff. People often say, and I'll get to this in a minute: how come Google doesn't do this? Well, again, this is very specialized knowledge, and the type of information that we tend to need is not that accessible to Google.
But when people also say, we want this to work as well as Google, to be as easy to use and have all the things that Google does, we remind them that Google is worth probably $100 billion and has a staff of 60,000, and we don't, okay? We have a lot less than that. So first of all, anything we did also had to recognize that the people we were serving did not have billions of dollars and thousands of staff to devote to IT. We therefore had to make a real effort to develop a system that would work with the big messy ecosystem that we have now, and not with some idealized future, because we were never going to get to that idealized future, right? So the very first thing we said is: it would be nice if every piece of information we need were in a database, but it's not. So how can we make data as collectively searchable as we can? Well, we're going to have to search the literature, because the literature is the main place where things are mentioned, resources in particular, which are what we're looking for: databases, tools, materials, services. But as it turns out, a lot of this is contained in databases, and I will show you how many of them there are in just a moment. And how do we search across those when everyone has a different schema, a different terminology, and what have you? So the largest part of NIF is actually what we call the NIF Data Federation, which currently searches about 170 different sources covering different data types, and we organize them roughly by data type and nervous system level. We also have a catalog, which again is our effort to find out what's out there. So we have a very simple catalog that says: there is data over here, there are tools over here, there are materials over here, and these are available to you. This is supported, as you'll see, by a large ontology for neuroscience, and that's what I'm going to talk to you about when I get to ontologies.
But we've had to do various things in order to be able to search across these sources that are currently not supported by Google. In particular, we had to deal with something called the hidden web. Okay, we will get to that. I was going to ask: how many of you have heard of ontology before? Okay, a few, but I will explain. All right. So what is the hidden web? The hidden web is basically that part of the web that might have some portion accessible to a search engine like Google, but that by and large cannot be indexed by Google. Mostly that is data that sits in dynamic databases. When I put a query into a dynamic database and I ask for those mice and immunocytochemistry, it dynamically constructs a view. There's no stable URL where that view exists, and so it cannot be indexed, right? So essentially, again, NIF simultaneously searches across multiple sources of information, trying to find the things that you might be looking for. We're often asked: why do we need you? We have Google. And again, Google was designed to share documents, and we have to deal with this thing called the hidden web, also called the deep web or, underneath all that, the dark web. So how many of these resources are there? NIF has been in the business of trying to catalog these resources now for about four years, and we currently have about 5,000 of them, of which 2,000 are databases. These databases cover all different sorts of things. They're not all exclusively neuroscience; some of them are not neuroscience at all, but as we just heard, it's very hard to draw a hard line between what is relevant or not relevant to neuroscience. Is the set of genes in yeast relevant to neuroscience? Of course it is, okay? So we catalog that too.
So just the logistics suggest that if you wanted to ask a question whose answer was in one of these 2,000 databases, you were not going to be able to find it, because you cannot visit 2,000 different databases all at once. That's one of the reasons why NIF spent so much time on building its data federation. You can also see, by the little map up here, that these resources are being produced all over the world. We have very heavy representation from the US and Europe, but there are things happening all over the place, and we encourage everybody here who might have a tool to register it in the NIF resource catalog. I also want to say something about what we mean by data. When a bioinformatician comes to NIF, they say, well, you don't have data. We say, what do you mean we don't have data? They say, well, we're looking for numbers. And we say, well, we have some numbers, but again, we're a data federation, so we tend to query the metadata and take you to the original source to get the numbers, because we see no reason to store those. In fact, when you look at the databases people are producing, there are all different types of information in them. You have the kind the bioinformaticians want (unless you have it, they sneer at you), which is basically raw numbers, and "raw" is a relative term. There are also things like MRI databases, where you can go and get the scans; GEO, for example, which has microarray data; and the microscopic image databases. There are a lot of databases that have what I call secondary data. That is, these are quantities that might be derived from primary data, but they're not the data themselves. Now, a lot of people are surprised that I have the Allen Brain Atlas up there, because they say, no, there are big images there. You can see the images.
In fact, if you try to access all of those images simultaneously and do the image processing on them yourself, you can't. You can go ahead and bring hard drives to them and take away the terabytes of data. But by and large, when you do a query on the Allen Brain Atlas, or you do an analysis, you are using pre-computed statistics that they give you, which serve as a proxy for the content of the image. You also see that there are a lot of databases that deal with tertiary data, or whatever you want to call it; these are really claims and assertions about the meaning of data. A lot of times, for us to be able to say something like "area X lights up in task B", it's not that one scan that's telling us. You have analyzed that scan with respect to an entire experimental paradigm, you've done the statistics, and you've made some claims about what's significantly up-regulated or down-regulated in a given experiment. There are a lot of people who extract information from the literature and, again, try to structure it so that you can query it. So there are all different types of artifacts being created out there, not just huge databases of sequences. We can also see that there are things that are aggregating versus single source. The Allen Brain Institute is a classic example of a single source: they have created one gigantic atlas, and it's internally consistent. And there are other things, like the Protein Data Bank or GenBank, where the community contributes individual pieces and they are aligned together. So there are all different types of things being created out there. So why do we want to bring this together? I like to point out that in an ideal world, there are two types of questions we would like to ask. We often think of data integration only in terms of the second one, which is: what is not known? We're going to use this for discovery science. We're going to pull all this data together. We're going to discover a new pathway.
We're going to discover a new vulnerability in Alzheimer's disease, and obviously that's the goal; that's why we say we want this data out there. But we'd also like to be able to find what's known. Anybody who's ever spent hours looking through the literature or Google for a particular reagent or a particular antibody or a particular rate constant knows how long it takes to find things that are even there, and we'd like technology to bring them to our attention. So I like to say that the types of questions we ask are not always profound; some of them are just plain useful. I also put four stars on this because I think it's very important, and when I do start to talk about ontology, this is one of the major clashes between practicing scientists and ontologists: when we say "what is known", we don't mean an assertion of fact. We mean we'd like to see all the data that is pertinent to answering that question, and we expect that some of that data will be contradictory. We don't expect that it will all be consistent, because that's science, but we would like to have all that information brought to our fingertips. So here, for example, is a type of question one might be able to answer with NIF: what are the connections of the hippocampus? Well, the very first thing NIF recognized was that it had to confront the terminology question. That is, we have many different names for the same thing; the fact that the names aren't even used consistently will be dealt with later. If you just tried to search for "hippocampus", you would lose or miss a lot of information, because some people didn't call it hippocampus; they called it CA1, or they called it Ammon's horn, or they called it something else. So you had to be able to deal with the terminology that neuroscientists use. So if you search NIF for hippocampus, it will automatically expand the query out to the synonyms.
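The synonym expansion just described can be sketched in a few lines. The vocabulary below is a tiny invented fragment, not the NIF ontology; the `include_parts` option anticipates the granularity problem, where statements about parts such as CA1 should also be found when you search for the whole structure.

```python
# Hypothetical terminology fragment (not the real NIF vocabulary)
synonyms = {
    "hippocampus": ["hippocampus", "Ammon's horn", "cornu ammonis"],
}
parts = {
    "hippocampus": ["CA1", "CA2", "CA3", "dentate gyrus"],
}

def expand(term, include_parts=False):
    """Expand a query term to its synonyms, and optionally down
    the part-of hierarchy, so no naming variant is missed."""
    terms = list(synonyms.get(term, [term]))
    if include_parts:
        terms += parts.get(term, [])
    return terms

print(expand("hippocampus"))
print(expand("hippocampus", include_parts=True))
```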
It will then tell you: we've got about 170 databases, and some of them actually have a data type called connectivity. These federated views are assembled in different ways, through semi-automated processes and also human curation. So we categorize these by data type, and one of the data types is connectivity, and it turns out there are about six databases out there that we know about that have connectivity information. But we also know that they don't all deal with the same level of granularity. So, for example, if you search for hippocampus, somebody might have statements about CA1, CA2 and CA3. Those are parts of the hippocampus, but they're not the hippocampus itself. So you may want to add more terms to your search, and this is what our ontology does for us: it allows us to go ahead and add some of these terms. But again, NIF is a data federation, so it will link back to the original source. When you go to these sources, they're all very different. They all have their own user interfaces and their own ways of presenting information. So you have to spend a lot of time trying to reconcile all the different things that are there, and if you're dropped into the middle of one, it's often very confusing. So we try to put tutorials up that say, here's how you use it. For example, if I went to the three databases that we had at the time with information about the hippocampus, these are the three user interfaces you would be presented with. They don't tell you anything immediately; there's nothing immediately understandable. You would have to understand what the creators were looking at, what they were going after, and how they were trying to organize it. But NIF, because it's run by neuroscientists, said: you know, all of these databases, despite the fact that they have many different ways of saying that A connects to B, and many different granularities, were pretty much all saying the same thing.
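That alignment step, where several schemas all reduce to "region A projects to region B, with some strength or evidence", might be sketched like this. The field names and example rows are invented; each per-source adapter maps one native schema into the same simple common model.

```python
# One adapter per source, each mapping an invented native schema
# into a common "A projects to B" record.
def from_source1(row):
    return {"source": row["from"], "target": row["to"],
            "strength": row["weight"]}

def from_source2(row):
    return {"source": row["origin_region"],
            "target": row["termination_region"],
            "strength": row["density"]}

records = [
    from_source1({"from": "CA1", "to": "subiculum", "weight": "strong"}),
    from_source2({"origin_region": "field CA1",
                  "termination_region": "entorhinal cortex",
                  "density": "moderate"}),
]

for r in records:
    print(f"{r['source']} projects to {r['target']} ({r['strength']})")
```

Note that "CA1" and "field CA1" survive as different strings: the data models are reconciled here, but the terminology is not, which is exactly the residual problem described next.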
They all organized information in the way they thought best, but if you actually looked at what they were telling you, it was always that brain region A projects to brain region B, with some evidence and some strength. So NIF was able to align all of these, and if you look across here, it's quite obvious what we're saying: despite those complicated interfaces, it was just a very simple statement of A projects to B. Now, you will notice all the different terminology that's used here: CA1, field CA1, all these different things. This presents a problem, and we didn't try to reconcile it at that level, but at least we could reconcile the data models to a certain degree. So this led us to create NIF's three simple rules: if you're going to make a data resource, how do you need to present your data in order for someone else to be able to find it? And this is not your colleague, whom you can go to and say, oh hey, I have a database; this is NIF, right? So if an automated agent is going to be able to access your database, three things have to happen. First of all, your resource has to be findable. Now, when I start talking about ontology and this ability to formalize our conceptualizations of the field, the first thing every individualistic scientist does is balk and say: why should I use your terminology? I want to use my own terminology. And I say: you may do whatever you want, but if nobody can find your lovely resource, you're certainly not going to have much of an impact. This is why you need to think globally. It should be accessible through the web (downloaded PDF files that sit in your desk drawer are not particularly accessible), it should be structured or semi-structured, and it should have sufficient annotations that an agent could reasonably find it if you had something in there that was relevant. Secondly, you have to be able to use it.
So we heard from Mark before about the fact that data should be in tables. We find a lot of supplementary material where people publish a table as a JPEG image; an image is not a particularly usable form for data. Data need to be actually accessible. You can also publish Affymetrix chip data sets, but no one can read them without the custom software. So you really want to make your data usable. And finally, and this is going to be the bulk of my talk, you have to know what the data mean. What I mean by that is that there have to be some semantics attached to what it is that you're actually looking at. There has to be some context, some experimental metadata; to scientists, how you got your answer is more important than the answer itself, and if you don't understand that, you will not trust it. And there should be something about the provenance: where did this thing come from, if I got it from somewhere else, okay? And again, what I'm saying is that an information framework makes this a lot easier to do. So here's an example of semantics. What does 42 mean, right? What's the real answer here? Exactly: the ultimate answer to everything. It's the meaning of life, for those of you who have read The Hitchhiker's Guide to the Galaxy. The number 42, though, is somehow unsatisfying. Without context, I don't know what it means, and I can't do anything with it. NIF sees a number of resources where the value inside a column is simply 1. What does 1 mean? It can mean anything, actually. It could be an age, it could mean male, it could mean present, it could mean cerebellum. If you're producing your resource for yourself, 1 is entirely internally consistent. If you pull it out of your resource, it is not consistent at all, and you don't know what to do with it, okay? Now, in this case, it's a very simple substitution: you just say present or absent, or male or female. But again, that's extra work.
So if we look at NIF's integrated gene expression view, where we take data from the Allen Brain Atlas, the GENSAT atlas, and the Mouse Genome Informatics (MGI) atlas, we see that they all talk about genes. They use a common gene symbol, so that's great; we're able to pull it together, and the gene names are consistent. Brain structure: well, they're all talking about cerebral cortex, and we'll get to what that really means in a moment. But if we look at expression level, we see that Allen has 23, GENSAT has "moderate to strong", and MGI actually had 1, which we changed to "true" or "present". When we got all three of them on the phone and asked whether there was any way we could reasonably calibrate this data so we could translate the 23, Allen said no: it only makes sense inside the database itself; it never makes sense in a global context. The same with GENSAT. If you know how the GENSAT database was created, with GFP tied to specific promoters, the value tells you about the strength of GFP production, but it doesn't tell you anything at all about the gene itself. So there's no universality to this; there's no way for us to really answer the question without going to each source and understanding it in depth. Sometimes that's necessary; sometimes, again, simple substitutions will do. So I'm going to talk a lot about the scourge of neuroanatomical nomenclature, and I actually have a poster on this at the upcoming meeting. If we ask whether cerebral cortex actually means the same thing to everybody, I don't think it's a surprising answer that it does not. And sometimes we don't know what it means. People talked about cerebral cortex, but they didn't tell us what they included as part of the cerebral cortex. Some of them did, but you can see that they all had slightly different definitions, different terminologies, whatever it was.
So there was really no good way to know whether you were comparing apples to apples and oranges to oranges from the sorts of information that we were given. So how do we report information so that it is much more interoperable, so that we can realize things like this web of data or whatever it is? Well, again, I emphasized already that having some sort of framework in which you report the data is very, very important. And by a framework, again, I mean several sorts of organizational principles, operational principles, that will say, well, let me pull A and B and bring them together so that you can see them. So we have two frameworks that have become very popular on the web today. These are not the only frameworks, but we all know about Google Earth, and we know that if we have geo-coordinates of all the different things that we have referenced to the Earth, we can use that as a large data integration engine. So I can take everything that occurs in a place and I can say, oh, I can bring these things together. It really doesn't matter what it is. We have things like Wikipedia and other things which allow us to bring together terminological knowledge, right? So terminological knowledge, the words we use, the terms we use to describe things, are extremely important. The first thing that happens when you become a graduate student in neuroscience, you don't tend to be plopped in front of a pile of numbers; you tend to be introduced to the concepts that define the domain. So you're introduced to the fact that there's a brain and there is such a thing as the cerebral cortex, and you're given an introduction to the sort of high-level organization of the domain. So there are efforts going on in INCF and other places to establish common coordinate systems. We heard David mention this, I think, earlier. And that's a very powerful thing. So Allen Brain, Paxinos, they all have different coordinate systems, but one can do transformations to reasonably align them.
We saw with human brains, it's not perfect, but it's better than nothing. If you want everything that goes on here, at least you can pull that if you know something about the spatial coordinate systems. But when we're dealing with multi-scale information, I like to talk about, again, the limitations of space. And here we can see all these different images of the cerebellum, and some of them are models and some of them are micrographs, but there is a relationship between all of these. But I defy a computer process to necessarily be able to fit all of them together. You might be able to do it based on color and things, because they come from, this one, for example, comes from the same data set as that one. But in fact, there's very little relationship between this electron micrograph and the thing that you see here. What ties this together is this: that I happen to know there's something called the cerebellar cortex, I know something about the cells, I know something about the parts of cells. So when I go over to one graduate student and they have a wave trace because they're recording from a Purkinje cell and they produce a little squiggly thing, and I go over here and somebody's looking at an electron micrograph of the cerebellar cortex, I say, ah, I can bring those together. If I went on data features alone, I'd have a heck of a time bringing them together, because the data features are not in common. It's a huge challenge. Even humans have a hard time, as good as we are at spatial processing, going between electron microscopy and light microscopy, just because the contrast mechanisms and scales are so different. So this brings us into the realm of ontology, all right? How do we establish a shared conceptual framework, with some machine processability, that we can use in order to integrate information and tie information together?
So ontology, in its formal definition, is an explicit formal representation of concepts and the relationships among them within a particular domain that expresses human knowledge in a machine-readable form, okay? Machine-readable again means that the computer can perform the same sorts of operations on it as you would. So a very simple ontology would be something like: the brain has a part called the cerebellum, the cerebellum has a part called the Purkinje cell layer, the Purkinje cell layer has a part called the Purkinje cell, and the Purkinje cell is a type of neuron. Ontology actually arises from philosophy, and it is as old as Aristotle. Philosophers have been arguing about the nature of reality for many, many, many, many years. What makes this interesting in information systems is that we now have languages that can encode these relationships, which allows us to reason over these types of statements. The simplest type of statement is: a neuron is a cell, a Purkinje cell is a type of neuron, therefore a Purkinje cell is a type of cell, simple subsumption, but you can do a lot more, as we'll see, elaborate reasoning on top of this. But the whole point of this is to be able to take that conceptual knowledge that we have, the knowledge that I introduce you to in my introductory neuroanatomy class or whatever introductory class I have, and formalize it to the extent that a machine can do some of the same type of inferencing as you can. Now, I know people's hackles are already getting up and I know that people are gonna go but, but, but, but, and these are all legitimate and all realistic criticisms. But what I'm hoping to show you is that a lot of what people have heard about ontology, or dismiss about ontology, was fairly earned by a lot of the dissension that's been going on in the ontology community and its interactions with the domain scientists. But we believe at NIF that ontologies have a very, very reasonable place in data integration.
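That kind of subsumption is mechanical enough to sketch in a few lines. This is an illustrative toy, not how a real reasoner is built; the two asserted facts come straight from the example above:

```python
# Toy subsumption: from "Purkinje cell is a neuron" and "neuron is a
# cell", a machine can conclude "Purkinje cell is a cell" by walking
# the asserted is-a chain.
IS_A = {
    "Purkinje cell": "neuron",
    "neuron": "cell",
}

def ancestors(term):
    """Collect every superclass reachable from the given term."""
    result = []
    while term in IS_A:
        term = IS_A[term]
        result.append(term)
    return result

print(ancestors("Purkinje cell"))  # ['neuron', 'cell']
```

The inferred fact, that a Purkinje cell is a cell, was never asserted directly; it falls out of the chain.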
Indeed, we don't have a good way of avoiding ontologies. So if you're gonna try to integrate data across databases and things, ontology is an essential tool, but I will not make the case that it is the only tool or that it solves all of your problems. So let's look a little bit more into ontologies. What are they? How do we construct them? How do we use them? And how do we navigate what I call the ontology wars? So NIF has assembled a very large ontology for neuroscience. It's called the NIF Standard Ontology. It's a set of modules. Each one covers a single domain or level of the nervous system. So we have things like organisms, anatomical structures, cells, dysfunctions, nervous system functions, subcellular structures. We aggregate these from a lot of ontologies that were already in development by the community. So NIF pulls them in. It organizes them and unifies them. It's a very simple ontology, because we constructed it in a way that we hope facilitates its reuse as building blocks to say more complex things and more formal things. And we'll talk about that in just a moment. So what can ontology do for us? Why is it that NIF turned to ontologies when it was dealing with this data integration problem? The main thing is that it expresses neuroscience concepts in a way that is machine readable. So if you type in Ammon's Horn, NIF immediately says, ah, this is a synonym for hippocampus, I'm gonna expand it out. So you're able to type in whatever term you want, but it knows that all these different terms point to the same concept. So you don't have to worry about all of the terminological variability. It also provides us ways of defining certain classes. Those can be used very effectively in knowledge bases to allow us to look for consistency. So you said that a Purkinje cell was a type of cell.
If I define it appropriately and it gets classified as a type of brain structure, I know that there's a mistake. Most importantly, though, I think the idea of ontology is that it provides the universals for navigating across different data sources. So ontology in my view is about universals, not my particular data model, not my particular database, which is very application specific, but the universal knowledge that's required to tie the domain together, such that again, if I could visit all 2000 databases, I can use that knowledge to pull together the things that I need to pull, okay? So it provides a very powerful semantic index. It performs some reasoning. It also allows us to link data, not just through the concept itself, but through relationships of that concept. So for example, in that very simple ontology that I showed you of the cerebellar cortex, I can take a piece of a Purkinje cell and, through relationships, I can say yes, if you're looking for cerebellar cortex, this is relevant to you, because this is something that is found in cerebellar cortex, okay? It also provides the basis for more concept-based queries to probe and mine data. And this is a very important thing to remember. When you go into Google, you tend not to type in very structured queries where you say I would like humans who are at least 18 years of age or whatever it is. You tend to type in general concepts, like I'd like to see adult humans or I'd like to see data on whatever. The system translates that and sort of tries to pull back the knowledge for you. But this ability to issue concept-based queries is very natural, especially when you don't know what's in the system. If you're sitting in front of a database of MRI scans where it says I've got humans and I've got MRI scans, you might be able to issue a very, very complex, numerical type of query.
But if you're just out on the web and you don't know what's there, you're gonna issue a very general type of query. And there are certain natural ways of grouping classes together that make sense in a concept-based query, and I'll show you some examples of it. I'm not gonna talk too much about this latter one, but actually, as a branch of philosophy, ontology does make us think very deeply about the nature of the things that we're studying. We've sort of heard this alluded to already in several of the talks, that oftentimes these models become so real, or our ideas of these things become so real, that we sort of forget that this is a very artificial system that we're studying. So synapse to us conjures up this beautiful hourglass-shaped thing with all these different vesicles, but if we really start to think deeply about what the nature of a synapse is, you can see it's actually not so easy. Is it a junction? Is it a place where functional communication happens? These things, I think, do bear some additional inquiry. So ontology can help us work through those if we care to let it. So how does NIF use ontology? One of the first things it does is use ontology to help with this terminology problem. We can do so much with synonyms, but sometimes people just use custom notation, like one, or a custom abbreviation, and no synonym is gonna get that out. So there's a database called the SUMS database, which is a database of functional brain activation, and they have Brodmann areas, but they use something called Brodmann.3 instead of Brodmann area 3. We decided that that was not a legitimate synonym. It was a custom abbreviation. So what we did behind the scenes is we mapped Brodmann areas to their equivalent classes in the ontology. Now, as I'm gonna show you in a minute, all the classes in the ontology have a unique ID, and that helps us disambiguate these different ways of expressing things.
Okay, so we explicitly map database content, and it helps us to disambiguate non-unique and custom terminology. We also, again, have taken the liberty of translating various neuroscience concepts into automatic query expansions. So there are certain things where we gambled that if a neuroscientist were looking for them, we sort of knew what they meant. A good example is GABAergic neuron. If you query for the string GABAergic neuron, if you put that into Google, it will look for every place where that term occurs, GABAergic neuron, and it might use some fuzzy search so it gets GABA, it might know something about gamma-aminobutyric acid, who knows. But in fact, if I were looking at a picture of a Purkinje cell like you saw before and I were looking for GABAergic neurons, I'd say, oh, there's a GABAergic neuron, because I happen to know that Purkinje cells use GABA as a neurotransmitter. Now maybe that study was looking at dendritic properties, so nowhere in there did they say that this was a GABAergic neuron, but still, as a neuroscientist, if I were looking for anything that's interesting about GABAergic neurons, I would pull that out. So NIF has a rule that says, okay, if somebody's looking for GABAergic neuron, we're guessing that you probably want to know information about types of GABAergic neuron, and the way that we've constructed our ontology, the query automatically expands to all the different neurons that we know use GABA as a neurotransmitter, okay. And NIF has actually gone through and said, you know, we can reasonably guess that there are several things that one would like to be able to query conceptually. So for example, drugs of abuse. If you type drugs of abuse into NIF, it takes the entire list of drugs of abuse that were listed by the National Institute on Drug Abuse and it says, you're probably gonna look for these. It makes it a very easy way to sort of query across all these data sources.
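A rough sketch of what both kinds of expansion amount to, with a tiny made-up synonym table and subtype table standing in for the real NIF ontology:

```python
# Illustrative only: a real system would draw these tables from the
# ontology itself. Synonym expansion maps variant terms onto one
# concept; class expansion adds the known subtypes of a concept.
SYNONYMS = {"ammon's horn": "hippocampus"}
SUBTYPES = {
    "gabaergic neuron": ["Purkinje cell", "basket cell", "stellate cell"],
}

def expand_query(term):
    """Normalize a query term to its concept, then add known subtypes."""
    term = term.lower()
    concept = SYNONYMS.get(term, term)
    return [concept] + SUBTYPES.get(concept, [])

print(expand_query("Ammon's Horn"))      # ['hippocampus']
print(expand_query("GABAergic neuron"))  # includes Purkinje cell
```

So a search for GABAergic neuron also matches a Purkinje cell study that never uses the word GABAergic, which is exactly the behavior described above.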
So here you can see, for example, drugs of abuse is returning morphine and it returns cocaine and it returns heroin. Adult mouse is another one; increased expression is another one. We've said, well, we're guessing that if you want increased expression, you're looking for anything that was declared statistically significant by the author. We're not saying it's right or wrong, but we're saying that's a reasonable interpretation of increased expression. We even interpret things like adult, because people say, well, that's kind of ridiculous, because there is no age at which something magically becomes an adult. We're like, yes, we know that, but do you know how old an adult squirrel is? You can't really ask by age if you don't know what the stages of maturity are. So if you went to Google, you'd say, I'd like to see adult squirrels. So our whole thing about these concepts is that they're not beyond argument, but they're meant to be what every neuroscientist would reasonably know, a reasonable definition based on consensus. But the idea, again, is to bring the data there so that you can do further selection and further refinement according to your own particular definition of adulthood. So we've done things like drug of abuse, age categories, expression level, non-human primate, a very useful query to be able to do so you can pull all data about monkeys, and other sorts of things that are out there. And again, our line is that this should be arbitrary but defensible, because there is no hard line on most of this; human knowledge is fluid. These categories are useful ways for us to communicate about data, but there's no hard and fast rule about them, so we don't elevate them to any sort of universal truth. We say this is general domain knowledge. Okay, and again, this is why we define them the way we do, because we're thinking about how people are gonna query the data. So now let's look a little bit into these ontologies.
So NIF clearly makes use of them. We think that they're necessary. Somebody has to construct them, and somebody has to be able to take their domain knowledge and transfer it over. And so I've actually framed this as wading into the ontology wars, because I don't know of any other area where there has been so much misunderstanding that has led to such bad feelings amongst the different communities. And if you actually look at the papers, I gave you one of them, which said Mathematics, Computer Code or Esperanto. You see the different communities view ontologies very differently. There was actually a paper published back in 2003 that used the feuding between the Montagues and the Capulets in Romeo and Juliet as a frame for trying to understand ontology. And that's because there have been these very dominant personalities and dominant schools of thought that have caused so much dissension that I think it has obscured the fact that ontologies have a very critical place in a lot of what we do. So here you can see: two households, both alike in dignity, in fair genomics, where we lay our scene. One, comforted by its logic's rigor, claims ontology for the realm of the pure. These of course are the theoreticians, the philosophers, who claim that you can have an overarching theory of everything. The other, with blessed scientist's vigor, acts hastily on models that endure. And that tends to be the domain scientists, who are very much driven by practicality. They create hasty artifacts that tend not to be very good, but they're very, very useful. And from ancient grudge break to new mutiny. I just came back from the ontology conferences in Graz, Austria, and the same battles are still going on as were going on in 2003. So what we can see is that each has their own agenda about what constitutes an ontology and how they should be built. There's a trade-off between rigor and practicality.
There's this idea of everybody creating their own customized view of the world versus having some universals and community ontologies. And there are collisions between even realism and concepts. So I'm gonna go through some things about ontology, while constantly reminding you of places where, again, there's general disagreement, and more importantly, how it was that NIF navigated these waters. So the first skirmish is that when you bring up the word ontology, again, if there's an ontologist, a computer scientist, somebody else in the audience, they'll say, well, that's not an ontology, and they'll argue all day about what the meaning of ontology is. I don't consider that a useful argument. But basically what it says is that there are a lot of things that sort of look like ontologies, and whether they actually are ontologies or not doesn't really matter. So when we talk about terminology and human knowledge, we know that there are controlled vocabularies, and people who set up databases say, well, you have five values that you're allowed to use. And those are very useful things, because at least you don't get 60 variants of the word hippocampus, and they're cheap to construct and easy to implement, but there's nothing more to them. You don't really know what the values mean; they're just the values you're allowed to use. There are things called lexicons and thesauri, and lexicons tend to have more formal properties of terms, and so a lexicon will have a definition. It might aggregate synonyms, so you saw how NIF used these synonyms to help expand queries. It may have lexical variants, abbreviations, acronyms, and one can use more or less rigor in defining these things, but again, very useful things. Sometimes people organize these into taxonomies, and a taxonomy is just a class hierarchy, so the classic one is the taxonomy of life: a mouse is a rodent and a rodent is a mammal and a mammal is a vertebrate.
So you have these sort of single-axis taxonomies, and then you have what some people would consider full-blown ontologies, which they claim are distinct from these. I would argue, as does Mike Berman, that these are all on a continuum, right? They all kind of interplay with each other, and they mix and match, and you should use them as necessary, at whatever level of complexity is needed, just like F = ma, depending on what it is that you'd like to do with it. But an ontology generally also has relationships beyond a simple is-a hierarchy. So beyond a simple statement that a mouse is a rodent, it might say a mouse has a head, and it might say something about the physiology of the mouse; you can just say more complicated things. So you tend to have more relationships. What these things are is independent of how they are encoded. So we're gonna talk a little bit about the languages people use to actually create ontologies. The two most commonly associated with ontologies are RDF, the Resource Description Framework, and OWL, the Web Ontology Language. But again, you can see on this axis here that these are on a continuum, and depending on the expressiveness you need and the type of reasoning you need to do, you may choose to go one way or another. But again, they're not mutually exclusive of one another. So how do we build ontologies? Here are some of the issues that we have heard over the years and also that we have encountered in our ontological odyssey, as we call it. There are things about realism versus conceptualism. Can't I just use RDF? Why do I need to go to OWL? How do I name all my classes? What about shared versus custom ontologies, single versus multiple inheritance? So I'm gonna go through some of these right now and, again, tell you how NIF navigated these waters, even though this may not be universally agreed to by everybody. But one thing NIF has found is that the extremes on either end are almost universally wrong, okay?
So you can be too simple and have no semantics, or you can spend all your time appropriately placing the Purkinje neuron in its glorious hierarchy and at the end of five years have classified one neuron. That's not particularly helpful either. So generally, with all of these things, there's a useful middle way that lets you move forward reasonably without getting bogged down. But it should also be noted, and again, this is something that is not agreed to in the ontology community, that as research scientists, our sole goal, again, is to grab data. It is not to encode everything we know in an ontology. We don't think everything is decidable. We don't even think it's desirable to encode everything in an ontology. But there are some people who do, okay? We also have a budget and deadlines. And so a lot of what we do was not constrained so much by absolute best practice and ideological purity; it was based on the fact that the grant runs for so long and you've gotten so much money and you need to make choices. And in NIF's navigation across the entire resource space, we have not found a single group that does not operate under those constraints, okay? Everybody is resource limited. We also believe, again, as we're all trying to grapple with this, and as at the same time we are grappling with the representation of human knowledge, you've got text miners, you've got Google, you've got all these other people who are using other models to extract human knowledge, that there's going to be at some point an intersection or collision of all those. So we really don't know that it's worth spending a whole lot of time doing it perfectly here, because that all might be wiped away next week. So we believe that you evolve from the less formal to the more formal, and you set up your structures to allow that to happen. It is also true that building ontologies is difficult, even for limited domains, never mind all of neuroscience.
Neuroscience is actually a relatively poor candidate for developing ontologies just because of its complexity. But we've learned a few best practices that we actually think are quite useful, and I'm gonna tell you them here. First of all, trying to make what you do reusable, and we heard about this before, is always helpful; whether you can actually reuse it or not is another thing, but people should try. Start simple and add more complexity later, and more than anything, avoid the religious wars and separate the science of ontology from the actual informatics that you need to build your information systems, and this will let you get a lot farther. So some of the things I'm gonna talk about are numerical identifiers, unique labels, and single asserted, simple hierarchies. So skirmish number two: the second way to really anger an ontologist is to use the word concept. They hate the word concept, or at least some of them hate the word concept. That is, there's a whole school of ontological thought that says we're not dealing with conceptual entities, we're dealing with real things that are in the world, and if there's no real thing, how can I be sure about what I'm describing? And the scientists and others keep saying, yeah, but mostly we're dealing with conceptual entities here, we don't really know what we're describing, and they go back and forth. So you will hear terms like concept, entity, class and instance. They have definitions, and they have different definitions depending upon whom you speak to. I think for the most part, at this level, it doesn't matter, okay? And what's clearly the case, and I'll talk a little bit about this in my poster at INCF, is that we go back and forth between these two things. We go back and forth between a realist view and a conceptual view, and I don't care. I figure it's the ontologist's job to adapt to us and not the other way around. But we do wanna distinguish between a term, which is a lexical label that we apply to a concept, and the concept itself.
Those are two separate things. So we'll talk about that. But we will talk about classes, concepts, entities, and I will use those interchangeably. Instance I'm not gonna talk too much about, but generally an instance is something in the real world that represents that class. So I am an instance of a female, I'm an instance of a professor, I'm an instance of a lecturer, okay? But again, not all people agree on these definitions. Skirmish number three: representation of ontologies. So you have XML, you have RDF, you have OWL. How much semantics do I need? Generally, as I say, you should have as much semantics as you need, and you should not have any more, okay? Because as you saw on that arrow, the more semantics you build in, the harder it is to build the ontology, the more logical consistency you need, and actually the more computational overhead you have in trying to reason across these very large graphs. So a lot of people, when they're dealing with a database, just encode in XML. Is everybody familiar with XML? No? Okay. XML really has no semantics. It talks about attributes, and a human reading it can read meaning into it, but XML itself is very mute. It's just telling you about the structure of the data. RDF, the Resource Description Framework, which is the language of the semantic web, actually tries to decompose that knowledge into small pieces, or nuggets, that do have some semantics. And by semantics, I mean meaning. And that unit, that small piece, is called the triple. A triple, in the most simple sense, is a subject, a predicate, and an object. A Purkinje cell is a type of neuron. That's a triple, okay? So you see these statements all the time. Purkinje neuron (subject) has neurotransmitter (predicate) GABA (object). So if I were to ask an information system, what has the neurotransmitter GABA, it would say Purkinje cell, okay? As we'll see in a moment, the machine really doesn't know what has-neurotransmitter means, but it can read the label.
And because I've structured it that way, I'm allowed to ask that question. There are ways of adding additional conceptual knowledge on top of RDF, such as RDF Schema, which I won't talk so much about. And then there's OWL, the Web Ontology Language. There's a whole discussion about why it's OWL and not WOL, just for prosody's sake. But OWL is essentially a predicate logic, and you can say more things in it and place more restrictions in it than you can with simple RDF, okay? SPARQL, which I'm sure I'll mention, is just a query language that is used to query RDF, okay? So you'll hear terms like SPARQL endpoint, and that just means that you've set up a node that can accept an RDF query. I found on the bottom here this excellent introduction to RDF, and it's by far and away the most understandable introduction to RDF that I've ever seen. So what do we mean by increasing semantics? Well, in our relational model that we saw before, again, as a human, you would read into this that there's something called a subject, which is a role in a study, and that that subject is a mouse, and it has an attribute, which is 50 days, and that 50 days may be considered an adult, and it was used in a protocol, and I could say all these things about it. But in fact, the relational database doesn't know anything at all about what those relationships are. It just knows that there is a connection there in some way, just because of the way it's structured. So the fact that the subject is of species mouse, you can derive that information from it, but it's not like the database itself knows this. That's why you cannot say to the database, hey, find me anything that has the age of 50 days, because it can only go down, select from table where x is true, and not up. I can express that same sort of information as a triple.
So mouse has age 50 days; depending on how my relationships have been declared, I can go backwards and forwards. I can say a protocol uses an instrument called a confocal microscope. I can even set up a little rule that says a confocal imaging protocol is a protocol that uses the instrument confocal microscope, so I can define things in terms of other things. RDF is capable of expressing all that type of information, and you can see it's very, very powerful for getting things out. You can also see that it's sort of schema-less; that is, I can have as many triples as I need, and those triples have no data model or other schema that they need to follow, so I can take a set of triples from this database and a set of triples from that database, and to the extent that they reference the same things, I can mash them all up, all right? And again, the computer really doesn't know what these things mean; they're just labels. So there are no logical sequelae of saying that this thing has a neurotransmitter; it doesn't mean that it satisfied the five criteria of neurotransmission or anything else, it's just a label, all right? But you can still get a lot of useful information out of RDF. In OWL, we're allowed to express more things, so some of the types of definitions or restrictions that we can place on classes actually allow us to reason over this. So for example, there's a relationship inside of OWL that's called disjointness, and there's a logical consequence to disjointness. If I say class A is disjoint from class B, it means something cannot be a member of both classes simultaneously, and therefore, when you do reasoning over that, if a subject is declared to be both a reptile and a mammal, the reasoner will come back and say, no, I'm sorry, you can't be both a mammal and a reptile, because you told me those two classes are disjoint. You can't really easily do that in basic RDF; you can be whatever you want, okay, because you haven't restricted it.
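Both ideas, triple pattern-matching and the disjointness check, can be sketched with plain tuples. The facts are the ones from the talk; everything else here (the match helper, the consistency function) is an illustrative toy, not any real triple store or OWL reasoner:

```python
# Toy triple store: each fact is a (subject, predicate, object) triple,
# and a query can fix any combination of the three slots.
TRIPLES = [
    ("mouse", "has_age", "50 days"),
    ("protocol", "uses_instrument", "confocal microscope"),
    ("Purkinje cell", "has_neurotransmitter", "GABA"),
]

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in TRIPLES
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "What has the neurotransmitter GABA?" Fix predicate and object.
print([s for s, _, _ in match(p="has_neurotransmitter", o="GABA")])

# A disjointness check in the spirit of OWL: nothing may belong to two
# classes that have been declared disjoint, e.g. mammal and reptile.
DISJOINT = {("mammal", "reptile")}

def consistent(memberships):
    """False if the memberships violate any declared disjointness."""
    for a, b in DISJOINT:
        if a in memberships and b in memberships:
            return False
    return True

print(consistent({"mammal", "reptile"}))  # False
print(consistent({"mammal", "mouse"}))    # True
```

Note that the store answers the GABA question by pattern, without any idea of what "has_neurotransmitter" means, which is exactly the point made above about labels.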
There are also things like universal restrictions and existential restrictions. So for example, if I take this statement, the thalamus projects to the cortex in mammals, in RDF you can't tell the difference between there being at least one mammal where the cortex receives projections from the thalamus, and it being true in all mammals that the cortex receives projections from the thalamus. If this in fact turned out to be true, OWL would allow the expressiveness for you to say that, by declaring a universal or an existential restriction. So OWL is just much, much more expressive. You can also see from some of these things that in biology, for the most part, we're a little uncomfortable about making these all-or-none statements. And so we don't always use all the expressiveness, because in fact, we know that most of it's not true. Yeah. So I mean, in particular with these statements that you often read in papers, if you go over, let's say, 10 papers, you'll find two papers that say the thalamus projects to the cortex and five papers that say it doesn't. Right. So is it possible to also express this accumulation of statements? So ontology languages are not good at that. And there are some people who think that you should engineer that sort of fuzziness into them. I actually believe that, again, what you're talking about are instances here. And the instances can say they support the model or they don't support the model, and you could compute statistics over those instances to say that there is a probability of X% that this is true. But this is one of the places where the two camps clash, because when you're talking in the pure domain, they're only talking about things that are true, and not all the things that are not true. And so therefore, probability doesn't fit. So if you have a statement, if we now have another reference that said in 50% of the animals that we looked at, we find the thalamus projects to the cortex, you could, in an ontology world, use an existential, right?
So it is true that at least in some animals it projects there, but you cannot say that in all animals it projects there. So that statement you could model. But this is why, again, you have to be careful, because we know that this sort of inconsistency is all over the place. So that's why we believe that the data belong with the data, and the statements you make about the data are a little bit different. So let's talk about some key features of ontologies that, again, make them very, very useful. The first thing is that they establish a unique identity. Any class, any category, concept, entity, whatever you want to call it, has its own unique identity, and that unique identity is not like any other identity. So a nucleus, a Purkinje cell, whatever it is, it has its own unique identifier. And it doesn't matter what you call it. You can call it the number one; it is still a Purkinje cell, because all its attributes and its definition and everything else say this is a Purkinje cell. So classes should be uniquely identifiable, and we can use those identifiers, as you saw in databases, to help us disambiguate things. Ideally, this name should be a meaningless identifier, something called a URI, or Uniform Resource Identifier. We tend to discourage, as I'll show you in a minute, the use of the actual name as the name of the class, and there are several reasons for that. But the interesting thing about it is that any number of human-readable labels can be assigned to it. So you can use whatever language you wish; they all point to the same concept. The definition is meant to be both human processable and computer processable. So oftentimes there's a textual definition that goes with it that will say a Purkinje cell is a type of cell that's found in the cerebellum. If you look at the graph structure and the restrictions, they should pretty much say the same thing. So the computer needs to be able to reason to the same conclusion that you do. 
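The separation of identity from label can be sketched like this; the `nlx:`-style identifiers and the entries below are invented for illustration, not real ontology IDs.

```python
# Classes keyed by meaningless identifiers; human-readable labels and
# synonyms hang off them. All IDs and entries are invented examples.
classes = {
    "nlx:0001": {"label": "Purkinje cell",
                 "synonyms": {"Purkinje neuron"}},
    "nlx:0002": {"label": "nucleus of a cell",
                 "synonyms": {"nucleus"}},
    "nlx:0003": {"label": "nucleus of the brain",
                 "synonyms": {"nucleus"}},
}

def resolve(term):
    """Return every class a human-readable string could point to."""
    t = term.lower()
    return sorted(
        cid for cid, c in classes.items()
        if t == c["label"].lower()
        or t in {s.lower() for s in c["synonyms"]})

# Preferred labels are unambiguous; synonyms need not be.
print(resolve("nucleus of a cell"))  # resolves to a single class
print(resolve("nucleus"))            # ambiguous: two classes share it
```

Retiring `nlx:0002` would retire only the identifier; the label "nucleus of a cell" could be reassigned elsewhere, which is exactly the flexibility meaningless IDs buy you.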
So they often encourage you to give these definitions in terms of a genus (this is a type of X) and differentia, the properties that differentiate it from other members of its class. And depending upon the type of class, you can include necessary and sufficient conditions, using OWL restrictions, that would allow membership in that class to be assigned automatically depending on the properties you give it. And again, this is independent of the implementation. So whether it's RDF or XML or just a list of words, these things are independent. So what are some good naming practices? Again, meaningless IDs. The reason we like meaningless IDs is, A, they uniquely identify a class through an identifier, and numbers are cheap and infinite, whereas terms are in limited supply. If for any reason you need to retire that class, the class must go away and it must never come back again. So I always say, do you really want to retire a perfectly good label just because you made a mistake in classification? No. But if you retire a unique identifier, the labels can be repurposed, okay? Next, interoperable ontologies. If you use terms from another ontology, the tendency is to import them into your own resource and reassign identifiers. But this leads to the necessity of keeping mappings, and so it's really better to just keep the source identifier, even though it makes your ontology look a little messy because you have all these different prefixes floating around. Even though the class is named by a meaningless identifier, it should also have a human-readable label. And the reason I have this Janus coin here, the two-faced god of Roman mythology, is because ontologies need to be understandable to humans, but they also need to be understandable to machines. 
So I actually take the "ontology as Esperanto" view, meaning that this is a way of communicating, which means I need to know what it says in order for me to appropriately judge and utilize it. So all the ontologies that I'm showing you in these slides have IDs on all the classes, but I don't show them to you; I show a human-readable preferred label. But each class can have many, many synonyms. And the human-readable label should be unique, that is, unambiguous. So nucleus can be the nucleus of a cell, a nucleus of the brain, or the nucleus of an atom; it's far better when you name the class to call it nucleus of an atom, nucleus of a cell, or nucleus of the brain, because then to a human it's unambiguous. So every class should have its unique label, but the synonyms need not be unique. Nucleus is a synonym for all of those, and if you use the term nucleus, the system would need to know that it can point to more than one class. There are also some very bad practices that have been used very frequently, which make an aggregation site like BioPortal difficult to use. So if I type cerebellum into something like the Unified Medical Language System, which aggregates and unifies ontologies, or into BioPortal, I will get 50 cerebellums. Some of them point to the brain region. But some of them, if you actually read the definition, are something called neoplasia of cerebellum or tumor of cerebellum. The reason why is they have a hierarchy called neoplasia, and underneath it they have cerebellum, which is a proxy for neoplasia of the cerebellum. Not a good naming practice, because again, if I pluck this out of its hierarchy, it becomes ambiguous and difficult to understand. So again, always thinking universally about your data is important. So everyone says, why do I need these unique IDs? We're going to use statistical processing. I don't need to bother with this markup. I don't need to disambiguate. 
Everything's going to figure this out just by the statistics. And I say, that may be, but I'd like to think that we should help this process along. So, using myself as an example: why are strings not enough? I give you here six papers that are by M. Martone or M.E. Martone. They deal with wikis, ontologies, electron microscopy, amnesia, traumatic brain injury. Some of these are my papers and some of these are not. Even though there are not many M. Martones, there is more than one, okay? So if we look at those, could we guess, based on the domains, which ones were mine and which ones were not? And you might say, well, traumatic brain injury and amnesia, those are probably the same M. Martone. Actually, it turns out those are not the same M. Martone. One of them is my cousin Marilyn, who studies traumatic brain injury as a philosopher, and I used to do neuropsychology. So if you look at it, you can see that sometimes I'm M.E. Martone, but sometimes I was M. Martone, because when I first started, I didn't use my middle initial, all right? And you might say, well, statistically we would be able to pull that out. So it might be that a statistical algorithm will go, well, there's an M. Martone and an M.E. Martone, so I'm going to assume that those two are the same and connect those up. We already saw that that was wrong, because M.E. Martone could sometimes be M. Martone. Maybe I'll look at the domains, because surely somebody is always in the same field and never strays from it. So I'll say, well, amnesia, traumatic brain injury, caudate nucleus, those things could co-occur together; maybe that's all part of one of these M. Martones. Again, that turns out to be wrong, okay? If I actually look at what this graph looks like, I see that there are three M. Martones in there. One is my cousin Marilyn, who is related to me. Another is an M. Martone I don't know, who actually worked on the Gene Wiki. 
There's me, who was M. Martone and is the same as M.E. Martone, and I have these different domains. So by using unique identifiers to pull this apart, you get a much more accurate map. Why is that important? Well, in this world of linked data and mashups, I can tell an awful lot about myself. By pulling together all of this information, I can add noise by adding multiple M. Martones in there, but I can also disambiguate it by just assigning myself a unique ID. And in fact, there's a project, the ORCID project, that's doing just that. Everybody who's an author is going to get their own unique ID, so you become yourself and nobody else, right? So there have been a couple of editorials that say "I am not a number," but I should be a number, because that makes it a lot easier to do this. It's not impossible to do it otherwise; it's just a lot easier this way. And notice I can have all kinds of synonyms, labels, whatever, right? Again, sometimes we use very closely related terms for clearly different concepts. So it is not unreasonable to have a statement that says we studied the behavior of Ca2+-binding proteins and CA2 neurons under high and low Ca2+ conditions. In one case it's a protein, in another it's a neuron, in another it's calcium, an ion. And if you go to NIF and you query across sources, you can see all the different ways we can be fooled, in particular because there are no three-letter acronyms that have not been used by a gene someplace. You almost always get a whole lot of noise, because it's very difficult to disambiguate these things across sources. So this actually causes a lot of time and effort to be wasted, because we don't unambiguously identify entities. Skirmish number four: I want an ontology of my own. We heard this before. I don't want to use somebody else's; I need to have my own. And you know, that's entirely legitimate. A lot of times there isn't an ontology. 
You feel that you need to get your head around the problem. You've gotten money to build an ontology, so by God you're going to build an ontology. I can build an ontology better than you can. So NIF long ago stopped trying to persuade people not to do this. We've just said, hey, if you're going to do it, let us tell you why it's not a good idea. So one of the things people say is, well, we're just going to do this mapping later. We're going to create our own ontology and then we're going to map to some community standard or some community ontology. So somebody over here will make a gross anatomy ontology and start going down to cells, and somebody over here will make a cell ontology and start going up to gross brain anatomy, but they will pass each other by, because we don't know that this cerebellum is the same as that cerebellum. But let's say we did start doing the mapping, automated and otherwise. If you go to the NCBO BioPortal, they've done all kinds of automated mappings, and a large majority of them are wrong, just because human knowledge is very difficult to grapple with; it's very difficult to match. There are many different hierarchies you can place these things in. So how do you note that the cerebellum under neoplasia is different from the cerebellum in brain anatomy, right? We also find that no matter who does this, including NIF, the mappings are never complete. They never cover everything that you want to do; they are always partial. The other reason you really want to build off of other ontologies, though, is that building your own doesn't allow you to take advantage of this vast world that's out there. So instead of building it this way, where I create my own ontology over here and then do the mapping later... 
If I actually started from common building blocks and said I'm going to take all my brain regions from here, I'm going to relate them to the cells that I have over here, and I'm going to relate them to the molecules that are over here, I can take advantage of all the work that has been done in those communities. So in this case, you can see I have multiple ontologies that have all come together to model my particular domain. And if I go to the IP3 receptor from ChEBI, I see that I get all kinds of structural information, all kinds of additional information that they added to it. So if we really want to create this linked data graph, at some point we have to have common identifiers, because if we have no way to match things up, then there's no way for us to produce the graph. And using common building blocks makes that a lot easier to do. And that's why, when we import these ontologies, we favor keeping the original URIs, so that if there's another resource on the other side of the world doing something with IP3, we can link the two together. We can mash them up, right? And why is this so important? Again, NIF sees this all the time. If you look at the connectivity databases that NIF has, we have six of these: BAMS in rodent, ConnectomeWiki in human, BrainMaps with different species, CoCoMac, the UCLA Multimodal Connectivity Database, and the Avian Brain Connectivity Database. We extracted 1,800 unique anatomical terms. If we looked at how often we got a string match between them, granted that they're different species and what have you, there were only 42 terms that occurred in more than one database. So only 42 out of 1,800 were a direct syntactic match. If we added synonyms, the number went up to about 100. If we added partonomies and other sorts of relationships, it went up to about 400. So the more of a model we have sitting behind the terms, the more we're able to pull data from these different sources, put it together, and do that in a unified way. 
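The effect of modeling synonyms on cross-database matching can be sketched as follows. The terms, the synonym table, and the `birnlex:`-style identifiers below are invented toy examples, not the actual 1,800 extracted terms.

```python
# Two toy connectivity vocabularies that never match on raw strings.
db1 = {"Cerebellum", "Caudoputamen", "Area 17"}
db2 = {"cerebellar cortex", "caudate nucleus", "primary visual cortex"}

# A small synonym table (invented): canonical ID -> known strings.
synonyms = {
    "birnlex:1489": {"cerebellum", "cerebellar cortex"},
    "birnlex:1373": {"area 17", "primary visual cortex",
                     "striate cortex"},
}

def canonical(term):
    """Map a string to its canonical ID, else keep the raw string."""
    t = term.lower()
    for cid, names in synonyms.items():
        if t in names:
            return cid
    return t

exact = {t.lower() for t in db1} & {t.lower() for t in db2}
via_synonyms = {canonical(t) for t in db1} & {canonical(t) for t in db2}

print(len(exact))         # direct string matches
print(len(via_synonyms))  # matches once synonyms are modeled
```

Here zero terms match syntactically, but two match once each string is resolved to a shared identifier; adding partonomies and other relationships would widen the net further, which is the same progression as the 42, 100, and 400 figures above.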
It doesn't answer the question of whether A projects to B, but at least it lets us pull together the statements that would bear on that question, so we can look at them. So again, NIF, INCF, and others are working on ways of translating between these different anatomical nomenclatures, because we're not using shared building blocks, but it's absolutely critical that we be able to do so. Okay, so how do we go about building ontologies so that they're a little bit more interoperable than they typically have been? How do we build them in ways that maximize their reusability? One of the things we did is we made them very simple and straightforward. They're modules, very uninteresting modules. They basically give you lists of brain regions, lists of cells, and lists of everything else. We didn't add a lot of knowledge or complexity to them, because the more knowledge and complexity we add, the more custom they become to our applications and not to your applications, and we didn't want to do that. Also, the larger the graph gets, the harder it is to work with. So here's another skirmish that happens all the time. You say, well, that's great that you have these very simple lists of things, but how is that useful? You've already told me that Purkinje cells are GABAergic cells and cerebellar cells; things can exist in multiple different hierarchies. So a good design principle, and I think the one that has been most misunderstood in ontologies, is the idea of single inheritance. Single inheritance means that you should only assert a single parent for every class. By assert, I mean I physically place that thing under a particular class. So I say a Purkinje neuron is a neuron. Multiple inheritance means that I've asserted that a Purkinje neuron has more than one parent: it's a GABAergic neuron, it's a spiny neuron, it's a cerebellar neuron. Why do I not want to do this, okay? 
Even though we know biological entities are complicated, if I had to manually place myself in every network to which I belong, I'd spend every minute of my day placing myself and managing my networks. If I have a way of doing it algorithmically, then I don't need to bother. So when we say that we don't want multiple inheritance, what we don't want is multiple asserted inheritance; what we want is multiple computed inheritance. By using properties and OWL restrictions, you may computationally generate as many ontology hierarchies as you wish, but you don't want to manage them. That is because when you put a definition on a class, the minute that condition is satisfied someplace else, it automatically gets placed under that class. So for example, in neurons, we have a hierarchy which says there's a cerebellar Purkinje cell, which is a cerebellar neuron, a spiny neuron, a principal neuron, a GABAergic neuron; it's very, very flat, but we have rules in place that define what a spiny neuron is, what a projection neuron is, what a GABAergic neuron is. These are very simple rules. And after we run a reasoner, we reclassify our hierarchy so that a Purkinje cell becomes part of all of these. So we can use these things; it's just that we don't assert them. I'm going to skip upper ontologies because I think I'm running out of time. But the last clash I want to talk about is top-down versus bottom-up. This is a big, huge clash. The philosophers say you've got to very rigidly control from the top down. We need to know exactly what's going on, and we need to use very logically rigorous principles to make our ontologies. The domain scientists say, eh, it's too hard. Our domain is not easily modeled. It's going to change anyway. We don't put that much store in the knowledge. We just need a set of useful terms that we can use. And there is something to be said for both of these. And NIF navigates these waters by creating something called the NeuroLex wiki. 
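The asserted-versus-computed distinction just described can be sketched as a tiny stand-in for an OWL reasoner. The properties and rules below are simplified illustrations invented for this sketch, not NIF's actual restrictions.

```python
# Assert only one flat parent, plus plain properties on the class.
asserted = {
    "Purkinje cell": {"is_a": "neuron",
                      "neurotransmitter": "GABA",
                      "spines": True,
                      "soma_location": "cerebellar cortex"},
}

# Defined classes: necessary-and-sufficient conditions, playing the
# role of OWL equivalent-class restrictions. Invented for illustration.
rules = {
    "GABAergic neuron":  lambda p: p.get("neurotransmitter") == "GABA",
    "spiny neuron":      lambda p: p.get("spines") is True,
    "cerebellar neuron": lambda p: "cerebell" in p.get("soma_location", ""),
}

def classify(props):
    """Compute, rather than assert, the additional parents."""
    return sorted(name for name, test in rules.items() if test(props))

print(classify(asserted["Purkinje cell"]))
```

The point of the design is visible here: change a property on the class and rerun `classify`, and the extra parents update themselves; nobody has to hand-maintain three hierarchies.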
So all of our ontologies are exposed in a semantic wiki. Each of the concepts gets its own page. But unlike Wikipedia, this has formal knowledge models behind it. So you can link pages based on properties, and again, add those properties so that when you make a modification to a page over here, it automatically gets reclassified and linked to this other page. This is something that we're working on with INCF. And you can see here that if you look at the page for the cerebellum, it talks about neurons in cerebellum, axons in cerebellum, parts of cerebellum. None of that was asserted; it was all computed by rules that are inserted into the page, which allows some reasoning to be done on top of the ontology. So I don't really have time to go over that, but the basic idea is that this is meant to be community-driven, with very easy entry. People can contribute their terms; they become immediately available; they're readily indexed by Google, because wiki pages are readily indexed by Google. And it also gives you a place to learn a little bit about ontology engineering and other sorts of practices by working with structured knowledge in an environment that's not so difficult. I have a question. So with this, as with journals, somebody always provides some type of vocabulary. Yeah. And my impression is that people are forced to write down terms, and they invent them on the way rather than using the ones they're actually using. I mean, one example is when you're submitting a paper to a journal and they almost always force you to write down keywords. Yeah, keywords. This is a step I actually... hate. It takes longer than writing the paper to fill out five keywords. Oh, yeah. And if you look at different papers, they all come up with very fancy keywords that nobody is ever going to come up with. So I wonder, isn't that something you can almost do automatically, by just mining the text and doing statistics over the long structural words that appear? 
So the use of these ontologies is exactly that. We actually believe that these are supposed to work part and parcel with text mining and other things that need knowledge models to start with. So you can do these sorts of statistical things, but there are some things that are extremely hard. Again, I just came from a conference with the top text miners, and scientific constructs really flummox them in many places. And so there are a few places where, having these knowledge models set up, they can use them. In fact, a lot of the terms in NeuroLex were contributed by text miners who used the basic starting building blocks that we already had available, and then used that as a model to pull more things out. So I don't believe that this is an either-or. I think there are actually some things that text mining does far better than humans do, and these things should work together. Okay, so let me finish up. Basically, the idea of NIF is that we took a lot of these community ontologies that were sort of tangled together, and we attempted to straighten them out, with the sole purpose that people could build other things with them, right? This is not meant to be a be-all and end-all. We have not encoded all knowledge. It is meant to be a tool that can be used and built upon, and other things can be built with it. And I think that's just really important to know. There are a lot of ontology tools and services out there. NIF provides all of its ontologies through web services; they're available for download; we have some RDF. There's BioPortal, the OBO Foundry, the PONS program that's here. And the thing about ontology that I want to leave people with is that it's very, very useful for some things, but I don't believe it is useful for everything. It has its place in data integration as a high-level sort of theory of conceptual integration across fields. 
But I would never say that we should encode everything we need to know inside of an ontology. I think there are better ways of doing that when it gets down to data. So it's really not about restricting expression. It's not about imposing a unified model. It's about making the conceptual knowledge that we all sit in these seminars and learn a little bit more exposed, a little bit more machine accessible, so that it can be used. But it should not be granted any sort of magical powers, because we know human thinking is fraught with inconsistencies. It's fraught with all kinds of difficulties, and that's just the way of it. So it's not the answer to everything, and it's not easy to understand. But if you look at the NIF system, NIF really says, listen, there are different ways that we interrogate and organize information. You saw that in the types of databases that are being produced. So NIF actually believes it's a messy world. If I were really drawing this in an engineering way, I'd make it nice and neat, but in fact I don't think that it's nice and neat. I think there are all kinds of loops that go on in there. But we have literature, we have databases, we have claims about results in the form of RDF, we have universals expressed in OWL. And they all have their place in trying to navigate this very, very messy world, okay? I also don't want to leave you with the idea that this is completely technologically solved. Even things like uniform resource identifiers, URIs, or RDF; there are a lot of discussions, and there are multiple views about linked data. But I put the printing press here and the linked data graph here to remind ourselves that we've been communicating about science in terms of printed articles for 500 years, and we've only had the internet for about 20. So there's a lot of work that still needs to be done here, a lot of stuff that still needs to go on. It's not surprising that things are difficult. 
But effective data sharing, as we heard from JB, is very important. It's still an act of will, but I think it's quite important for us to put a lot of data out there in an accessible way if we're ever going to get over this hump. So we do know some things. NIF has a whole lot of rules and guidelines on how you can make your database more interoperable. And the main thing I remind people of is just to think about it. Think about it when you're producing your resource: if there's something you can do to allow some other agent to grab it and use it in a way that it wants to use it, and not the way you intended, then you should do it, all things being equal, okay? Finally, I wanted to let everybody know about this new organization I've just joined, which is called the Future of Research Communications and E-Scholarship. This is a group of people led by Phil Bourne, who is one of the founders of PLOS. And it's bringing together library scientists, informaticians, text miners, researchers, basically everybody who's interested in how we are going to adapt this new medium of communication and access to move scholarly communication forward. So there are big discussions going on in this community and elsewhere about how we can fix scientific publishing. What's the most effective way of writing something for high impact? What are these new initiatives, these new altmetrics, for judging how productive people's careers are? I've just recently taken over as executive director, and I've been finding it fascinating. So I've been telling all my research colleagues that you ought to go look and sign up, because there's this huge world out there where people are considering this. But the main trick now is to bring the actual practitioners of scholarly communication into the fold to say what it is that makes sense. We are no longer restricted to articles, though they still have their place. 
What now is going to make sense in this sort of new generation of scientific communication? That's it. "Where I'm going to bury my writing"? Oh, it sounds cool. Write a book chapter, that's what it is. They basically analyzed the impact of book chapters; I actually wrote a blog post about this. I find that sometimes you do your best thinking in a book chapter, because you're sort of free from the constraints. But if you talk to your advisors, they'll just say, don't write book chapters; they're not counted towards your tenure, they're not counted towards anything else. So the point of this piece is that they're really useful mechanisms. But why put them in some expensive book that 10 people are going to buy for $100? Why not write it on the web? So if you're going to write a book chapter, or even edit a book or compilation, use something like Wikibooks, or something where everybody can read it, and then the impact will actually be quite high. So that's what that's about. That's it.