So this is somewhat of a free-wheeling talk, with the title "Where do we go from here?" and underneath it "databases and ontologies." There was a lot of discussion and a lot of questions about scientific data, about practices in science, about making scientific data available, and that's really where I have spent most of my time in the last decade, in fact even from the early days of the web: what do we do with all the scientific data? What do we do with the scientific information? How do we make it searchable, and how do we make it available? So the organization of the talk is this: after some high-level introduction, I'm going to talk about the Neuroscience Information Framework, which is essentially a search engine for neuroscience data. But it has done a lot more than that, in the sense that we have been cataloging, surveying, and tracking neuroscience resources since about 2008, and so we know a lot about the digital landscape and some of the challenges that are faced when trying to make data available. We'll go through the basics of structured information and federating neuroscience data (we've heard about data federations), talk a little bit about ontologies and information frameworks, and then show what I call identifiers in action, an initiative called the Resource Identification Initiative. I'll get through as much of this as I can. I used to talk about this exclusively from the point of view of databases, but we've been dealing with digital information, and grappling with the fact that we produce things that technically are shareable once science went global. How do you make these things available?
We've had museum collections for a long time; we've had various ways of preserving some content. But other than that, science essentially views data in many cases as a "disposable byproduct of research," as one researcher put it. So I've started to think about this less in terms of purely information modeling and databases, and more in terms of the fact that we have to start thinking about transforming our whole way of practicing and disseminating science, and some of those transformations are already underway. If you think about it, this represents what I call the traditional model, which is well understood and has been honed over hundreds of years: the scholar produces the content, it goes through peer review, and it gets to a publisher. The publisher actually deals with the library; the publisher really doesn't care very much about the individual researcher. That's changing a little bit with open access, but I've had the privilege of going to various conferences where I've had the libraries and the publishers there, and they view it as their job to interact with each other. The library itself, of course, is trying to reinvent itself, because many people now never set foot in a library except as a study hall, because the whole model of content is changing. But we knew this cycle very well: the scholar got access to that content back from the library. It was always ironic that, unless the library had a subscription to the journal the researcher published in, researchers might not even be able to see their own research, except for what rights they happened to buy back. So, very nice: we've gotten this honed, and our entire merit and reward system is based in many ways on this cycle. But if we think about it now, we still have the players: we still have the library, we still have the publishers, the peer reviewers, and the scholars. But if you think about what scholars are producing now, they're obviously producing a
lot more things that are potentially shareable, and it is not just the final narrative and synopsis of a result. As somebody said, papers are really stories with data. But we produce narrative, workflows, data, models, nanopublications, multimedia, and code, much of which is not well served by print media, or by an electronic version of print media, but in fact requires other sorts of things to make it available. I deliberately made this slide messy, because right now this is what our scholarly landscape looks like: all of these things go into different infrastructures, some of them get direct access, some of them don't, there are various curators involved, there are code repositories and community databases. I also replaced the scholar with somebody called a consumer, because of course in the information age you no longer have to be part of a prestigious university to go in and use their library. You've got content all over the place; the idea of citizen science, or even that taxpayers have a right to the research that they have paid for, has become one of the rallying cries around open access. So there's a consumer, and it's important, I think, to recognize that that consumer is of a dual nature. It is a human being, of course, who ultimately is an integrator of information, but it is also a machine. The information needs to be machine processable and machine accessible, because more often than not you are getting it via the agency of some search function, or some sort of algorithm that goes into this information and gets it to you. When I started in science, that wasn't the way: you got a little book every week that said here are all the articles that were published, and you flipped through them to try to find things, and that was a major advance, right, because before then you only got the journals you could actually lay your hands on. So there's been a tremendous transformation here, and it has only been going on for 20 years. So clearly we
have not developed a set of scholarly practices, citation processes, and evaluation metrics to deal with all of the diversity that we are producing, and there are huge arguments going on about whether, or how, we should preserve and make access to all of this, and how we should judge these things relative to the narrative that we produce, which is our current currency. So really, the issue of databases and ontologies is more about the fact that the scholarly platform has the potential to change. How are we going to change it? How are we going to harness it to drive science forward? So the Neuroscience Information Framework project was actually founded in 2008; there was a period from about 2006 to 2008 when there was a bit of a prototype. It was recognized by the National Institutes of Health, the main US funding body for biomedical research, that nobody had any idea what their grant money was producing. So they knew that they had to invest in databases, and they knew they had to invest in software tools. You saw in the introductory slides, or I think it was in somebody's slides, Mark Oliver's, that there was a Human Brain Project in the United States which ran in the 90s, which was supposed to fund this infrastructure. It was based largely on genomics, where everyone said: oh, we've got GenBank, we've got Swiss-Prot; we need those things for neuroscience. What's going to work? So there was a lot of investment in trying to build these databases, trying to build software tools, image processing suites, whatever it was; it was largely driven by neuroimaging. But that NIH Blueprint is an interesting meta-structure that sits on top of the National Institutes of Health. The National Institutes of Health are organized around different disease focuses: you have heart, lung, and blood; you have digestive and kidney diseases; and then you've got 16 institutes that have something to do with neuroscience.
You've got addiction, you've got alcoholism, you've got aging, you've got neurodegenerative disease, you've got all kinds of institutes, but there was no "neuroscience" at NIH; there were just 16 institutes. So one of the former directors of the NIH, Elias Zerhouni, said we need some trans-institute initiatives, and they created the Blueprint. And the first thing the Blueprint said was: we don't even know what we funded. This was before they even had a database that recorded which grants were funded, and how do we even track it? There was a sense, because Google of course came on like gangbusters in the late 90s, that there were search engines and things that could find all of this, but in fact it is still very difficult to find this information, as we've heard already. So nobody knew how many tools were produced. They didn't know what domains were covered, they didn't know what domains were not covered, they didn't know how the tools were being constructed or where they were being referenced: websites, databases, literature, supplementary material. They had no way of really tracking who used them or who was creating them, and, more importantly for us: how do we find them, and how can we make them better in the future?
So NIF, again, has been surveying, cataloging, and tracking the neuroscience resource landscape since 2008. It is one of the largest data sets of its type, because the web, which we usually use to track these things, is a very ephemeral thing: things come and they go. Scholarship, of course, requires that we know what happened to these things and what state they existed in, so it's a very valuable data set for tracking this. This is just a screenshot of the current homepage of the NIF, and the NIF is a search portal. There's a search bar, and one of the requirements was that it should be as easy to search as Google. One of the reasons Google seems easy to search is that there's just a bar that says put in a word. Now, normally when you query databases, as many of you know, you get a form, and it says: fill out exactly what you want. If you know exactly what it wants and how the data is structured, things are very easy to find. But when you don't know what's there, you don't know what the coverage is, and you don't know how it's structured, keyword search actually turns out to be a very effective way of searching across large numbers of databases. And as funny as it sounds, I understand it was a major advance in database query that you could just put in a word and search across, irrespective of the way the database was structured. But our first job was to survey which things are available and characterize them according to type, and these were any sort of research resource: any type of database, software tool, material, service, or organization that supported neuroscience. We created something called the NIF Registry, which currently has about 12,000 of these things catalogued. It turns out not to be that easy to characterize these things, and we're constantly changing the categorization, and they're scattered all around the world. And if we look at those, we see that roughly about 3,000 of these things
represent what we call databases (we'll talk about what that means in just a moment) and data sets. This was, I think, way more than people expected, and they're much more dynamic than people expected: they come up, they go away, they get ingested by other organizations. But just that sheer number says: wow, that's a lot of databases. And if I had to go to every single one of these databases to find the thing that I wanted, I would have a heck of a time finding the time to do that, especially because, as we will see, every single one of them has a different data model, terminology, query interface, you name it. They're very complex entities. So we realized very quickly that we had to be able to search across these databases if we were going to be effective. NIF therefore also has something called a data federation. It's up to about 800 million records, and these 800 million records come from hundreds of databases that are distributed around the globe. We also query the literature at the same time, because this is still the largest source of information for neuroscience; this is where people publish, because, as we know, there are very few incentives to make your data available in a database. So currently NIF is one of the richest sources of neuroscience information available. And you can see from those thousands and thousands of databases what we've known for a very long time: if we want to build the database equivalent of the GenBank, the Swiss-Prot, the PDB for neuroscience, we're going to have to deal with a whole lot of different data types. We have a lot of different data types, a lot of different paradigms, a lot of different model systems. And so from the get-go, in the early days at UCSD, working with the San Diego Supercomputer Center, we said there needs to be some sort of federated system. There's not going to be one gigantic data warehouse that houses all this information,
but we're going to have to rely on these distributed databases that are scattered around, and somehow have the ability to query them and bring them together. So here are some very simple definitions. A data warehouse contains data from diverse sources, but you spend a lot of time fitting the pieces together to make one thing merge into another. So they take a lot of work, and they take a lot of effort to maintain; essentially, it's one database to rule them all. A data federation, by contrast, is basically a virtual database: it has the data definitions, it has indexes that work across the sources, it may have some common access protocols, but these are all independent databases that are made to look like they work together. The entity that brings this together is responsible for making calls to the appropriate databases, bringing the results back, and meaningfully aggregating the returned result set. And I would say that of all the things NIF encounters, this is one of the biggest challenges, because how do you relatively rank databases and data content? It turns out to be one of the hard problems of computer science. So, if we ask what NIF essentially is, because people say: what is it? Is it a search engine?
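Before answering that, the fan-out-and-aggregate pattern a federation performs can be sketched in a few lines of Python. All source names and records below are invented for illustration; NIF's actual federation is far more elaborate:

```python
# Toy data federation: fan one keyword query out to independent
# "sources", each with its own record shape, then merge the results.
# All source names and records are invented for illustration.

def query_atlas(term):
    records = [{"region": "cerebellum", "species": "human"},
               {"region": "cortex", "species": "mouse"}]
    return [r for r in records if term in r["region"]]

def query_gene_db(term):
    records = [{"structure": "cerebellum", "gene": "GRM1"},
               {"structure": "hippocampus", "gene": "BDNF"}]
    return [r for r in records if term in r["structure"]]

SOURCES = {"atlas": query_atlas, "gene_db": query_gene_db}

def federated_search(term):
    """Call every source, tag each hit with its origin, and aggregate."""
    results = []
    for name, query in SOURCES.items():
        for record in query(term):
            results.append({"source": name, **record})
    return results

for hit in federated_search("cerebellum"):
    print(hit["source"], hit)
```

Note that the hard part just mentioned, relatively ranking results across sources, is deliberately left out of this sketch.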
It is that, but I like to think of it as a new type of entity for a new model of scientific dissemination. That is, if you are going to be putting data into databases, then just like PubMed was a major advance (you didn't have to go to every journal website; you could go to one place that structured and unified the information), we need something that lets you search across all of these things that are available. We also figured that we needed to unite neuroscience information without respect to domain, funding agency, institute, or community, because, again, neuroscience is very fragmented. It's like a PubMed for biomedical resources, in that our registry entry is the equivalent of an abstract: here's a little synopsis of what this thing is, and here are some keywords to characterize it. But we're like a PubMed Central for some databases, the ones that we ingest, because for those we get to search inside the deep content of the database; we make them searchable from a single interface. We designed the NIF to be practical and cost effective. The only things that are roughly equivalent to a NIF are the types of resources at EBI and NCBI. They have a budget of 500 million dollars and an institute of 500 people; NIF has a budget of just over a million dollars and about 10 people. So it was really designed to be something that could be deployed quickly and managed effectively. Also, the reason the genomic institutes invest so much money is that there's a clear value to genomics data. Until recently, there hasn't been a clear value to all this other data.
In fact, there are still large arguments about whether sharing individual data sets, or even small data sets, is worth the time and money. So without a real incentive, without the community having some reason to actually populate these things, we deliberately designed NIF to be lightweight and agile, so that we could serve the community without an overburdening cost, because no NIH institute is offering us half a billion dollars and 500 people, and I don't think they will. So I want to remind you that when we talk about databases, we're talking about a specific type of information that we currently call structured data. In the modern web world, that means it's easily machine processable, and it's accessible by some sort of query language, some sort of call to the database. And if we think about the sum total of neuroscience information, as Mark said yesterday, most of it is the unknown unknowns; we don't know everything that's out there. But if we look at the majority of the things that we do know about, the known knowns, or at least the known unknowns, most of it is in the literature, which is considered unstructured information, and I'll talk about that in a moment. There are images, again, which are unstructured. There's human knowledge: there's a lot of information in our heads, much of which never gets passed on or put into the literature. We have dark data.
We have non-digital data. We have a lot of things in file drawers and closets that never see the light of day. Some of the questions being raised about reproducible science arise, as we've already noted, because there's a bias towards publishing positive results, and only positive results, in the literature. The failed experiments, as we learned, ended up on the ceiling of Hodgkin's apartment, or they end up in a drawer someplace. But that causes a bias in the record, because we don't recover all of the information that's there. So there's a whole lot of stuff out there to which we have no access, which in the web world means it doesn't exist, right, because there's no way of getting at it. So let's think about structured versus unstructured data. I should say that I'm a neuroanatomist, not a database expert, so everything you get is filtered through my view of what these things actually are. If we look at something like a relational database, which is a very common way of structuring information, perhaps the most common way, it is basically represented by a data model. That is, there is a structure given to the data that represents what is inside, there are well-defined data types like integers, and there's a formal query language like SQL. So if I have a statement like this: "Mice aged 50 days were perfused with 4 percent paraformaldehyde and brains were sectioned at a thickness of 50 microns," that whole statement might look something like this in a relational database: there's a table called protocol, there are types, there's something called age, and it's an integer.
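The contrast between the structured and the free-text versions of that sentence can be sketched with Python's built-in sqlite3 and a deliberately naive regular expression. The table name, column names, and patterns below are illustrative, not NIF's or any real pipeline's:

```python
import re
import sqlite3

# Structured form: an illustrative relational table for the protocol.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE protocol (
    organism             TEXT,
    age_days             INTEGER,  -- a well-defined data type
    fixative             TEXT,
    section_thickness_um INTEGER)""")
con.execute(
    "INSERT INTO protocol VALUES ('mouse', 50, '4% paraformaldehyde', 50)")

# Because age_days is an integer, a targeted query is trivial:
rows = con.execute(
    "SELECT organism, age_days FROM protocol WHERE age_days > 40").fetchall()
print(rows)  # [('mouse', 50)]

# Free-text form: the same facts must be recovered by entity
# recognition. These naive patterns work on this one sentence and
# break on countless others, which is exactly the point.
text = ("Mice aged 50 days were perfused with 4 percent paraformaldehyde "
        "and brains were sectioned at a thickness of 50 microns")
species = re.search(r"\b(mice|mouse|rats?)\b", text, re.IGNORECASE)
age = re.search(r"aged\s+(\d+)\s+days", text)
print(species.group(1).lower(), int(age.group(1)))  # mice 50
```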
So I know what to do with that. When it's in free text, I have no way of knowing what it means, and there are different techniques we have to use, like entity recognition: recognizing that this is in fact a mouse, and that it's the subject of a study; that this is in fact an age and not something else. And I have to use algorithms like natural language processing to try to figure out what people mean. When I have a structure, and I know that structure, I can ask very targeted queries. So a query like "find all studies where mice greater than 40 days of age were used in immunolabeling studies that used confocal microscopy" is extremely easy to answer if you know the structure of the database. You might be able to pull it out of free text if you train your algorithms, and clearly NLP and related algorithms are getting better and better, but parsing what humans say and what they mean is still a very difficult task, and I think too often we expect these systems to catch on a lot faster than they do. So when we query the NIF, as I mentioned before, we basically divide all the information into three indices: the data index, which is the data federation; the literature; and the registry. A lot of people who start projects like NIF say: why do you need to query deep into the databases? And we say: listen, this came from practical experience in the early days, when NIF was just a catalog, and I think we can illustrate it right here. If I put "cerebellum" into the NIF, I get 3 million data records and 41 entries in the registry. Cerebellum is a big, big concept, and that generally means people have developed resources that would have been tagged with the term cerebellum, because they're about the cerebellum.
So here's an atlas of the cerebellum, a probabilistic atlas of the human cerebellum. But if I query for a single gene, GRM1, notice that there are no resources that are about this particular gene. You have a lot of resources that are about genes; in fact you have thousands and thousands of them, and you've got thousands of articles. But if you're looking to accumulate a data set, or aggregate data about a particular gene, you have to know which ones actually talk about that gene. So you see that there are 49,000 data records, coming from the sources listed here, that mention the term GRM1. Now, you can imagine that with the lack of terminology standards, it's not that easy to know that all of these different places are talking about GRM1. So there's a lot we've had to do on the query end to make it search for synonyms, variants, and related terms, to try to bring back what is found inside of these databases. So again, I've introduced some terms here, and there are no generally agreed, complete definitions for all of them, but generally we distinguish between data, which are the values, qualitative or quantitative (not all data is quantitative; some of it is categorical, belonging to a set of items), often the results of measurements; and metadata, which is typically thought of as data about data. And there are different types of metadata.
There's structural metadata, which describes the design and specification of the data structure. For example, if it's a database of images, it might give you image size, bit depth, integer versus string, that sort of information. But there's also descriptive metadata, which deals with other aspects of the resource: keywords, for example, or the creator and subject of the particular database, and which agency funded it. Those are all descriptive metadata. And it's important to remember that metadata are data, because oftentimes NIF gets criticized: well, you don't have data, you have metadata. And I say: well, the metadata point to data, and metadata are data; you can do a lot of things with them. There are also other terms, as we mentioned: data type, for example, which is the form of the data for purposes of a data operation. We saw that in the previous slide: when I say that something is an integer, I tell you what sorts of operations can be performed on it. When something is a string, you very much limit what can be done, because there are only so many things you can do with strings. If I say it's an image, again, that tells you the set of operations you can perform. And of course, the holy grail of almost all of our efforts is data integration, especially in neuroscience: the idea that we can take data from these different places and meaningfully combine them into something which tells us something significant about the system we're studying. So it's important to recognize, when you're actually going around looking at these data resources, that they're not all of the same type; you have data in different stages of processing being made available. Taking a look at some of the thousands of databases, you see that there's primary data: data that are the generated measurements, or close to the form of the generated measurements, made available for re-analysis. So something
like the GEO microarray database, or the XNAT neuroimaging database, or the Image Library makes this data available to you. There's secondary data, which makes features extracted through data processing, and sometimes normalization, available for query. You've all probably seen or heard of the Allen Brain Atlas, correct? If you look at what you're querying there, you're not going in and directly querying the images; you are querying an abstraction, an image analysis of those images, so that you can query quickly and effectively across all of this information. If you had to download and parse each of those individual images, it would be a lot more difficult. So just because it's secondary data does not mean it isn't very powerful. There's also what I call tertiary data (these aren't official terms): because so much information is published in the literature, you see a lot of databases that essentially are claims and statements extracted from the literature and then made available in structured form, so that they're more queryable. In many cases these are annotations, or what I call claims and assertions about the meaning of data. You also see that there are different types of resources. There are registries, which are generally high-level metadata descriptions of data sets containing pointers to those data sets. There are data aggregators, that is, resources that take data from multiple different studies; many of these are repositories, so you can contribute your data to them. They take multiple different pieces of information, usually not standardized, and put them together; GenBank, again, would be one of those. And then there are single-source databases: data acquired within a single context. These tend in many cases to be the most useful, depending on their nature, because it's all unified; it all works together.
It was all meant to work together, and generally there's a lot of utility in these, though there's utility in all of these types if you really look at them. But it's clear, again, that researchers are producing a variety of information artifacts using a multitude of different technologies, and the idea of NIF is that we have to be able to handle all of them. So, why is a NIF needed? I get that question all of the time as well. Again, there are many parts to NIF, and there are many valuable data sets in NIF by virtue of our having tracked the resource landscape for a very long time. But over the last couple of years, I would say the last five or six years, there has been a sort of consensus among multiple different groups, which have all come to a similar set of conclusions about what it really takes for data to be effectively shared. NIF came up with a list like this; the Royal Society's report on open science came up with a list like this as well. When you start to look at all the variety of things that scientists produce (and NIF doesn't just go after big data sets; if you came to us and said, hey, we want to make our data available, we would make it available), you see that the first and most important thing for data to be useful is that it has to be discoverable. It has to be findable, and as we'll talk about in a minute, that's actually not the easiest thing in the world, because search engines like Google generally cannot get at data that is stored inside databases. It has to be accessible: it has to be able to be accessed, and the access rights have to be clear. For those 3,000 databases that NIF has, we have 3,000 ten-page licenses that explain under what circumstances you can actually use those data. There's a lot of work going on in machine-processable licenses, but we haven't gotten there yet. There's assessability, that is: how good is this data source? Can I rely on it to do my analysis?
Can I make appropriate claims about it? It was interesting that when I gave my neuroanatomy students an informatics task, where they were supposed to go answer questions and evaluate different resources, the very first thing that every single one of them said was: it was easy to use; I could figure out how to use it. That was their most important criterion for assessing a resource. Nobody talked about the population, nobody talked about where the data was coming from, nobody talked about who curated it. It was: hey, I could figure out the user manual. So it's kind of an interesting thing when you think about it. The data have to be understood, and we're going to talk a bit about that when we discuss ontologies. Those of you who work with data know this: you get data from a colleague, and you see that column 6 is called "12." "12" doesn't tell you very much. It might tell you a whole lot within the original context, or if your colleague is there to ask, but it doesn't say much to a machine. So the data need to be understandable, and this has many levels. And the data need to be usable. And this is something that's perhaps not a shock to the people in this audience, but NIF gets a lot of data that comes from scientists.
They say: you can have my data. And it's a picture of a spreadsheet, or it's a table in a PDF. So not all data is equally useful and actionable. And what you really start to see is that there is, again, this duality to modern scholarship: there's a human and a machine dimension to every single one of these requirements, and they don't always align. If you put a table in front of a human with big dots and small dots, the human immediately understands the symbolism and understands what's being conveyed. A machine has a heck of a time, in many cases even if you have a line in the middle of it, and you have to go through all kinds of things to make that data actionable. So there's a lot of ignorance in the biology community, because that's not where their skill set is, about what it really means to produce data and make data available. The next thing people say is: we have Google, what do we need this for? And Google is a fabulous thing, and I keep hoping that we will be replaced by Google; we're doing our best to expose ourselves to Google, and then I'm going to retire. But basically, the current web is really designed to share documents. It is not meant for data, and even Google has a hard time with data. Much of the content of these resources is called the hidden web, the dark web, the deep web, and that's because if you think about what gets returned from a dynamic database query, it's not a static URL. It depends on the contents of the database.
It depends on your query. So if Google happens to be crawling right when you've generated that dynamic page, depending on how you produce it, it might index it. But by and large these are dynamic things, and Google has a hard time actually understanding them. And if we really think about it, we are in the midst of a tremendous revolution. You were all born in the middle of it, but I'm old enough, as are many people in this room, to remember a time when all of our data was in books, or in libraries; that was the only place we had to put it. We then got personal computers, and so we had programs like spreadsheets, and we could work with our data by ourselves. We then got networks, and almost immediately we started to have web-accessible databases, where you could go and query the PDB and GenBank. And the next big evolution, which people have been working on for decades, is to use the web as a mechanism to access data. So you'll hear about the web of data, the semantic web, linked data. These are all ways of trying to expose the structure and content of data to a search engine, or to a URL-based query mechanism, so that you can get at it, and we'll talk a little more about that later. But I should say that for many years we labored at this and got collective yawns from most of the NIH and most scientists, because they said: I know my data, I generate my data, and if my colleague wants it, I give it to him. They never thought about exposing things on the web. But big data, streaming data, being able to get access to, for example, all of Twitter, the ability to combine all these streams for predictive science, has made big data all the rage, even at NIH. There are big programs, like Big Data to Knowledge, starting to come out, and they just hired a director of data science, recognizing that data science is going to be critical. But it is
interesting that even Google has problems with data. So those of you who have been on Google lately may see these little information boxes. If you type "Leonard Nimoy," it gives you these little facts about Leonard Nimoy: when he was born, when he died, who he married. And this is based on essentially Google's version of the semantic web, if you read some of the snarky comments, called the Knowledge Graph. It is a way to take all that information in web pages and databases and structure it in a way that Google can access. But if you read in depth about it, you see that there's something behind it called schema.org. And if you read what it says, it's like: many sites generate structured data, and when this data is formatted into HTML, it becomes really difficult to recover the original structured data. So basically you need somebody to go in and annotate and mark this up so that Google understands what is in your database, and they've invested a lot in this. But it basically says that even Google needs a knowledge framework, and it needs some human intervention, to try to make the data accessible to the web. And we are working very hard on this, so that we can give protocols to people producing scientific content, so that maybe someday we will just be a sidebar in Google rather than a separate thing. But for right now it's not that easy to make your data available, though I think this is coming. And so NIF developed a data ingestion architecture. It's actually based on some tools coming out of Yale University called DISCO, for discovery. But again, because there are no requirements for data sharing, and because we really don't know exactly what the best way is to make data available, we designed NIF to have a very low barrier to entry. So the lowest barrier to entry is just to make a registry entry in NIF to say: I have a tool, it exists, here's something about it. It takes five minutes.
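The schema.org idea can be made concrete. The sketch below is a minimal, illustrative example of the kind of JSON-LD markup a site could embed for the Leonard Nimoy info box mentioned above; the property names come from the public schema.org vocabulary and the embedding pattern is the standard one, but the record itself is just an illustration, not Google's actual data.

```python
import json

# Illustrative schema.org-style structured data: the machine-readable
# annotation that lets a crawler recover facts plain HTML obscures.
record = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Leonard Nimoy",
    "birthDate": "1931-03-26",
    "deathDate": "2015-02-27",
}

# Sites embed this as JSON-LD inside a <script> tag in the page.
snippet = '<script type="application/ld+json">{}</script>'.format(json.dumps(record))
print(snippet)
```

With this in the page, a crawler can parse the facts directly instead of scraping them out of rendered HTML.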
Anybody can do it. We have automated pipelines that trawl through the text of published articles looking for research resources. We have curators and nominations by the community, and as you'll see later, each of these is given its own unique identifier that is being used to identify these things in text. The data federation requires some programming skill; it involves actually using the DISCO interoperation tools. But again, it was designed to be very, very quick, so you could take a resource like Open Source Brain and within two hours expose it to the NIF search engine. And we did that on purpose, because there's a lot of really deep work on data integration, but we don't believe that any solution that requires an NCBI-level institute is going to be practical. Secondly, we had a lot of efforts on deep data integration, where you could do a lot more, and after two years they had four data sources actually deeply integrated. But if you've got thousands of these, you can't spend that much time on them. So you've got to have a way to get them in quickly, and then also make it so that you can incrementally refine them as needed for various operations and as technology changes. So when a source comes into NIF, we do several things, and one of them is that we try to unify it. So you saw that the initial result is a result list, like Google; if you click on any one of these, it gives you a table. And we have to take very, very complex sources and figure out what sets of meaningful information we can fit on a page, because that's how you browse: you don't want to have to go off to 10 or 15 pages. So we try to get the key things about that database available. We do various things: for example, we use search variants to make sure that we get all the appropriate terms, and we categorize each of the data sources by data type and level of the nervous system. Data type is very loosely defined: what would you go to that database to find out?
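The search-variant step can be sketched in a few lines. This is a toy illustration, not NIF's actual pipeline; the synonym table and function name are invented for the example.

```python
# Toy sketch of query expansion via search variants: a term is mapped to
# its known lexical variants so one search covers all of them.
# The synonym table below is invented for illustration.
SYNONYMS = {
    "striatum": ["striatum", "neostriatum", "corpus striatum"],
    "cerebellum": ["cerebellum", "Cb"],
}

def expand_query(term):
    """Return every variant to search for, falling back to the term itself."""
    return SYNONYMS.get(term.lower(), [term])

print(expand_query("Striatum"))
print(expand_query("thalamus"))  # unknown term: searched as-is
```

The point is simply that the user types one term and the system quietly searches all of its variants.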
We always provide links back to the record in the original source. We provide tutorials and various query expansions. But again, I think one of the most important things we do is that we make everything kind of look and feel the same, so that you can browse through the data very, very quickly. And that turns out to be, I think, a significant advantage of going through NIF when you're looking for data. Not that NIF replaces these resources, but you can see that BioNOT, BrainInfo, Gemma, microarray sources, all kind of look the same. In contrast, if you go to three different databases, even ones all about connectivity, you'll find three very, very different interfaces. And so it's very difficult to go and learn and see what's of value when you have to relearn the system every time you go. And I think it's important to recognize, again, if we think about our publishing environment right now: we have thousands of independent journals, thousands of them. But if I were to show you a scientific paper with the text blacked out and ask what this was up here, what would you tell me? This is probably authors and affiliations. What's this? Abstract. Which part of the paper is this?
Introduction, right? So we have a pro forma format that makes everything be the same. So regardless of the fact that there's variation, when you open up any of these journals you don't have to spend a whole lot of time figuring out how the information is structured. So even though every time I suggest that some sort of standards for biological databases would be good, people recoil in horror that you would ever have such a thing, I remind them that we've got hundreds of types of cars, but I can rent a car, get in, and within five minutes figure out how to drive it off. Yes. Yes. So it turns out that ranking is very, very difficult for a discovery portal, because when you put something in, we don't know what it is that you're looking for; popularity is not necessarily the appropriate ranking, right, because you have in your mind what you want to do with this data. They often say that scientists are the only ones who look beyond page one of Google results, right? Because of that. But I would say that if you look at the ticketing system of NIF, and the tickets that come from me, my number one complaint is ranking, and it turns out to be a very difficult problem, because you don't know whether you want something very specific or you want kind of a broad sweep over the data. So, I say this is a discovery portal, but I'll talk a little bit later about how we can make use of behavior to try to improve the way that we present results to a community. It's a good question. So what can we learn from the NIF data federation? If we look at the growth of the NIF data federation over the last few years (and this is the classic Microsoft Excel: you cannot go from one place to the other without it screwing up dates, so data transformation is very difficult), this green line here is the growth in the number of records in NIF.
This probably ended around April of last year. And you can see that early on there was a rapid growth in the number of records, and that was largely driven by putting in large sources like the Allen Brain Atlas, for example, and some of the BrainSpan data. But most of the growth is actually relatively small. So what you're seeing is the cumulative effect in NIF of what's often called the long tail of small data. The idea is, of course, that there are a couple of really large data sources that drive it, but most of the data is actually in this long tail. These are smaller databases, smaller individual data sets, and the question has always been: can you learn anything reasonable from aggregating this information together? Yes, we can make it available, but is this data in any way useful? And recently we've been doing some things to look at the analytics, not of any individual data source, but of the data federation as a whole. And this was very apropos, I think, to the conversation that we had yesterday about known unknowns and unknown unknowns and all those other things. So what you see here is a heat map, and across the top are the 200 or so data sources that we queried at the time that were available from NIF, and these are all sorts of things. They're all different types of resources: primary, secondary, tertiary. And on the left is our ontology of brain structures, our list of brain structures. There are about a thousand brain structures or so that we have, and it's reasonably comprehensive (not for spinal cord, as we know, but there's a lot of stuff in there). And the first thing that pops out is that it's very sparse: we know a whole lot about some things and we don't know so much about other things. And if you actually use the structure of the ontology to give you some guidance, what you see is what we heard yesterday, that there's a tremendous bias in the data space toward the forebrain.
So there's a lot of information about forebrain. We didn't differentiate all the different parts of the cerebral cortex, but this would get even heavier. There's a lot less about the midbrain, and a lot less about the hindbrain. You also see quite clearly that there are some neuroscience-specific data sources here, because these are all anatomy databases, and so they have a lot of information about anatomical structures. But you also see these horizontal striations, and if you look at those, you see they are high-level terms like thalamus, hippocampus, striatum, and cortex that get mentioned across databases that are not from neuroscience in particular. So there's a level at which knowledge crosses over from field to field, but it is a very high level compared to what we would want to know in neuroscience. And so it's interesting to ask why this gap is here. Now, granted, we don't have all 3,000 databases, and this is a dynamic thing, but there are 800 million records there; it's not a trivial thing. And so you might say, well, we've bothered to name all these structures, we've bothered to name all these brain regions; why do they not show up in the data space? And if you look at text-mining articles that did the same thing in the literature with brain regions, they found the same thing: a large number of brain structures never show up in the literature, at least in the abstracts. So you might say, well, maybe we've named a lot of brain parts that actually have no functional significance. I mean, it's highly possible that we have done such a thing. We conceptualize things, but it may turn out that they're very specific and not all that useful to us. But if you look, the midbrain actually had very few annotations. And this is the new Allen brain connectivity matrix, and this is the midbrain right here. It turns out it's one of the most connected regions in the brain.
It has a very, very heavy set of interconnections. So we've ignored it, but it probably is a major waystation in terms of connectivity. There's another interesting thing that we find, and that is that the reason something is not there might be that it's below the experimental resolution, or it doesn't fit the prevailing paradigm. We know, for example, in neuroimaging that the cerebellum was ignored for many, many years. Even though it lit up all the time, it was filtered out, because it was either considered an artifact, or it didn't fit our model that it's a motor structure, and it pretty much lit up for everything. So you see it in the literature going way up as the techniques got better and we started to realize that actually the cerebellum is a major waystation. And this paper right here says, you know, human neuroimaging is typically performed on a whole brain, but the tail of the caudate is not easily resolved. In fact, it doesn't even appear in FreeSurfer and a lot of the structural atlases, because it's difficult to resolve; it just doesn't show up. So there are a lot of things which will not show up in the data space because we don't have a technique, or the prevailing paradigms are not sufficiently sensitive to be able to see them. And I've talked with various colleagues across different domains, and even in the fields of molecular biology, pathways, and others, there's an incredible bias towards certain structures, and other things never show up anywhere at all. If we combine that with all the dark data and the other things that we're not analyzing, you realize we're building models and doing analyses on very limited subsets of the data. Yet again, we've bothered to name all of these things, so clearly somebody thought that this was differentiable. So I think it's just an interesting insight, and without a NIF and those ontologies there'd be no way to actually look across this. Almost every system only bothers to name
the things that they study. They don't put all the other things in there and then say, "We don't have any information on them." This also lets us look at something very interesting: we noticed that there was one source that seemed to correlate extremely well with the data space no matter which branch of the ontology we looked at, and it turned out to be NIH RePORTER, which is a database of funded grants. So if you actually plot the number of times a term is mentioned in our data space against the number of times it's mentioned in NIH grants, except for those floor effects (because a lot of terms aren't mentioned at all), you see this lovely correlation that says: if you get funding for it, you produce data and you put it out, and if you don't get funding, you don't. So again, it's an interesting correlation, and it's perhaps not appreciated how much factors external to science drive what it is that we do. And again, in a data-driven world, I think these are important things to be able to uncover. No... well, I guess funding would not be... no, it's not. No, no, but I meant not in terms of what we can do in the laboratory. But there are sociological factors that drive funding, for example, that also drive our science. So it is integral; it's outside of the laboratory, I should have said. So we've also learned something else about the nature of this data landscape. We know that originally NIF was actually conceived to try to avoid duplication. We had a very famous example where somebody was in a room saying, "Well, I'm going to develop a database on small interfering RNA, because there are no databases on small interfering RNA." And we typed it into NIF, and we said, actually, there are 11 databases on small interfering RNA.
It's just very difficult to know what they are. But we also see that the data landscape is a very fluid place, in that there are many data sets that get ingested and moved. Databases get partially ingested by something else, people add value to them, they recreate them, they change them, and we currently don't have any good system for following the data as it moves through this ecosystem. But clearly, if we cannot stop duplication, the next best thing, or perhaps even a better thing, is to be able to learn from that duplication. Very rarely do people model the same data twice; very rarely do they find the same information in it. So if we can start to track and watch these things as they go, they teach us something about the nature of science; they teach us about the concepts that we use. For example, you'll often hear, "We need well-curated data." There's not all that much consistency from curator to curator, right? It is a skill, and so if you give the same data set to two people, they will extract different things; they will model it in different ways. We would like to be able to learn from that. So one of the things we need to be able to do, and we're working on it in NIF, is creating an identification system that allows us to track where all these resources go. And really one of our current challenges, because we have so much available, is: what is the equivalent of PageRank that puts the most relevant information ahead of everything else? And that turns out to be an extremely difficult question, because if you start to look at all the databases and the way things are modeled, sometimes there's a direct answer to your question. So is there a database for small interfering RNA?
Yes, and here they are. In most cases, though, if you look at the types of use cases that get pushed through NIF, it'll be a conceptual query: what genes are upregulated by chronic morphine? And the answer is always: it depends. It depends on your definition of chronic morphine, it depends on your definition of upregulated, it depends on which data sets are available and which tools you use. So what most people are really asking is: connect me with a possible data set and a possible set of tools that can actually be used to derive this information. And many databases actually have tools and workflows that support this, but it's very difficult to know that. So we're starting a much more in-depth analysis, passing use cases on to these different databases and gathering what these databases are good for, so that we can help you. But we've also started to work on several different ways to try to make it easier to find your way through NIF. So one of the things we have, and I very rarely do live demos when the web is involved, because it's very, very slow: one of the things we've found is that when you actually search NIF, you do far better by putting in very general search terms. That's different from Google, where you put in very, very specific terms. It's better to put in general terms and then let the system itself tell you what you have available and where you should explore next. So NIF makes a lot of use of facets, categories, and other things that help you go in and determine what it is that you are looking for and which sources are available. Sometimes we integrate across sources; most of the time they come in as individual sources. So we provide these tools to allow you to go in and look at these. Again, we've created a Google-like interface.
These are snippets that are automatically generated from the data, and then we unify the data. We provide in-source facets for you, so you can look here for expression level and say, "I would only like to see increased expression." We provide in-column filters that allow you to explore those much more easily. So the search strategy that you use, I think, is a little bit different. The thing that we found, though, is if you actually track this through, you do see that the same data set often appears in multiple resources. So we had a database called the Drug-Related Gene (DRG) database, which was curated from the literature and tables. Some of those people deposited their data into GEO, and some other resources took the data from GEO and reanalyzed it. So we said, I wonder if we could use NIF to find out how many data sets have been reanalyzed by multiple different parties. And it was very interesting that when you tried to put the data sets together, every way that they could possibly be different, they were different. So if you looked at Gemma, they used gene ID and gene symbol; the DRG used gene name and probe ID, and you had to match those things together. When we first aligned the different data sets, we found that they were 100% opposite from one another: every gene that was upregulated was downregulated. We said that can't possibly be so, and we contacted Gemma. Gemma uses an automated algorithm to analyze their data; the machine automatically and randomly assigns experimental and control groups. So everything was reported as control relative to experiment, while the human-curated database did it experiment versus control. So we had to flip them all over. When we finally got everything analyzed, though, over half the claims of the paper were not confirmed in the reanalysis. So the idea that you have one data set and one algorithm, and that that is your result (and we saw that in the last couple of days), I think can now be challenged by the availability of
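The convention mismatch described here, control-relative versus experiment-relative fold changes, keyed by probe ID versus gene symbol, boils down to a small alignment step. All numbers and identifiers below are invented; this only sketches the flip-and-rekey logic.

```python
# Invented data illustrating the alignment problem: one source reports
# log fold change as experiment-vs-control keyed by gene symbol, the
# other as control-vs-experiment keyed by probe ID.
drg = {"Fos": 1.2, "Arc": -0.8}               # experiment vs control
gemma = {"probe_17": -1.1, "probe_42": 0.9}   # control vs experiment
probe_to_symbol = {"probe_17": "Fos", "probe_42": "Arc"}

# Normalize to one convention: flip the sign and re-key by gene symbol.
gemma_norm = {probe_to_symbol[p]: -v for p, v in gemma.items()}

# Now the direction of regulation can be compared gene by gene.
agree = sorted(g for g in drg if (drg[g] > 0) == (gemma_norm[g] > 0))
print(agree)
```

Without the sign flip, every gene in this toy example would appear to disagree, which is exactly the "100% opposite" artifact described above.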
these other data sets and tools, so that you can do these analyses in multiple different ways. Again, relatively simple standards, like control versus experiment and using the gene ID, make life a lot easier. But in the end, the ultimate result is not usually a set of conclusive findings; it is an information space, a data and tool space, that you use to try to get the answers that you want. And that's why this sort of PageRank is so difficult: how do you get people to know this, when there are so many different ways to do it? The second way that we are starting to work is that we know that there are different places that operate on these different data sets; we know that they're ingested. So we're starting to incorporate those links inside of NIF when we know such a thing has happened. For example, this is the integrated model database that NIF has. It has ModelDB, NeuronDB, Open Source Brain, a bunch of different places that publish models, with very loose semantics. And there's a project at the San Diego Supercomputer Center called the Neuroscience Gateway, where they get a lot of these running on some of the XSEDE and parallel machines. So they've taken all of the models, for example in ModelDB, and seen if they could recompile them and get them to run. Some of them run, some of them don't, but if you come across one of these in NIF, we let you know that it is in fact running over at the Neuroscience Gateway, so that you can take advantage of that if you want. No, they can't get them all to recompile. Somebody else might be able to someplace else, and so we'd like to provide these tentacles that let you know that there's a path here to go to that data, because again, this is a matter of art in many cases, and of resources, not necessarily that the model is good or bad. The third thing that we're doing is something called SciCrunch. A lot of different communities are trying to do what NIF does, and they ask us, "Can we build a NIF for X?" And we realized that, in fact, there's a lot of need for
individuals to create their own spaces with their own resources in them that limit the complexity, where they can add additional facets and curation, and organize the data the way that they want. Since the NIF system itself is really not neuroscience-specific (it's only the content that we put into it), we created something called SciCrunch, which is in beta release. Essentially it allows you to take these data sources, the ontology infrastructure, the data infrastructure, and create your own portal. You can actually stand up a data portal in an hour or two if you'd like to. Only instead of creating silos like we typically do, where all of these things are disconnected, because they're using a shared infrastructure they can feed each other; they can take from each other. We can also learn what people do to these data sources in these different domains, so it's very cost-effective. But it also has sort of a social networking feel, in that while you're doing your work, we're learning about what you're doing; we're learning about which data sets are important. So right now SciCrunch supports one, two, three, four, five, six different portals that are all built from the same resources. Each one says which data sets they would like, how they would like to organize them, how they'd like to organize their facets. Do they want to categorize them in this way or that way?
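A portal in this model is essentially configuration over shared infrastructure. The sketch below is hypothetical (it is not SciCrunch's real configuration format): each portal is a selection of sources plus presentation choices, and an addition made through any portal enriches the shared pool for everyone.

```python
# Hypothetical portal configs over a shared resource pool (invented
# names; not the real SciCrunch format).
SHARED_SOURCES = {"ModelDB", "GEO", "BrainInfo", "dkNET-diabetes"}

portals = {
    "neuroscience": {"sources": {"ModelDB", "BrainInfo"}, "facets": ["anatomy"]},
    "metabolic":    {"sources": {"dkNET-diabetes", "GEO"}, "facets": ["disease"]},
}

def add_source(portal, source):
    """Adding through any portal also enriches the shared core."""
    SHARED_SOURCES.add(source)
    portals[portal]["sources"].add(source)

add_source("metabolic", "new-diabetes-db")
print("new-diabetes-db" in SHARED_SOURCES)  # now visible to every portal
```

The design choice is the opposite of a silo: the portal owns its view, not its copy of the data.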
We even have the NSF EarthCube here. So it's just a general model, but each one of them also enriches the core. So anytime a data set is added through any of these, it becomes immediately available to anybody else. Anytime a data set is curated, it becomes available, and the views become available. So here, for example, you can see diabetes-related resources that are in the NIF portal because of dkNET, the metabolic disease portal. They become available, and of course diabetes is very relevant to neuroscience. We also see phenotypes that come in from the Monarch project, which I'll show you in a bit, which has a very sophisticated set of algorithms for comparing animal models and disease. And we've also started to track which ones get ingested by which communities, so you can get some sense of which ones are important to different domains by who is actually bringing them in and putting them into the space. So we're starting to develop algorithms that help to share expertise across communities, because oftentimes you find that the neuroscientist's view of diabetes may not be the same as the metabolic disease researcher's view of diabetes. And so this is a way of letting people know, by the tools and the data they use, who else is in their circles. So this is very early, but as I said, we've already had quite a few portals created. So let's look a little bit at the nature of the information frameworks themselves that we use inside of NIF. I mentioned bringing things together, sort of unifying them, and at various times I've used the word ontology. I'm sure many of you have heard the term ontology; many of you may even be working on ontologies. This section is going to talk a lot about that. But the idea of the information framework itself is basically that it's a tool for analyzing and structuring information, and I liked the definition yesterday of information as a reduction of uncertainty.
So I put that in there. So what constitutes an effective information framework for neuroscience? We really did struggle with this. We have mentioned the fact that we deal with many different data types, many different species, many different techniques and experimental systems. And the two most common frameworks that are used for organizing knowledge are, first, a spatial one. We saw that yesterday with the Blue Brain: the mouse brain is a beautiful container for organizing information about the brain, and it's certainly what biological systems do. There's also, though, knowledge in words, where you have terminologies and other sorts of relationships that you use to describe your domain. That was the example of those brain structures that we had, and this is a very important way, obviously, that humans communicate with each other. We speak to each other in terms; we tend not to speak in coordinates. So there's usually a back and forth between the two of these. And you can't really see it from the contrast here, but basically this would have been a picture of the cerebellum. This is a Purkinje cell, a western blot, a plot, an electron micrograph, a microarray. And if you look at these things, there's little obvious relationship between them, so just putting them all in the same space, even if you could do that across all the different species and developmental ages that we work with, doesn't tell you a whole lot about how these things relate to each other. It also doesn't tell you what this is: if you're walking by and you look at this, you have no idea what it is. You cannot look at that and say, "Oh, that's cerebellum." There's no meaning to it. So what really connects neuroscience together as a whole is not the data type, as in genetics; it's not protein structure. It is in fact the domain knowledge that we have of the things that are important to us and how they are connected to each other. We had that question yesterday.
How do I know that that's a synapse, right? I mean, there's a whole bunch of domain knowledge that you are introduced to, and really the field of ontology is all about that. It is about the meaning of things; it is about the concepts that you use to describe your domain; it is about how you organize domain knowledge. The field of ontology itself, an explicit formal representation of the concepts in a particular domain and the relationships among them, has been around for thousands of years. We have wondered for a very long time, from Plato's forms to Aristotle: what is the nature of the things that we talk about? How are they related to each other? But in recent years, this has taken on a very specific meaning. It has taken on the idea that I'm going to express knowledge in a form where computers can do some of the same sorts of reasoning that a human being can do. So usually in an ontology there's a set of concepts and a set of relationships, and those concepts are hierarchical. So I have something called an organ, and a brain is a type of organ. And there's some reasoning that can be done, which basically says that, you know, a Purkinje cell is in the Purkinje cell layer, a Purkinje cell is a neuron, therefore there are neurons in the Purkinje cell layer: all types of basic reasoning. So it is this idea that we can express human knowledge in a way that is machine-computable that has really driven investment in biomedical science in ontologies and related products. So NIF also has said this is the framework that we use to organize knowledge. Everything that comes in is mapped, as much as possible (and this is a very difficult thing to do), to some of these ontologies. So we've assembled a set of ontologies that have been developed by the community. There's the Gene Ontology, the Protein Ontology, small molecule ontologies, NeuroNames, the FMA, all kinds of community ontologies, and NIF has joined them together to create its
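That Purkinje cell inference can be made concrete in a few lines. The facts and relation names below are simplified stand-ins for what a real ontology language such as OWL encodes formally.

```python
# Simplified stand-in for ontology reasoning: a tiny is_a hierarchy plus
# one located_in fact. Real ontologies encode this formally (e.g. in OWL).
IS_A = {"Purkinje cell": "neuron", "neuron": "cell"}
LOCATED_IN = {"Purkinje cell": "Purkinje cell layer"}

def ancestors(term):
    """Follow the is_a chain upward, collecting every superclass."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain

# A query for "neurons located in the Purkinje cell layer" should match,
# because a Purkinje cell is a neuron and is located in that layer.
matches = [t for t, layer in LOCATED_IN.items()
           if layer == "Purkinje cell layer" and "neuron" in ([t] + ancestors(t))]
print(matches)
```

The machine never saw "neuron in Purkinje cell layer" stated anywhere; it derived it from the class hierarchy, which is exactly the kind of reasoning a human does implicitly.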
information framework, which helps relate all these different entities across the different databases and representations. And it's actually a very impressive knowledge base, because if you take the expert knowledge that you need to navigate all of these data sources, what you are in school to learn, and plot it as a graph, it looks like a little nova here. That's the entire NIF ontology; this is just the NIF neuron ontology right there. Okay, there's a lot of information in there that we use to make sense of these sources, and clearly, in trying to search across this information space, we need some computability. So what can ontologies do for us? I'd say it is one of the most contentious areas in biomedicine. There are people who believe in them, and there are people who despise them. However, if you use them for what they are intended for, and you recognize that all of you are filtering your search and everything else through a domain model (when you get indoctrinated in neuroscience, people sit you down and say, "This is more or less the way we think things are put together"), then you find out that they're very, very useful things. So the main thing is they express neuroscience concepts in a way that is machine-readable. You already saw that in NIF we use synonyms and lexical variants, which are very, very important. An ontology provides a means of disambiguating strings. Everybody knows that the word "nucleus" can mean many, many different things to a plain string search; who the heck knows which one it is? "CA2" can be the hippocampus; it can be calcium. So all of these things are ambiguous, and when you use pure string search, it takes a lot of extra effort to disambiguate them. In an ontology, everything is given a unique identifier, essentially a social security number, so there is no disambiguation required.
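The identifier point can be illustrated directly. The IDs below are made up (real ontologies assign stable identifiers from community vocabularies), but the principle is the same: the string is ambiguous, the identifier is not.

```python
# Illustrative only: the same string denotes different concepts, so each
# concept gets its own identifier (the EX: IDs here are invented).
LEXICON = {
    "CA2": [("EX:0001", "hippocampal field CA2"),
            ("EX:0002", "calcium ion")],
    "nucleus": [("EX:0003", "cell nucleus"),
                ("EX:0004", "brain nucleus")],
}

def senses(string):
    """All concept identifiers a bare string could refer to."""
    return LEXICON.get(string, [])

# A string search must pick among these; a search on EX:0001 never does.
print(senses("CA2"))
```

Once data is annotated with the identifier rather than the string, joining across databases stops depending on guessing which "CA2" anyone meant.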
It's very clear which one you mean. But in the area of data integration it also provides the universals that let you go from one source to the other. It lets you join the microarray, and it lets you join the image, because it understands the things that those things reference: the genes, the molecules, the cells, the brain regions, whatever is there. We do not put too much stock in ontologies as methods of truth. We say it is domain knowledge; it is our current domain knowledge; it is somewhat general. We know that in specifics you will always find exceptions, but it provides a very powerful semantic search index through which you can organize information. And you see this even with Google and its Knowledge Graph. Google always ran away from semantics, but they went back to semantics and said: this is Leonard Nimoy, he's a person, and these are the things I know about people. Okay. It provides the basis for concept-based query. You already saw how we used it to do a landscape analysis, and for some types of knowledge it is actually a fairly good data representation that lets you do a lot of things with that knowledge. Not all types, but some types.
So again, as with anything, it is a tool that when used appropriately is very powerful. When any source is ingested into NIF, the very first thing we do is map the columns that come in to one of these entities. It turns out that we have four or five thousand columns, and about half of them map to one or more of those entities: cell, anatomical structure, disease, technique. These are the things that we talk about, and that already helps us a lot. If you query NIF for "cerebellum", for example, you will get things like the zinc finger protein of the cerebellum, because genes, diseases, all kinds of things have the word cerebellum in them. But if you put in "anatomy:cerebellum", it only searches for cerebellum in those things that have structure names, because it's an anatomical structure and that's what you're asking for. So we do use these as search filters to reduce false positives. We also use them to provide meaning. We said you had to make your data understandable, and a lot of databases use the value "1", for example, to mean male, or female, or cerebellum, or Alzheimer's disease; they use custom abbreviations. We can't possibly collect all of those abbreviations, but what we can do is map each one to the appropriate identifier so that we don't need to worry. So if you query NIF for Brodmann's area 10, it also gets "Brodmann.10" or whatever form is used in some database. That sort of disambiguation is very important for us. We also use ontologies to help you probe the information space. If you ask NIF what genes are upregulated by drugs of abuse in the adult mouse, we need to translate what "adult mouse" is, because a lot of databases give us age. You might say, as a scientist: well, you should always give us age, and that is true. But if you went to Google and I asked you to find adult squirrels, would you know what the age is?
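As an aside, the facet trick just described, a plain string search versus an "anatomy:cerebellum" filter over mapped columns, can be sketched in a few lines of Python. This is a minimal illustration; the records and the column mapping are invented for the example, not NIF's actual schema.

```python
# Toy records: each has been mapped to an ontology-backed "anatomy"
# column, so a search can be restricted to that facet. (Data invented.)
records = [
    {"name": "zinc finger protein of the cerebellum", "anatomy": None},
    {"name": "Purkinje cell recording, cerebellum",   "anatomy": "cerebellum"},
]

def search(term, facet=None):
    """Bare string search over names, or a search restricted to one facet."""
    if facet is None:
        return [r for r in records if term in r["name"]]
    return [r for r in records if r[facet] == term]

# The bare string search matches the gene name too (a false positive)...
assert len(search("cerebellum")) == 2
# ...while the anatomy facet returns only the true anatomical hit.
assert len(search("cerebellum", facet="anatomy")) == 1
```

The point is only that the facet narrows matching to columns already known to hold structure names, which is why it cuts false positives.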
Right. So oftentimes we use these categories to help narrow down what we're searching for, and it's a very effective search term. For drugs of abuse, you'll notice we're returning morphine here, because morphine is a drug of abuse. So we use ontologies to encode and automatically execute rules that define certain classes, where we think it's appropriate. Another important reason for ontologies, as a data integration framework, is shown right here. NIF had seven databases when I did this analysis (it has more now) that dealt with connectivity between brain structures: brain region A and brain region B were connected to each other with some strength. We extracted 800 unique brain terms from those seven databases. We excluded the avian brain because the nomenclature hadn't been modernized. What we found was that the number of exact terms used in more than one database was 42. Not in all databases, just in two; that means the exact same string. Synonyms help, because with synonyms we got to 99. But most of the terms were not related at the level of superstructure; most were related through the partonomies, so that you had multiple different species and multiple different parcellations that were part of some superstructure. The only way you could join them together was by understanding the part-of relationships between these different things. It was not a direct join; it was a join through a relationship, and more often than not this is what we find: a join through a relationship, not a direct join. So ontologies are rather critical for trying to put these things together. Now, I don't have much time to go into how one builds ontologies. This is, again, a very contentious area: there are different schools of thought about how you name classes, whether you share ontologies, whether you use a language called RDF versus OWL, single versus multiple inheritance, should I encode everything in my ontology?
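The partonomy-based join described a moment ago can be sketched as follows. This is a hedged toy example: the regions and part-of links are invented, and a real partonomy would be a graph in an ontology language, not a Python dict.

```python
# A tiny partonomy: child structure -> the structure it is part of.
part_of = {
    "CA1": "hippocampus",
    "CA3": "hippocampus",
    "hippocampus": "forebrain",
}

def ancestors(region):
    """Walk the partonomy upward, collecting every containing structure."""
    out = []
    while region in part_of:
        region = part_of[region]
        out.append(region)
    return out

def joinable(term_a, term_b):
    """Two terms join if one is the other, or contains it via part_of."""
    return (term_a == term_b
            or term_a in ancestors(term_b)
            or term_b in ancestors(term_a))

# "CA1" (database A) and "hippocampus" (database B) never match as
# strings, but they join through the part-of relationship.
assert joinable("CA1", "hippocampus")
assert not joinable("CA1", "CA3")  # siblings: no containment path
```

This is exactly the "join through a relationship" pattern: the match happens one or more part-of hops up, not on the terms themselves.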
And I advise you that if you are going to start building an ontology, you should in fact seek advice and do some research before you do so. But I will tell you my view on how to build ontologies, because I have the platform. The very first thing: the name "ontology" itself gets a lot of sneering, because a true ontologist will look at something like a taxonomy, which is a simple classification, and go: that's not an ontology, that's just a taxonomy. In fact, there are different levels at which one can operate. There are simple controlled vocabularies, which say: here's my list of things and you must use these terms. That tends not to go over well in a broad information space. There's the lexicon and thesaurus, which we use a lot: here's a term, here are its lexical variants, here are synonyms, here's everything that goes with it, here's a definition. Very useful. There are taxonomies and hierarchies, and a lot of people balk at these, but in fact we use hierarchies all the time when we're searching for things. It's a pain in the neck that if I wanted to search for neurodegenerative disease, I'd have to give the system a list of 45 things that might be neurodegenerative diseases. It would be a lot nicer if I could put in "neurodegenerative disease" and it just gave me a list of all the neurodegenerative diseases and searched for those. So we try to do that; hierarchical searching is very useful. But really, an ontology is distinguished by its relationships and its expressiveness: what sorts of reasoning can you do with the ontology?
And this is a nice graph that runs from weak semantics to strong semantics, where this axis is time and money, and basically, ontologies are expensive. In everything we do there's compromise in terms of how many resources we have, how much time we have, how much attention we have, and what we can accomplish, and that constraint does not change with ontologies. You do basically what you can. So, just to give an idea of how one uses an ontology: identity is important. An ontology identifies entities and classes, and those are uniquely identifiable. Best practice is really not to use names as your class identifiers, because names, again, are ambiguous. What you really want is a meaningless numerical identifier as the class name, one that can be mapped to many different human-readable labels. If we look at BIRNLex:1362 and ChEBI:29108, they're both labeled "CA2", but one is clearly a molecule, because that's its parent, and the other is clearly a brain region, because that's its parent. So, two different identifiers even though they have the same label. It becomes very awkward to try to give unique labels to everything, and that's why these identifiers are very good. It's also, as I understand it, good computer science practice. Every class in an ontology should have a definition, those definitions should be instantiated in the relationships that you have, and there should also be a human-readable definition. Two things go inside a definition: the genus, A is a type of B (a type of cell, anatomical structure, cell part), and then the other relationships that differentiate it from the other members of the class. If you look at those relationships, a machine ought to be able to derive the same definition that you do as a human. But then there's the implementation: how is this definition expressed?
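Before getting to implementation, the label-versus-identifier point can be sketched in a few lines of Python. The entries are simplified for illustration (the real BIRNLex and ChEBI records carry much more than a label and a parent):

```python
# Two classes share the ambiguous human-readable label "CA2", but each
# has a unique, meaningless identifier plus a parent class that fixes
# what kind of thing it is. (Entries simplified for the example.)
ontology = {
    "BIRNLex:1362": {"label": "CA2", "is_a": "brain region"},
    "CHEBI:29108":  {"label": "CA2", "is_a": "molecule"},
}

def resolve(label):
    """Return every identifier carrying this human-readable label."""
    return sorted(oid for oid, e in ontology.items() if e["label"] == label)

# The label is ambiguous: two hits...
assert resolve("CA2") == ["BIRNLex:1362", "CHEBI:29108"]
# ...but each identifier is unambiguous once you read its parent class.
assert ontology["CHEBI:29108"]["is_a"] == "molecule"
```

The design choice is the one the talk names: identifiers carry identity, labels carry readability, and the two are mapped rather than conflated.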
And you can say more or fewer things. There are different languages that are capable of expressing different things about a concept. Two standards that are used a lot are OWL, the W3C standard that stands for Web Ontology Language, and RDF, the Resource Description Framework. They're both ways of expressing some of these things, but just as different computer languages let you do different things, it's the same with ontology languages. So there's always a question of how much semantics you need and what you're going to do with it. You can see that we have XML, which in and of itself is not that concerned with meaning; there's nothing in the structure that lets you derive a whole lot. RDF is based on XML, but it's used to represent knowledge in a distributed world. It's designed really for knowledge, not data, and the essential unit of RDF is something called a triple. You break knowledge down into a subject, a predicate, and an object: a Purkinje neuron has neurotransmitter GABA; subject, predicate, object. RDFS, which is RDF Schema, is a way of specifying metadata about RDF; it puts some structure on top of it. And then we have the Web Ontology Language, OWL, which is a more complex and powerful extension of RDFS. There is a query language called SPARQL that allows you to query these triples the same way you can query SQL databases. This is really the foundation of what's called the Semantic Web: the idea that you can express things inside of these triples, the triples form very large graphs, and you can query them. Essentially, Leonard Nimoy and all his properties represent a set of triples about Leonard Nimoy. So if we think about increasing semantics, you see the relational model here draws a relationship between these two, but there is no obligatory relationship about what this is.
It's an arrow. We can supply relationships, but the computer itself does not know what the arrow means; it's just a relationship. So in a relational model you can ask: find me all mice that have been used in an immunolabeling protocol that uses a confocal microscope. But if you want to find everything that uses a confocal microscope, and there may be something else, another person, that uses a confocal microscope, the relational model can't go back up the tree, whereas in RDF and OWL you can go back up the tree and find these sort of anonymous classes. So it's the expressiveness of OWL that allows for more powerful semantics. But that also means you have to build a lot of this in and structure it properly so that you can take advantage of that expressiveness. There are various restrictions that you're allowed to put in. You can do this a little bit with RDF, but it's harder. For example, a statement like "the thalamus projects to the cortex in mammals" can be made a universal restriction: if a mammal has a cortex and a thalamus, then the thalamus must project to the cortex, and you can infer that without ever having to state it, because you've built in the rules. You can have an existential restriction, which says the thalamus projects to the cortex in at least one mammal: there is some cortex that receives a projection from some thalamus, but it makes no statement about whether this always has to be true.
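Backing up to the triple model for a moment, the subject-predicate-object idea and the SPARQL-style pattern match can be sketched without any RDF library at all. This is a toy, assuming invented facts; a real store would use rdflib or a triple-store server, and real SPARQL is far richer than this single pattern.

```python
# A triple store in miniature: knowledge as (subject, predicate, object)
# tuples, queried by pattern matching the way SPARQL matches graph
# patterns. None plays the role of a query variable.
triples = {
    ("purkinje neuron", "has_neurotransmitter", "GABA"),
    ("purkinje neuron", "located_in", "cerebellum"),
    ("granule cell", "has_neurotransmitter", "glutamate"),
}

def query(s=None, p=None, o=None):
    """Return all triples matching the pattern; None matches anything."""
    return [(ts, tp, to) for ts, tp, to in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# "Which neurons use GABA?" -- the subject is left as a variable.
gaba_users = query(p="has_neurotransmitter", o="GABA")
assert gaba_users == [("purkinje neuron", "has_neurotransmitter", "GABA")]
```

The same pattern-with-variables idea, chained across many patterns at once, is what a SPARQL query does over a large graph.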
Okay. And there are things like disjointness: a member of one class cannot simultaneously be an instance of another class. If I am a vegetable, I cannot be an animal. The reason these rules are important is that there are reasoners. After you put all your rules in, a reasoner will come and classify everything appropriately according to the class it belongs in, so it becomes a very powerful data structure. For example, if you were segmenting, like you saw in EyeWire, and you said that if there's a presynaptic component there must be a postsynaptic component, and somebody put in only a presynaptic component, it would say: I'm sorry, if you said this is a synapse it needs to have both, and you need to provide both. So you can do a lot with these, but many scientists are of course a little uncomfortable, because these rules are very prescriptive, and very rarely do you get to use existential or universal relationships. But this shows, for example, how we make use of the classification property in the NIF cell ontology. I have something called a neuron, and neurons come in many different classification schemes: there's a spiny neuron, a cerebellum neuron, a cerebellar Purkinje neuron (which is a type of neuron), a principal neuron, and a GABAergic neuron. Using appropriate rules and a reasoner, I can run the ontology with a set of rules that says: a cerebellum neuron is a neuron whose soma lies in the cerebellum; a principal neuron is basically a projection neuron; a GABAergic neuron is a neuron that has GABA as a neurotransmitter; and a spiny neuron is a neuron that has spiny dendrites. I've asserted relationships for my Purkinje cell: I said it's a neuron, and I said it has a cell body that lies in the cerebellum.
It's a projection neuron, it uses GABA, and it's spiny: it has spines on its dendrites. When I run my classifier, it says my Purkinje cell is a member of all of these classes without me having to assert it. I basically give it the properties and it puts the cell under the appropriate classes. So you can imagine, when you're trying to keep track of all the different hierarchies things can belong to, having this very compact way of generating many possible hierarchies is very powerful. In terms of building ontologies, the question is usually: well, I want my own ontology. This is a very common thing that people want to do. They know their data, they know their model, they want to express it. They're like: all I need is this, I don't want this complexity. But in this web-facing world, when you do something custom like that, you make it very difficult to integrate across sources, because matching ontologies and understanding what group A and group B each mean is very difficult; it is an open area of research. You do far better when you reuse a set of concepts from an existing ontology. Even if you say different things about them, at least we can use that ontology identifier as a key that helps us aggregate data together. What allows the web world to work, and what allows distributed data to work, are these sets of common keys. And so you may have heard of linked data.
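The Purkinje-cell classification just described can be sketched as a toy reasoner. This is only an illustration of the idea, assuming simplified property names; a real OWL reasoner works over logical axioms, not Python predicates.

```python
# Reasoner-style classification: class membership is computed from
# asserted properties rather than asserted directly. (Rules and
# property names are simplified for illustration.)
rules = {
    "cerebellum neuron": lambda n: n.get("soma_location") == "cerebellum",
    "principal neuron":  lambda n: n.get("projection") is True,
    "GABAergic neuron":  lambda n: n.get("neurotransmitter") == "GABA",
    "spiny neuron":      lambda n: n.get("spiny_dendrites") is True,
}

def classify(neuron):
    """Return every defined class whose rule this neuron satisfies."""
    return {cls for cls, test in rules.items() if test(neuron)}

# Assert only the properties of the Purkinje cell...
purkinje = {"soma_location": "cerebellum", "projection": True,
            "neurotransmitter": "GABA", "spiny_dendrites": True}

# ...and the classifier places it under all four classes automatically.
assert classify(purkinje) == {"cerebellum neuron", "principal neuron",
                              "GABAergic neuron", "spiny neuron"}
```

This is the compactness the talk points at: one set of asserted properties generates membership in many hierarchies, none of which had to be stated by hand.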
You may have heard of the Semantic Web. It's often prescribed as a sort of miracle for data integration, but if you use your own identifiers and your own custom relations, it's no more integratable than a database schema. It's when you use common identifiers and common relationships that we can build out this graph. So NIF always favors reuse of community identifiers rather than minting our own. But where we differ from many in the ontology community is that we let you say whatever you want about those identifiers. It's basically a set of building blocks that you can use to say things. And to echo what we said about SciCrunch, we in fact want you to do a lot of different things, because if you do, we can learn about the entities themselves without people having to spend a lot of time in rooms arguing about them. The other advantage of using shared building blocks is that when you use different ontologies, you get different value. For example, if I pick the IP3 receptor from the ChEBI ontology, then I automatically inherit everything that ChEBI says about it; they talk about structure and all kinds of things that I don't care about, but I get all of that for free. (Oops, that's my timer.) So there are resources coming about that are using ontologies much more robustly. I mentioned the Monarch project, which is designed to match animal models to human diseases based on phenotypes. They take NIF's data ingestion, but they do much deeper modeling of those individual sources. Here you see a little movie of it where you're entering phenotypes; this is a phenotype analysis tool. So I'm putting in bradykinesia.
I'm putting in dementia, and then I'm putting in tremor, and those of you with some clinical background know what this is going to return. When I say search, it comes back with a list of things that have those particular phenotypes, only there's a very rich semantic model behind it. So it doesn't just get bradykinesia and dementia, which you can see over here are characteristics of Parkinson's disease; as we go down the list, you'll see cognitive impairment, slowed movement, and abnormal motor activity returned based on semantic similarity. It's a very powerful tool for matching animal models to the human condition, where oftentimes the vocabularies are different and the exact symptoms might be different. We very rarely say an animal is demented, but we will say it's cognitively impaired. It will give you a list of humans, mice, zebrafish, and also Drosophila that share those phenotypes. So if you're looking for a model that recapitulates a lot of different things, it uses an algorithm to calculate how similar those things are. You can do a lot with this rich semantic modeling, and we built it on top of NIF rather than at the beginning, because again, it takes a lot of time and effort. I just want to let you know there are a lot of ontology tools and services available. There's something called BioPortal, which has 300 different ontologies. NIF maintains a full suite of web services for autocomplete and other sorts of things. There's the OBO Foundry and various tools, and INCF itself has a program. So my plug is: you can enhance your tools and annotation with community ontologies, and it's reasonably easy to do so. I don't have too much time to go into NeuroLex, but I just want to point out that we do need a way to engage with the community about these ontologies. This is domain knowledge.
Nobody has enough knowledge to do all this themselves. So we've exposed the NIF ontologies mostly through something called NeuroLex, which is a semantic wiki. Unlike Wikipedia, which is just text, a semantic wiki has pages that are related through formal properties. Essentially, it follows the RDF model, where you have page, relationship, page, and you can use this to assemble quite a large knowledge base. If you look at NeuroLex, you see pages with properties associated, for example, with neuron types. These things are filled in, and every blue link links to another category page, so you can use rule bases on top of it. First of all, it links to the NIF data. But you could, for example, use a definition such that if you fill in GABA as the neurotransmitter for a neuron, that neuron automatically appears on the GABAergic neuron page. You don't have to assert it; you just fill it in and it's calculated automatically. So it's a very good tool for learning about structured and semi-structured knowledge, and it's also turning into a fairly significant knowledge base for neuroscience. There are those in neuroscience, and some of this is getting a lot of traction, who believe that we all ought to be writing in triples and structuring our knowledge as graphs. There have been various projects: Gully Burns's KEfED project is doing structured protocols; Alcino Silva is doing ResearchMaps, which is a variant on all of this; and then there are Semantic Web approaches; you can see some of the presentations from last year. There's a lot of debate as to who should do the structuring, how it should be done, and whether neuroscientists are good or bad at it. But these ideas have been around for a while, the tools are getting a lot better, and if we're thinking about how the technology platform for disseminating science needs to exist, this should be strongly considered. I have my doubts whether scientists actually can do this. I also have my doubts that writing in triples is going to be adequate to express all neuroscience information, because even relatively simple statements end up taking huge graphs. But I just wanted you to be aware that there are those who are pushing this. And I will skip over the red links. Yes; yes, I think the tools will get better. But I wanted to finish up with one thing I call identifiers in action. A lot of what I say about annotation and these sorts of domain knowledge seems very abstract, and it has been very difficult for many people working in this space, not just me, to engage or interest the neuroscience community in using these unique identifiers and these shared things to help link information together. So we wanted to come up with a demonstration project, one that would also help NIF in its mission of identifying research resources, that would illustrate the power of these unique identifiers. These unique identifiers are not just being given to general concepts like cerebellum; they're being given to individuals. How many people have heard of ORCID? Okay, so this is the unique ID that's now given to authors. If you're a scientific author, you go and register, because you know how difficult it is, especially if your name is Sam Lee, to figure out which papers belong to you. A lot of universities and a lot of journals are now subscribing to ORCID, so everybody should get their ORCID ID. And basically, if you want to develop a system of credit and attribution for data and tools, we need to know who you are. This cannot just be attached to your papers; it can be attached to data, tools, or anything else.
There's a lot of work going on there. We were doing the same thing with research resources. In one of NIF's early projects, we were trying to use text-mining routines to identify the resources used in the literature; not just software tools, but things like antibodies and genetically modified animals. It turned out to be extremely difficult to do. The main reason was that authors were not providing enough information to unambiguously identify what it was they were using in their papers. The second was that it was behind a paywall, so you couldn't get at it, because it was in the materials and methods. And this is rather critical, because as you've heard many times today, these tools are imperfect: antibodies are imperfect, animal models are imperfect, our software tools are imperfect. But we have no good way of tracking back and forth in the literature when a problem arises, other than our very archaic system of citation of articles. So if, for example, you're reading a paper about FreeSurfer, and three years earlier (or three years in the future) somebody had said, hey, something's the matter with FreeSurfer, there would be no way to know that. There's no alert service or anything else. We wanted to address that, and since we had these big registries, we said we should use them to address the problem. And here's what the problem really is. I mentioned that the web is a fluid thing; so are product catalogs. You might see something like this: "I used the monoclonal antibody against actin from Sigma-Aldrich." I go to Sigma-Aldrich, and there are 40 such antibodies.
And if I go to the catalog every year, it's a different subset of antibodies, so there's no way for you to know which one anybody used. The cure for that was something we called the Resource Identification Initiative. This really relies on these registries: a comprehensive enough source of these things that you can assign them unique accession numbers. Between the NIF registry for software and databases, the Antibody Registry, which has over two million antibodies, and all of the integrated animal model databases, we said we wanted to attack this for a subset of resources to see how it might work. The design of the project was to have a proof of principle: what infrastructure would be needed, could authors perform this task, and would authors perform this task? We did not ask our neuroinformatics friends to mark up papers; we didn't want anybody who knew anything about it. We wanted to see whether the authors themselves could do it. This is being run through an organization that I lead called FORCE11, the Future of Research Communications and e-Scholarship. It's of great interest to the Blueprint, because they want to track impact, who's using these resources, and it's of great interest to a lot of different people working in the area of reproducibility. So we got a coalition of publishers together, and we designed a pilot project that said: we're going to have authors identify software tools and databases, antibodies, and genetically modified animals, just three entities, and they're going to put in a unique identifier, an RRID, with an accession number. It's voluntary, because a lot of journals were afraid they would push authors away by requiring it. There was flexibility, and we made it very simple, so that journals didn't have to modify their submission systems, because that turns out to be a really difficult thing to do; it took two years to get ORCID IDs in there. We created a portal, because it turns out that if you want to get
these identifiers, you have to go to about ten different databases. So we unified them all together using SciCrunch; there's one portal. We established a help desk and also made it very easy to find the citations, so you can basically copy the citation and put it into the paper. This has been running since February, and we're very pleased that these RRIDs are actually appearing in the literature. If you go to Google Scholar, search "RRID", and select "since 2014" (because "RRID" turned out not to be a unique string), you will actually see these RRIDs appearing in the literature; Google gets access to the materials and methods long before anybody else does. Over 100 articles have appeared in 15 different journals, and there are over 800 RRIDs. If you put in a specific RRID, you can get a list of all the papers that use that resource; these are all the papers that use that particular antibody. So we learned several things from this. We learned, first of all, that authors could do it: they were about 96 percent accurate in identifying the things they used. We learned, by and large, that they would do it: it was voluntary, but we got a very high compliance rate and nobody complained. We also noticed that it drove population of the registries: people were adding software tools, databases, and antibodies to the registries. We know that about 10 percent of the identifiers disappeared, because during the copy-editing process editors would take them out; that's often outsourced to different places.
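As an aside, part of what makes these citations machine-findable is that they are regular strings. A hedged sketch of spotting RRID-style citations in a methods section follows; the pattern is an approximation for illustration, not the official RRID syntax specification, and the identifiers in the example text are made up for the demonstration.

```python
import re

# Approximate pattern for RRID-style citations such as "RRID:AB_123456"
# or "RRID:SCR_001847" (a registry prefix, underscore, accession).
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+_[A-Za-z0-9]+)")

def extract_rrids(methods_text):
    """Pull candidate RRID accessions out of a materials-and-methods text."""
    return RRID_PATTERN.findall(methods_text)

text = ("Sections were stained with anti-actin "
        "(Sigma-Aldrich, RRID:AB_476744) and analyzed in "
        "FreeSurfer (RRID:SCR_001847).")
assert extract_rrids(text) == ["AB_476744", "SCR_001847"]
```

This is the machine-readability contrast the talk draws with free-text catalog descriptions: a one-line pattern recovers the accessions, where "monoclonal antibody against actin from Sigma-Aldrich" recovers nothing unambiguous.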
So those identifiers had to be put back in. And about 14 percent were false negatives: things that authors should have identified but didn't. There's a lot of work going on at NIH right now, and in other places, to try to develop a system of data citation, and citation for all of these other research objects that one produces. There are a lot of questions about how it should work and whether it will work, and so this was a very important pilot project, providing data that says yes, authors will consider alternative forms of citation, and the journals will consider alternative forms of citation. One thing about this is that there was no machine actionability; this was just a string. But one could easily layer a lot more information on top of it. There are still a lot of questions about what the appropriate identifier systems are (should they be DOIs, should they be something else?) and which entities should be identified. But I think it does suggest that a system for citing data and citing these other research products is in place, and that when you know you are citing something, you start adding things to the registry to make sure that people know what's there. So I think it's a very important pilot project, and one that's pointing the way towards a new way of referencing things inside of text, one that is much more machine-processable than the current system we have. So let me conclude with a few thoughts. One is that the landscape is messy, diverse, and evolving; technologies are evolving, things are evolving.
We don't know what the next best technologies are. Scholarship survives all kinds of new things coming and going. What we're really proposing is that there be some protocols that operate on top of this changing technological landscape and provide some stability, so that we can forward- and back-reference. We don't think that everything is suitable for all things. Some things, for example, are best represented as an ontology, some things are best represented relationally, some things are best represented in RDF, and some things are best referenced as text. It's really a matter of getting these things to work together and allowing them to cross-reference each other, rather than overloading everything into one form. There's a lot of effort going on through INCF to provide a more unified commons that allows these things to be exposed and used, building on NeuroLex. There are discussions going on later in the meeting about what such a thing would look like, but it's clear that, again, there's a space for putting your data.
There's a space for talking about the knowledge, there's a space for storing your protocols, and we need to be able to put these things together in reasonable ways. So if we look at the very messy landscape that we currently have, what people are really proposing is to develop an ecosystem for research objects, not defined in the very specific, narrow sense that some people use, but in the sense that we're producing data, code, blogs, and other sorts of things. When we have an appropriate set of identifiers and a set of protocols that allow those to be used and linked, then we can start putting things in different places and pulling them all together in ways that we currently cannot. And the important thing is that these are persistent identifiers. A lot of people are in the habit of putting up their supplemental data and their own data files. But one of the things that allows DOIs to work for articles, even though we get some that are broken, is that when somebody becomes a scientific publisher and wants to be listed in the indexes, they guarantee that their links will be persistent and their identifiers will be persistent; they will not take them up and down. Unfortunately, when you're a graduate student, or even a researcher, you cannot guarantee that, because you're going to move from place to place. So the idea that you would use formal repositories, places that have guaranteed this, is I think a very powerful one, and it says that you should be sharing your data through these formal repositories, of which there are hundreds. So, just to conclude: through the NIF project I've taken a much more global view of data. When I first started, everyone said, I have the tool, and when you put them all together with all the tools, you're one of many, right? So I like to say that my view has really changed from "our data" to "many data". There's many data, and the generation of data is getting easier.
There's shared data, so the data space is getting richer. There are many -omes, but compared to the biological space it's still rather sparse; I think we still don't have a lot of data in there. Many eyes: we've seen crowdsourcing, we've seen ways of making use of the labor of the crowd and the wisdom of the crowd, but we need to recognize there's more than one way to interpret data, and when data and tools get out there, we can start to use them more effectively. Many algorithms and many analytics: there are signatures in the data not directly related to the question for which the data were acquired, but they can tell us something really interesting. That's why there's so much emphasis on analytics right now. What those signatures are, I don't know; all I know is that I'm having a lot of fun looking at all of this data and looking for these sorts of trends, because we've never been able to look at it like this before. So that's basically it. And there's FORCE11; you should all join it. Okay, and I want to thank my colleagues.