So, first of all, just to emphasize: the process of science is a cyclic process. Typically, scientists start off with knowledge that they have in their heads, that they can think about. They then analyze what they know and try to come up with scientific questions that are valid and important. For a question to be a valid scientific question, it has to be translatable into an experimental protocol: you have to be able to test your question and get an answer scientifically. Once you have your experiment designed effectively, you execute it, and the process of getting the data is relatively straightforward. Then, having done that, you interpret the data to build it back into knowledge. There are two things about this that I think are important for this community. First, our job as neuroinformaticists is to speed up this process, to get the cycle to execute more rapidly for the people who are actually doing the scientific work. That could be us too, but mainly the scientists in the lab who are doing this work. Second, the hardest part of this process, I think, is taking the knowledge that you have and then automating, facilitating, or otherwise helping the scientist come up with a really good question.

One aspect of this model that is key is the idea that you separate out the interpretations, which occur on the left-hand side of the cycle, and the observations, which occur on the right-hand side. The reason for this is that the people who are very good at dealing with interpretations are people like this guy, Professor Honeydew: someone who has been steeped in the language and knowledge of the subject for 20 years, who really holds the secrets in the back of his head, stuff that no computer program could represent very effectively until we build an AI system that can do science. The processes of taking an experimental protocol, running it to get data, and then analyzing the data through statistics, well, that's more Beaker's territory; he can probably handle that. In the knowledge engineering systems that we develop, we want to emphasize the observational side, which is important because we want to avoid interpretive knowledge creeping into our representation. One of the aspects of the work that I'll talk about is how we say: okay, we're just going to look at scientific observations, and only that.

There's been a lot of talk, a lot of presentations have been referring to big data, and big data looks like this: human neuroimaging, protein-protein interactions, gene expression, all of that stuff, where you have machines that can read data from an experimental measurement at scale, and you get a huge amount of information. But the fact is that if you go to the Society for Neuroscience, most of the people presenting in those huge poster sessions are presenting data like this. This is a cartoon of what a typical data slide would look like in a presentation. You have a scientific statement across the top that might say something like: when presented with a cat, mice freeze and this part of the brain activates. But then the scientist will substantiate that by presenting the actual data that supports the assertion across the top.
The important key for an informaticist to notice is that there is a relationship between the dependent variables (the measurements that are being made, the assays that actually indicate the presence or absence of a feature) and the independent variables that determine how you parse up the space and set up your experimental protocol so that you can actually reveal what's going on. Of course, going back to our idea that interpretation is the domain of Professor Honeydew and this is the domain of Beaker, we can probably deal with this. So let's ask the question: how do the dependent variables and the independent variables relate in this kind of representation? We have a system called Knowledge Engineering from Experimental Design, or KEfED, a name that was deliberately chosen to allude to Kevin Federline because it's funny, might elicit a laugh from people, and is memorable in that respect. It's very simple: it consists of the idea that you have these various different elements as parts of your model, and then you draw out a scientific protocol using those elements. I've actually presented this to students at USC and asked them: could you do this? If you were faced with the problem of trying to represent data in a protocol, could you draw this out? And most of them raised their hands and said yes. So the question is, how do we relate the measurements to the parameters? Well, when you draw it out like this, it's actually relatively simple. Measurement 1 is going to depend on the various different parameters that lie on the branch of the protocol it sits on, on the pathway going back through the protocol. It's a very simple, provenance-based approach, but it's very powerful, and this is the kind of secret sauce that I think really is the main contribution of this methodology.
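To make that provenance lookup concrete, here is a minimal sketch in Python (not something from the talk, just an illustration) of how the parameters a measurement depends on might be collected by walking back through a protocol graph; the step and parameter names are entirely hypothetical.

```python
# A minimal sketch of the KEfED provenance idea: a protocol is a small directed
# graph of steps, and a measurement depends on every parameter that lies on the
# pathway leading back from it. Step and parameter names are invented.

# For each protocol step, the step(s) that feed into it.
predecessors = {
    "assign_groups": [],
    "apply_treatment": ["assign_groups"],
    "run_assay": ["apply_treatment"],
}

# Independent variables (parameters) attached to each step.
parameters = {
    "assign_groups": ["group"],
    "apply_treatment": ["drug", "dose"],
    "run_assay": [],
}

def upstream_parameters(step):
    """Collect every parameter on the pathway leading back from `step`."""
    seen, stack, found = set(), [step], []
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        found.extend(parameters.get(current, []))
        stack.extend(predecessors.get(current, []))
    return found

# The measurement made at the assay step depends on all of these parameters:
print(upstream_parameters("run_assay"))   # ['drug', 'dose', 'group']
```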
Now, having done that, that gives us a powerful tool, and I'll use a couple of examples that are not to do with neuroimaging to illustrate why we came to this modeling approach. Consider the problem of neural connectivity, which is probably quite familiar to a lot of people here. If you were to think about the kind of knowledge representation you would use to describe what a connection is in the brain, a macroconnection between two brain structures, it would probably need to be a connection with an origin point, a termination point, and a strength. That would give you a representation of a connection from one part of the brain to another. But if you're actually thinking about how you would build a database to represent that data, you probably want something that looks more like this. This is a cartoon of the data from a tract-tracing experiment, where someone makes a tracer injection of something like HRP or PHAL into a part of the brain and then finds labeling. The point is that if you are building a representation of the data, the connection is the piece of the puzzle that you want to process and use, but you have to derive it from the raw tracing data; otherwise, you have no idea whether or not it is accurate. And so the way we actually do this within the KEfED model is that you draw out your protocol like this, with the various different observational measurements and parameters. So at the injection step, what kind of chemical are you injecting and where do you inject it? And then down here at the labeling step, what are the measurements that the labeling actually provides you with? Once you've done that and you've built the model, and this is the first implementation of the KEfED modeling approach that we built, you have a summary table, which in this case is a large-scale connection matrix actually derived from the limbic system in the rat. If you click on an individual point in this matrix, you get a summary of all the evidence from the various different experiments that support that data point. Okay, so this is a general-purpose approach that could actually be used in pretty much any kind of discipline.

Making things a little more complicated, let's go on from neuroanatomy, which is a little more tractable and regular, to something that is generally quite flexible, neurophysiology. Consider an experiment on gene expression that a colleague and friend of mine, Arshad Khan from UTEP, did: he infused a neurotransmitter into the hypothalamus and then looked for gene expression in other parts of the brain. This is a fairly standard protocol. I'm not going to explain it in detail because we don't really have time, but this is an accurate representation of what they did at an intermediate level, not going down to the really deep representation of every individual step, but a high-level view: now we did immunohistochemistry, now we did in situ hybridization, that kind of thing. If you look for the output of this experiment, it's the signal intensity of an in situ hybridization study. If you take this point down here and then trace back through the protocol, as I've shown before, the relevant data that really describes the output of this experiment, the crucial piece, is this relationship: he infused norepinephrine into the hypothalamus, and when you compare the gene expression of phosphorylated ERK against the vehicle control, you find an effect. So in fact, this pair of tuples provides an accurate representation of the outcome of this experiment, based upon the variables and the measurements that were made. In that respect, it's a very powerful approach.
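As an illustration of what that pair of tuples could look like as structured data, here is a small hypothetical sketch in Python; the field names and values are paraphrased from the description above, not taken from the actual study record.

```python
# Hypothetical sketch of the "pair of tuples" idea: one record per condition,
# pairing the independent-variable values with the measured outcome.
# Field names and values are illustrative only.

treatment = {
    "infused_substance": "norepinephrine",
    "infusion_site": "hypothalamus",
    "measurement": "phospho-ERK in situ hybridization signal",
    "signal_intensity": "elevated",
}

control = {
    "infused_substance": "vehicle",
    "infusion_site": "hypothalamus",
    "measurement": "phospho-ERK in situ hybridization signal",
    "signal_intensity": "baseline",
}

# Together, the two records capture the outcome of the experiment: the effect is
# the contrast between the treatment tuple and its matched control tuple.
outcome = {
    "comparison": (treatment, control),
    "effect": "increased expression relative to vehicle",
}
print(outcome["effect"])
```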
Okay, and just to really push that point and emphasize how expressive this approach can be, we did some work in immunology, so this is not neuroscience, looking at studies of macaque vaccines against HIV. This builds on a really fantastic database from Los Alamos National Laboratory, which is a manually curated stack of about 700 or so studies, from a whole bunch of different papers, that looked at the effects of vaccines against HIV. The data that comes out of that consists of assertions that look like this. So this is the interpretive statement across the top of the graph, this is the graph that it's referring to, and this is actually a graph from a paper. And you'll notice that this is hideous. This is talking about percent-specific lysis, which is a measurement of the effectiveness, basically the presence or absence, of an immune response in a specific assay. I'm not going to go into the details of where that comes from, but just take a look at that and imagine you were a graduate student in a journal club and you had to read this paper, make sense of it, and present it to your peers. You'd probably be sweating bullets at that point, right?

Now, if you have in your back pocket a KEfED model of the entire experiment, which is quite complicated (basically, they take the animal, vaccinate it over these steps, expose it to HIV here, and then do follow-up, and then you see how well the animal's immune response copes with that), the actual data represented in this graph is this measurement at this point. And if we do what the KEfED model tells us to do, go back through the provenance of the protocol and keep track of each one of these different parameters, then magically, every single parameter in our model is actually represented on this graph as a labeled axis. And this is something we only saw after we curated the model and built it; it wasn't something we deliberately tried to code. So I feel that this illustrates, it's not proof, but it's illustrative of how powerful this model can be in terms of providing a knowledge representation that can capture data from any kind of experimental design.

Now, having said that, it's not quite as simple as that. Naturally, what this does is provide you with a framework for capturing the primary measurement from an experiment. As soon as you take that primary measurement and do data processing on it, it transforms the signature of the data; it's kind of obvious if you think about it for a second that data processing transforms the structure of the data. But what we think we can do, and we haven't fully built this yet, is track the way in which the various different data points get consumed and transformed in the course of those data transformations, using the kind of approach that people use within workflows as a way of capturing and representing the underlying data. So, you know, that's an assertion and a claim that I'm making; we haven't actually filled that out yet, but I think this is where we want to go. The idea is that we want to use the KEfED formalism to provide a structured representation of primary measurements, mean values, statistical effects between groups, and of course correlations. And we think this is a formalism that can be used to standardize the approach, as nanopublications, as RDF, eventually down the line, in a very structured and reasonable way that scientists can understand. Okay, and that's the key thing. We want informatics tools that allow a scientist to look at one of these protocols and say: oh, I can see how this works, I can understand this, I can actually build a protocol that could then be part of a system.
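As a rough illustration of the nanopublication direction just mentioned, here is a hypothetical sketch using the rdflib library that encodes one between-group effect as RDF triples; the namespace, predicate names, and values are invented for the example and are not the actual KEfED or OoEVV vocabulary.

```python
# Hypothetical sketch: encoding one between-group statistical effect as RDF,
# the kind of assertion a nanopublication might carry. The namespace and
# predicate names are invented, not a real published vocabulary.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/kefed/")   # hypothetical namespace

g = Graph()
effect = URIRef(EX["assertion/1"])

g.add((effect, EX.dependentVariable, Literal("percent specific lysis")))
g.add((effect, EX.groupA, Literal("vaccinated animals")))
g.add((effect, EX.groupB, Literal("unvaccinated controls")))
g.add((effect, EX.direction, Literal("groupA greater than groupB")))
g.add((effect, EX.pValue, Literal(0.01)))   # illustrative value only

print(g.serialize(format="turtle"))
```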
Okay, so let's move from that into neuroimaging. That was the basic introduction to the idea. Of course, within a specific domain, we want to keep track of individual variables that make sense, that are standardized, and that are useful to people in the community. Jess and I wrote a paper, which is in press at the moment at NeuroImage, so it's been accepted and is actually available online, that describes this thing, the Ontology of Experimental Variables and Values, or OoEVV, a name that deliberately doesn't sound like the name of any other ontology, because we're kind of playing around with that. And it's actually available on BioPortal. The idea is that it's not a fully-fledged ontology, it's an ontology design pattern, which means it's a lightweight capturing of the various different classes needed for a specific task. The goal here is simply to capture the data semantics. So the idea is that, for a specific experiment, you want to define a specific experimental variable that measures a well-defined quality that could be defined elsewhere in the ontological world. Someone else defines a quality such as handedness or hair color or gender or suchlike, and you want to be able to define a variable that measures that quality. And we've made a distinction by saying that the variable is separate from the measurement scale that determines the type of data that goes into that variable; we'll go through a couple of examples that illustrate what that means. Then each measurement value has a scale associated with it that determines how you use that data and how you integrate data across these various different things. As for the types of scales that we have, this is described within the paper, so please feel free to refer to that, and it's actually quite simple; we're not trying to do anything particularly fancy here. We say: okay, well, when data is true or false, it's a binary scale. If we want to give the true or false values names, then we use the binary scale with named values. And we make this as complicated as we need to, going down to things like hierarchical scales, and relative scales where you want to define something in relation to other things. Essentially, the idea is that this list of scales is supposed to be very general-purpose but extensible: if you need to invent a specific type of data point that is completely unique to your tool and your system, then feel free to do it, that's fine, and we're factoring in methods that allow you to extend the list without too much difficulty. Once you have your variables and your scales based upon those very low-level representations, you want to put your variables together. In the paper that Jess and I wrote, all we do is take one specific study from 2005, an auditory oddball task which is designated in CogPO, go through the variables that were used in that study, try to classify them, and put them into this schema so that we could then build representations of the underlying experimental design. And you can see that here are some of the variables that were used. An important one here is the global rating of severity for hallucinations, which comes from a standard set of instruments that do that job. Okay, and then once we have those variables, as I said before, the variables designate scales, and the scales often relate quite closely to the variable definitions as well. We can go through the details; it's basically a curation exercise of trying to model individual variables. And the point of this is to keep it relatively simple, tractable, and not at all arcane. It's not something you have to be a trained ontologist to do; anyone should be able to do this process.
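Here is a lightweight Python analogue, purely illustrative, of the variable/quality/scale pattern just described; the class and attribute names are paraphrased from the talk rather than taken from the actual OoEVV classes.

```python
# Illustrative analogue of the OoEVV design pattern: a variable measures a quality
# defined elsewhere and designates a measurement scale; each value carries its
# variable (and therefore its scale). Names are paraphrased, not the real classes.
from dataclasses import dataclass

@dataclass
class MeasurementScale:
    name: str

@dataclass
class BinaryScaleWithNamedValues(MeasurementScale):
    true_label: str
    false_label: str

@dataclass
class ExperimentalVariable:
    name: str
    measured_quality: str          # e.g. "handedness", defined in another ontology
    scale: MeasurementScale

@dataclass
class MeasurementValue:
    variable: ExperimentalVariable
    value: str

handedness_scale = BinaryScaleWithNamedValues(
    name="handedness scale", true_label="right-handed", false_label="left-handed"
)
handedness = ExperimentalVariable("subject handedness", "handedness", handedness_scale)
print(MeasurementValue(handedness, "right-handed"))
```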
So I'll finish by talking about this concept of liquid networks. I'm incredibly encouraged by the discussions that we've had in this workshop, because everybody seems to be living on GitHub, which makes me very happy. And the reason I say that is because, I don't know if anybody has read this book by Steven Johnson, Where Good Ideas Come From, which is a fantastic book about innovation. He basically says that we shouldn't think of the process of innovation and creation as something that involves one guy struggling with an idea in a room, staring down a microscope and coming up with something brilliant. We need to create an environment where everything's messy, where people can interact very easily, where ideas stick to each other very conveniently and straightforwardly, and where we can drive these things together. And the idea is that code and software is the place where we can do this. So within the work that we do, we have this tool, this framework called the View Primitive Data Modeling Framework (VPDMF), that allows you to design a system based upon a simple UML-based model of your data structures. You basically run a Maven command, and then VPDMF will generate ActionScript code, Java code, and MySQL database code that fully fleshes out that data model as a set of services that can talk to a client very easily. Once you've built this kind of architecture, you can very easily construct a skeleton of a fully-fledged web application, which looks like this. So the idea is that if you have a schema like this, which is an FTD, a full-text document, a PDF file, you can go from this very easily to something like this. And this is actually a web application for a digital library that we're trying to present and use as a way of doing biocuration more easily. The idea is that you load your PDF files into this, and the system then allows you to curate information simply by dragging and dropping and selecting text in the document. These are annotations drawn onto the document; you can easily add them and remove them. I'm not going to spend too much time on this.

And so the punchline, the last thing I want to leave you with, is this notion of what KEfED should allow us to do. This is the goal of what the KEfED system is all about. We want to allow you to build a database or knowledge representation of your data by using a tool such as this, where you drag and drop OoEVV elements from a catalogue onto a panel and draw out the protocol that you, as a scientist, are interested in using. The system will store that and provide you with data templates that you can then populate, which then allow you to generate RDF-based nanopublications and assertions about the kinds of values that you're looking for. So this video is sped up by a factor of four; I'm not normally this fast. You get the idea, right? The idea is that here I'm just pulling things out of this ontology store and drawing out my protocol. And if you can see it and understand it, what this is doing is basically an auditory oddball task with an MRI study, and then here I'm going to, very quickly, in about half a second, drop in a process. So you run an ANOVA process over an individual MRI dataset and you generate ANOVA results.
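To give a feel for what one of those generated data templates might look like, here is a small hypothetical sketch in Python; the protocol dictionary and column names are invented, and this is not the output of the actual KEfED tool.

```python
# Hypothetical sketch of the "data template" idea: once a protocol has been drawn
# out (represented here as a plain dict), emit an empty tabular template whose
# columns are the protocol's parameters and measurements, ready to be populated
# and later turned into RDF assertions. Everything here is illustrative.
import csv
import io

protocol = {
    "name": "auditory oddball fMRI study",
    "parameters": ["subject group", "stimulus type"],
    "measurements": ["fMRI dataset", "ANOVA result"],
}

def make_template(protocol):
    """Return a one-row CSV header: one column per parameter and measurement."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(protocol["parameters"] + protocol["measurements"])
    return buffer.getvalue()

print(make_template(protocol))
# subject group,stimulus type,fMRI dataset,ANOVA result
```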
So that's pretty much it. Thank you for your attention. The final thing is just to mention, this is my profile on GitHub, with a little bit more facial hair. I think this is the liquid network. This is the place where we can integrate and work together and really be innovative, and create tools and systems that really transform the subject and make things available for other people. I would really like to interact with people in this medium in such a way that we can just easily get things done in a very practical way, and I think software and GitHub and open source are the way that we do that. So thanks for your attention. Thanks to these guys for helping, and obviously for the funding that made all this work possible. Thank you. Any questions? I know I'm keeping you from lunch, so my apologies.

Hi. Thanks for the nice presentation. Thank you. My question is actually a little bit different from your presentation. Are you also considering how to minimize the level of complexity that arises when computer science people interact with neurobiologists? If you are designing software for a scientist, it's sometimes very hard to work with them and to work out how we should develop for them. Yes. Or they have to understand these things...

You have hit it exactly on the head. The whole point of the KEfED system is an abstraction of the process of planning and executing an experiment, in a language that a neuroscientist can understand but also in a framework that a computer scientist can work with. So a computer scientist doesn't have to learn neuroscience to be able to understand what a KEfED model is all about; it's just data. And so I think the necessary matching function is to find the abstraction that best describes an experimental protocol and experimental data and that is accessible to both parties. So you hit it on the head; that's exactly what we're trying to do. Maria?

So, considering all the presentations I've seen in neuroinformatics, in general what I have found leans more towards what Tal presented, which is, as we like to say, big, broad, and messy beats small, focused, and shallow every time when it comes to data. So, given Tal's presentation, how do you think we balance the fact that anything you ask the researcher to do is extra work on top of what they are already doing, and they are already extraordinarily stressed for time? And if you listen to the data people, they'll say more data will give you something much better than very accurate modeling of smaller things. And I think there was a little element of that even in Angie's talk, which said: we talk about all of these things, but we really perhaps don't need to account for all of them when we are trying to interpret our experiments. So, Semantic Web, a lot of the work that people have been doing, and so on, this has been going on for a long time. Do you think the tool support is now ready enough that this will be easy, or is a balanced approach necessary?

Good question. I think the approach I take in this matter is that I am a software developer trying to help the scientists do their work. In other words, I'm never going to tell someone: this is how you should do your work. I'm going to say: go do your work, and hopefully you're going to use our tools. Do you want to keep track of your literature using the digital library? Oh look, you have the functionality to annotate your PDF files in the digital library with terms; does that help you?
And of course, if it doesn't help them, then we don't do it anymore. I think that one of the things we actually have to do as neuroinformaticists is to think about, and it's a really challenging problem actually, figuring out what is going to help the scientists think of new problems. And of course it's not just a question of coming up with the perfect knowledge representation; it's much more important to have something that's usable. This is the reason why, rather than building ontologies, I build infrastructure that generates software tools: I want to be able to generate a digital library application that anyone in the community building a web application could easily pull into their system and have functional in their tool. So to answer your question: let's not do either. Let's not build something crappy, and I mean not to demean anything that people are doing, because it's all awesome, but let's try to find the middle way of building practical tools that can scale, that are accurate, but that are also in the realm of the neuroscientists so that they can use them; and I think that third issue is the hardest one of all. That has been a central aspect of the design work I've done in building this. Thank you.

I think, also relating to this last question, that the Professor Honeydews of this world are getting more and more convinced that you need this type of processing of your data and making it available, because it enriches your data. What I would like to add, being one of these Honeydews, is that it would be very nice if these different programs could be related to each other, so that the researcher doesn't have to input his data in different formats for it to be of any use, but these different instruments can work together or can be translated into each other.

I think that's an essential component, and that's really a programming problem. I actually know that the challenge of information integration is a research problem in computer science, and NIF leads the way in terms of being able to bring together different data from different sources and integrate things. So the answer to that is: under the hood, if we work together with each other's code, take care of that, and obviously try to make things work seamlessly, then we'll have done a good job. Obviously that's not an easy thing, but, again, coming back to it, our job is to help the scientists do their work more effectively, and if we make their work harder, they're not going to use the tool. So, as a prerequisite to being able to do anything in neuroinformatics, we have to actually deliver functionality to the end user in an appropriate way.