I am Kalpana, co-founder of Metaome Science Informatics, a company that works in bioinformatics and knowledge mining, specifically in the life sciences. I am going to dwell a little on the title of my talk, which says open data in the life sciences and an open world. Open data might be a bit of a misnomer. I will be talking about linked data, but if linked data is also open, then it makes for a more open world.

Let me talk a little about the open world. An open world assumption in a data model is slightly different from a closed world assumption. The interesting thing about the open world assumption is that it stands for triple A: anybody can say anything about anything. That means you put your data out there on the web, you can connect it to existing data, and you can make whatever assertions you want. What that really boils down to is trust. Do people trust your data? Do they trust that linkage? If they do, they will use it. So whenever you put something up, you need to associate provenance with it, saying where it came from, and keep it in a graph that can be included or excluded. That is what I mean by an open world modeling system.

I got a little ambitious here. Because not everyone here works with linked data or the semantic web, I am going to give a very brief introduction. It is very superficial, but that is all I can cover in the time. Then I will talk about our efforts in the life sciences and the kind of challenges we face, and finally show you a demo of our product.

So if I put the word Mercury out at you, what does it mean to you? What comes to mind? Planet. Poison. Okay, metal. Anything else? Version control. Version control, exactly. Anything else? Thermometer. Thermometer, correct. Okay. So it means different things to different people, right?
So if you are a connoisseur of art or a mythology fan, you would think of Mercury as a figure in mythology, and there are lots of art sculptures that depict Mercury. Or it could be the planet, as many people said. It means something different again in the context of an astronomer, or of astrologers, who talk about something in Mercury and something in Venus and so on. Then again there is temperature, or the element, which is mercury as a metal, and then, you know, Mercury could be a power as well. So it really depends on what you are talking about. And here you have your version control as well. If you were a developer you would think of version control.

This is a quote that I really like. When you look for something, or you have words, you usually associate them with your own context, and your knowledge and my knowledge are different. But when you go to Google or any other search engine that does a keyword search, it doesn't know what your context is or what mine is. It will throw up search results based on what you ask, so what we get might be very different from what we were really looking for.

But what if you typed Mercury into your search box and you were asked: are you interested in the planet Mercury? Are you a chemist looking to study the element mercury? If you are a chemist, you might be looking at its structure; if you are a developer, you might just be looking at Mercury for version control. Suppose you were able to ask a question like that and get an answer something like this, where this little green ball depicts Mercury, and you were able to go to a geo database and get all the information about Mercury.
If you can also get spacecraft and satellite information about Mercury, and then connect Mercury back to the solar system and get more information about the solar system it belongs to, then you see the planet in its complete context, and you see the information you are looking for, because you were able to choose at the start and say, this is what I want.

Going in a little further: the solar system has an object, Mercury, which is an astronomical object. It also has Venus and Pluto. Mercury has a type associated with it, which is planet. It has certain atmospheric elements, it has a crater, and it has certain characteristics. So Mercury would have certain properties associated with it, which would be purely data, and it would be connected to other objects associated with the solar system or a galaxy. This is the type of information that is stored in the linked data space, and this is the type of information that we attempt to search for there. All this will make a lot more sense if I tell you how information is stored in the linked data space.

Information in the linked data space is stored as what we call a triple. A triple is nothing but a subject, which could be a person or an object like Mercury; a predicate, which specifies a relationship; and an object, which is another entity. For example, a person could be related to another person, a planet could be related to the solar system, and so on. This is how all data is represented in the linked data space. Here is an example: the solar system has an object Mercury, and Mercury is of type planet. These are assertions that you make about your data in that space. What this allows you to do is explicitly state a lot of things about your data. But it has a flip side: your data is extremely verbose. For example, if you came from a relational database table, every cell in a
relational database table is represented in the linked data space by a triple. So it is a lot more expressive, but it is very verbose.

There is one more thing I want to tell you here. Every entity in the linked data space, for example Mercury, to continue with our example, is stored in what you call a namespace, which is really a graph containing information about astronomical objects, and that could be called, say, astro. However, if you were looking at mercury in the chemistry space, it would be stored in another graph, and that graph would have the name chem. So if you prefix every entity with its namespace, astro or chem, you are able to identify which one it is.

You can do really cool things with these triples; that is the nice part. For example, if you say the solar system belongs to the galaxy Milky Way, it also implies that the Milky Way contains the solar system. You can actually declare that a relationship has a valid inverse relationship. To give you another example, if you say Mary is Sue's daughter, it implies that Sue is Mary's mother. So you could say that the daughter relationship has an inverse, which is the mother relationship. You can state such things in your ontology, so when you start searching your data you can pull up a lot of things that were never explicitly stated in the first place.

The other thing is transitivity. We all know that the moon orbits the earth and the earth orbits the sun, so the moon also orbits the sun; I don't have to explain that to you. You can say that the relationship orbits is a transitive property, which means that if the moon orbits the earth and the earth orbits the sun, that implies the moon orbits the sun. So these are assertions that we make in our data that
if this relationship exists then that relationship is implied. This makes the information contained in your data very expressive, and it makes your data very easy to query.

Now another really short history lesson: how did we get here? In its initial evolution the web was all about linked documents. But soon we started with the social web. It has become so much a part of us that I don't think we remember a time before there was a social web, Facebook, LinkedIn, all of that: a participatory web with blogs and people talking to people. Then what some people call web 3.0 is the web of really deep data, a web of structured data where data is connected at a very granular level. It is probably these two, that data web and the social web 2.0, the participatory web, that will dictate the technologies of the future.

The gods of the semantic web would strike me dead if I didn't show you the layer cake, because every semantic web presentation has to have a layer cake. It isn't particularly useful, except that it gives you some idea of the architecture of any semantic web application. At the very bottom is a URI or an IRI: every object in the semantic web has a unique HTTP URI associated with it. For example, the person Barack Obama has a specific URI associated with him. On top of that are the XML and RDF layers. The RDF layer is the layer of relationships that we just talked about, where every URI, every entity, is connected to another entity. Then the OWL and RIF layers are basically the logic that we talked about, the inverse relationships and the transitive relationships, and the rules we put in so that we can infer further from our data. Then there is trust, or provenance: how useful is that data? And on top of this whole stack you build user applications.

So what are the principles of linked data, basically?
These are, again, things you should start looking at if you are in that space. Everything has a URI associated with it. URIs use the HTTP protocol, so you can actually look them up. And when you have a URI, other people can link to it, and that is actually the most important part.

Then there are the 5 star data badges that people have started giving to linked data. These are standards that have evolved from the W3C: put your data out on the web; make it machine readable; use non-proprietary formats so that it can be accessed easily; use standards, and the W3C standards are usually RDF; and then link your data to other people's data. This is a sort of 5 star data rating, so if your data follows all of this it is given the 5 star badge, and people have started using this a lot.

What has been happening in this space is that people have started putting their data out there. It started as a small cloud in 2007, where there was DBpedia and some geography stuff, census data and things like that. But this kept growing and became considerably bigger, and the power guys got in here with open site that was very...
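The triple model and the inverse and transitive rules described earlier in the talk can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not how the speaker's platform or any RDF library works: the `astro:` and `chem:` prefixes follow the talk's example, while the predicate names (`astro:hasObject`, `astro:belongsTo`, `astro:contains`, `astro:orbits`) and the `infer` helper are hypothetical names made up for illustration.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# Prefixes play the role of namespaces, so astro:Mercury and chem:Mercury
# are two different entities. All predicate names here are hypothetical.
facts = {
    ("astro:SolarSystem", "astro:hasObject", "astro:Mercury"),
    ("astro:Mercury", "rdf:type", "astro:Planet"),
    ("chem:Mercury", "rdf:type", "chem:Element"),
    ("astro:SolarSystem", "astro:belongsTo", "astro:MilkyWay"),
    ("astro:Moon", "astro:orbits", "astro:Earth"),
    ("astro:Earth", "astro:orbits", "astro:Sun"),
}

# Ontology-level declarations (assumed for this sketch).
INVERSE_OF = {"astro:belongsTo": "astro:contains"}
TRANSITIVE = {"astro:orbits"}


def infer(triples):
    """Apply the inverse and transitive rules until no new triple appears."""
    known = set(triples)
    while True:
        new = set()
        for s, p, o in known:
            if p in INVERSE_OF:
                # belongsTo(a, b) implies contains(b, a)
                new.add((o, INVERSE_OF[p], s))
            if p in TRANSITIVE:
                # orbits(a, b) and orbits(b, c) imply orbits(a, c)
                for s2, p2, o2 in known:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        if new <= known:
            return known
        known |= new


closure = infer(facts)
inferred = closure - facts  # triples that were never stated explicitly
```

Here `inferred` holds the two implied statements from the talk's examples: that the Milky Way contains the solar system, and that the moon orbits the sun. Neither appears in `facts`; both come purely from the declared rules.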
As we went further, this web grew even further, so it has been kind of explosive. Right here in this space is where all the biology data sets are. Biologists have been early adopters of the semantic web, for the simple reason that they have been generating so much data that they didn't know what to do with it. They didn't know how to link it up, but they knew they needed to link it up to make sense of it.

So, the semantic web in the life sciences, which is really what we do. What is our vision of the semantic web in the life sciences? It has two parts. One is that there is a lot of data coming out of research. For those of you who heard Ramesh's talk in the morning, he was talking about genomic data. There is drug data coming out, there is clinical trials data coming out, there is a lot of data coming out, and this data has to be linked for you to make sense of it, because traditionally it all sits in silos. So now there is a huge push to link biology data, so that when the data sets connect to each other, knowledge maps will emerge, people will be able to look at new things, and we can harvest the collective intelligence that is out there in science.

But there is another aspect to this, which is in healthcare. There are buzzwords going around in the medical profession nowadays, like translational medicine or personalized medicine. What this really means is: the medicine that you take, even if it helps you, is it going to help me, even if I have the same disease? It may not, because you and I have different genetic makeup. As more genomes get sequenced, physicians can make very focused decisions about whether a medicine is going to be good for you, or whether it is not going to work with your profile at all, or whether you are going to have an allergic reaction to it. There are also lifestyle issues, you know:
the medication you are taking, is it going to have an adverse effect with alcohol? Does it have an adverse effect with some other medication you are taking? These are all issues that, at this point in time, physicians don't even consider. It's kind of scary, but when you go to a doctor a lot of these questions are never really asked; they just prescribe a medicine because it suits some broad spectrum. And then patients themselves, especially people with lifelong diseases, often become their own caregivers. They need to know a lot of information, and they need support communities where data is stored. So data needs to be linked so that people can actually discover it.

What kind of questions would people ask? Scientists are constantly looking at old data to see if the drugs they discovered 40 years ago could be used for something else. Can a drug be made to work in a different way? Can it target some other part of the body, some other proteins in the body? And by looking at existing data, can I come up with new hypotheses, because when I link up this data new patterns emerge? That is the science focus. For the patient, this is something you might want to know immediately. You have gone out, you have had a drink, you have a headache, and there is a particular painkiller you take. Depending on different things, you could actually have a fatal reaction, because a lot of adverse drug effects have happened with acetaminophen, which is Crocin, which all of us swallow every time we have a headache. So you sometimes need to be able to find out these things.

The ecosystem right now is really driving towards this whole linked open data thing. If you look at Europe, there is Open PHACTS, there is the Pistoia Alliance, there are a whole bunch of places that are pouring in money, saying put your data out there and we will provide the infrastructure. So that's like a huge
opportunity for people like us to play, because we can then look at that data and make sense out of it. Apart from that, even two years ago the standards were not mature enough to develop stable applications, but now it is a lot better. And then genome sequencing. Genome sequencing is the huge driver, and I have talked about it on every slide. They are talking about a thousand dollars to sequence a genome, and when that kind of data starts coming out, the amount of information we can start gleaning from it is huge. So these are the drivers from the ecosystem.

Then the pharma industry. Pharma has always held on to its data, all behind closed walls. But pharma pipelines are drying up and a lot of patents are expiring in the next few years; I think 2012 is when a lot of patents expire. So what do they do? They have to pool their pre-competitive intelligence to be able to come up with new drugs. So a lot of their old data is now being put out there, and big pharma companies like AstraZeneca are also putting their data out there. These are the drivers from the ecosystem.

So what kind of questions can you ask of this data? If you had information about aspirin and its target protein in the body in one database, and Tylenol, which is your Crocin, and its target protein in another database, could you then walk across these two databases to see which proteins aspirin and Tylenol have in common? Do two drugs work the same way? Do they have a common function? Do three drugs work the same way? How does a drug work in a disease? Essentially, you can slice and dice your data any way you want to get the answer you are looking for, and that is where the whole power of this technology lies.

So are there challenges? Generally there are a lot of challenges in the linked data space. I am going to talk about the challenges in biology, and I am going to focus on two of
them, which are semantic variation and trust, but I will touch on other things too. When people make these graphs, whenever there is a different context, you have to be able to say that the protein in this graph of data is the same as the protein in that graph of data. You should be able to match them. It is like matching schemas across databases, and it can become an extremely messy and inaccurate job. That is one of the big challenges.

The next challenge is actually a little easier. Look at this thing here, CDK1, which is a well-known protein in biology. It can be written in all these ways: CDK-1, CDK space 1. No big deal; that is a syntactic variation and it can be dealt with easily. But this is where it gets harder. These things also mean CDK1: cell division control protein 2 homolog, and where did that come from? All of these are the same thing, and that is because everybody who works with a particular entity in biology has their own name for it, but they are all the same thing. So how do you resolve this kind of ambiguity? This is a huge issue in biology, these name variants.

The second issue is one that is common across the whole linked data space. Suppose I looked at aspirin and asked: what are the proteins that aspirin reacts with, what are the proteins that those proteins react with, what are the molecular functions, what are the drugs, and what is the sequence? If I wanted to know how reliable this data was, then this path needs to be reliable, this path needs to be reliable, and this path needs to be reliable. So the trust you would place in this piece of data depends on the trust you would place in each of the things along the path. How do you calculate provenance, then? How do you assign provenance to something? People have come up with different models and different theories, so you have weighted nodes and you say, you know, if it
comes from this node to this node you would trust it more. But the way we have dealt with it is this. If you look at this D1 here, it could be a bunch of sources that assert this fact linking aspirin to a bunch of proteins, so it could be a set, D1 to Dn. We tell the user: you choose the ones that you trust, and then we calculate the path that way. We found that to be the easiest. But this is an issue we haven't solved entirely. I mean, we haven't solved it at all; we have just left it to the user to do it. It is an issue that continues to plague all data that you get from the linked data world. This is the whole issue of trust, and that's a mandatory update that we looked at.

I will move on to tell you a little about our platform. We have taken a bunch of proprietary databases in the biology space and converted them into linked data form, and we have an engine that queries across these databases and shows you the results. I am not going to dwell too much on this, because I am going to show you a demo, and I will do that right now.

So DistilBio was developed by Metaome, and I am going to show you a quick demo of some of the things I was talking about. We will start our search with Metformin. Do any of you know what Metformin is? What is it? It's a sugar control drug. Correct, yeah. I'm glad if you don't know it; that means all your sugar levels are under control. Metformin is a diabetes drug, and if I just typed in Metformin and hit search, I would get this kind of information about it: that it is a drug, it is an approved drug, it is anti-diabetic, and so on. And if you look at this graph that's extended from Metformin... Can you guys see it all? Is it visible?
I don't know what you can see. So basically, if you look at Metformin, it will give you all the patents that Metformin has, the reference publications, what it is used to treat (and obviously it is used to treat things other than diabetes), what its protein targets are, and what drugs it interacts with.

Now let's go ahead and do a new query. I looked at Metformin, and now I am not going to look at protein targets; I want to cast a wide net. I am going to type protein here and say: for any protein that Metformin is related to in a loose way, can I have the biological processes that those proteins are involved in? So here I am saying my drug of interest is Metformin, and I am now building a new query. I know there is a lot of biology jargon here, but I am trying to simplify it so you get the idea. Metformin is related to some proteins in a very loose way, and these proteins have some biological functions. Can I find out what they are? I'll go ahead and search.

The reason I did this is that, as you remember, we said Metformin is used for diabetes, but there has been a lot of talk about whether it can be repurposed for cancer. As a biologist, one way to find out is to ask whether it is involved in apoptosis, which is like a cancer process. So I am going to click on these biological processes and see. If I look here, it says it's working on DNA damage and apoptosis, which is like the cancer process. So it does look, in a very loose way, as if the proteins related to Metformin are involved in the cancer process.

Could I extend this further? I go back and modify my search, and I click on the protein, because that is where I want to extend this flow. I could extend it from any node. And I ask: are these proteins involved in cancer? So now I have a query that says Metformin has a bunch of related proteins, these have some biological
processes, and are these proteins also involved in cancer? What I am driving at is whether Metformin can actually be used for cancer. If you look at this, you see that these proteins are involved in several types of cancer, like colorectal cancer, gastric cancer, ovarian cancer, and so on. So, making a very wide and loose assumption, you could start investigating the idea that maybe Metformin could be used for cancer.

What I am trying to drive at is that none of these assertions were made anywhere in our data. What we really did here was look at linkages across the data, to find patterns and relationships and start building an initial hypothesis before you go and experiment with it. That is the power the linked data approach really has. You are able to look at a whole bunch of things that may not be apparent in a closed world system.

There are various things you can do with this interface. If I click on one of those proteins, I am able to filter down these proteins, and it filters down these other things too. I can also say that I am interested in some of these proteins, say this one. Do you have a favorite? Are there any biologists in the audience who are interested in any of these?

Can you look at cell cycle proteins?

Sure. I've got the GO biological processes highlighted, and I am just going to type cell cycle.

Can you do the metadata for this? How did you get all the different data sets?
The data sets that we got are all public domain data sets. What we do is extract the metadata from the different data sets and go through a mapping process, where we map it to what we call a meta ontology. The meta ontology is built by domain people, because it requires a certain understanding of the domain. Certain relationships you will get directly from the data itself, but certain things, like asserting which relationships are transitive or which are appropriate inverse relationships, that layer of metadata gets built by domain experts. Underneath, we have SPARQL. We use Virtuoso, which is a data store, and we use SPARQL to search. The engine that generates the SPARQL is ours; we built that. The queries to the underlying database are ours; we built that.

Any other questions that you were interested in?

Could this eventually lead to analysis, or do you just give away the data and the linkages, and then this is...

Yeah, I can hear you.

The rest is left for the user to...

Yeah. So first of all, we don't make any claims about these assertions. We are saying this is where we are getting it from. You could use certain tools on top of this, and if you are a biologist, certain bioinformatics tools, and that is something we are beginning to integrate slowly into our...
BLAST and re-view. BLAST and re-view will be more or more re-view. We already have BLAST and sub-view, and we are going to do very basic things like ClustalW, and R for statistical analysis, that kind of stuff. But we won't be doing the analysis; that's up to you, sir.

What I am asking is, would the user be able to stay in your system and, through your system, get the analysis done, whatever he chooses it to be? For example, he has chosen all these words like metformin and cancer. Unless he has some background knowledge he is not going to go and check for it, simple as that, the researcher. So would he then have something in your system to go further and find out the relationships, you know, the statistically significant associations?

Okay. So right now we provide the associations that he can find out. What is statistically significant, yeah, it would be possible in a future release, but we don't have that yet. So that is the next part of the plan; we don't yet look at what is statistically significant. We can do something very similar to PageRank on the graph that is available and the connectivity that is there, but we have not done that yet. And also, you know, public data has its ups and downs, so I am actually open to suggestions here: how does one rank those public databases? Maybe we could discuss that.

That brings me to another question, please: is it real time?

No, this is not real time, because these are existing databases.

Right, but existing data also updates, right?

Yeah, when those databases update, we update too.

Okay.

But again, the next thing we have been asked about is real-time data. When we go to the healthcare side and data starts coming in from things like patient records, will you be able to handle it? That again is something we are thinking about for the future. The other thing is that people can actually put out their own data.

You said that you use SPARQL. Does that mean you are sending SPARQL queries to the various public databases?

Yeah,
sir. Yeah, so there are actually two sides to the story. Right now, because these SPARQL endpoints are a little iffy, we have aggregated the data. But we do have another mechanism where we can go out to the different endpoints, which is called data federation. What I showed you now is the aggregated one.

Any other questions?

If the underlying data came from some other domain, like, say, libraries, would the same search engine work?

Yeah. Basically the issues would be with the front end, where we start auto-suggesting, and the auto-suggestion is domain dependent. So if you were able to tell us the vocabulary in your domain and the ontology associated with your domain, then the search engine would work. We went and put it on top of a publisher's SPARQL endpoints and it worked, because we were able to automatically extract their schema and go from there.

I can see that in a library also it would be very useful.

Yeah, absolutely. If you are reading one book, you want to find out what the related books are. There are a lot of different places for this. Right now we are beginning to work on content as well, and libraries would be right along that line.

Who are your potential users for something like this?

Potential users right now, at the level this is at, which is only in the science community, would be scientists, who could be in large government labs, cancer labs, or within the pharma and biotech industry.

So it's not something that encourages self-diagnosis?

Not at all.

Would that be useful?

I don't think self-diagnosis is a good idea at any time.

And I think this wouldn't ever lead to that either way.

No, no. I mean, you can't stop people from doing self-diagnosis. Even I do it, but it's not a good thing to do. It's not something I encourage.

It's about gathering data to link the objects.

So Ram, do you want to take that? Ram is my co-founder and he has been dealing with this.

Yeah, the question was what level of manual curation goes into it. So there's a bit of things like, for example,
the semantic and syntactic variations that Kalpana talked about. At some level it is very hard. We have our own indices, which are partly manually built, and we try to automate as much as possible in the scheme of things, but yes, definitely, there is a manual component involved in doing this. Yeah, it's not true. Effectively, we can rapidly turn around any data store once we understand the schema.

Yeah, but what I am asking is, what is the level of errors you might find when you are suggesting something?

Right. The point is that we do not make any interpretations of your original data. We deal primarily with structured data, so we go to places like DrugBank, UniProt and so on, where the column name or the attribute explicitly tells you what kind of relationship it is. Was that your question? Are you wondering how much error is introduced by manual curation?

That's a part of it, basically. When you are automating it, you might introduce more errors.

No, actually, you can argue the flip side too: if you are running an automated process and you start doing some kind of NLP or IR, how are you sure that whatever is coming out is right or not? So effectively we try not to do any interpretation of the data that is available, and at this point in time we deal primarily with structured data, so it is very clear what the relationship is. That is how we do it. For example, if you noticed the targets example that Kalpana was talking about, DrugBank specifically says that X and Y are targets for the drug, whereas if you go to PharmGKB, it just says it's related. So we use the same terminology, and say that this is just a relationship. It is an ambiguous relationship; there is no causality involved here. We do not take a position on what is a cause or not.

But you talk about PharmGKB, and then it gives you the actual
statistical relationship, which is directly useful to your research or to your project at hand. Where is this?

So that's the other part. I think we can go into showing how we display the sources. We say it is this particular source, and what we are trying to do next is show where it actually comes from: is there a publication, is there a score associated in the original database? So the researcher has to go to that publication, and we show it here in context where it is available.

With the data that is given in that publication, curated, and the original database provides it, for example PharmGKB, for saying that this is related to this...

It says this is the paper. If it points to a particular paragraph we can show it directly, but we will not go and look at the original paper and say this is where it's coming from. So we rely essentially on multiple data sources to show it.

So it's like a combination of a search engine and a data...

It's an integration platform, and we allow people to search and form queries pretty seamlessly. We are not doing any curation of literature at all; that's not the business we are in. So we are not taking unstructured data in general. There are a lot of people who do that; that's a different business plan, and unstructured data is a completely different one. So we are using metadata from places like PubMed and so on. So sort of bringing this together by itself, the volume is very nice.
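The approach to trust described in the talk, where every assertion keeps its source and the user picks which sources to trust before a path is computed, can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's actual implementation: the entity names (`DrugX`, `ProteinA`, `ProteinB`), the predicate names, the source labels, and the `follow` helper are all hypothetical placeholders, and the data is made up.

```python
# Each assertion is (subject, predicate, object, source).
# All values are invented placeholders, not real biological data.
ASSERTIONS = [
    ("DrugX", "relatedTo", "ProteinA", "source1"),
    ("DrugX", "relatedTo", "ProteinB", "source2"),
    ("ProteinA", "involvedIn", "apoptosis", "source3"),
    ("ProteinB", "involvedIn", "cell cycle", "source3"),
]


def follow(start, predicates, trusted):
    """Walk a chain of predicates from `start`, using only assertions whose
    source is in `trusted`. Returns a list of (nodes, sources) pairs, so
    every path carries the provenance of each hop."""
    paths = [([start], [])]
    for pred in predicates:
        next_paths = []
        for nodes, sources in paths:
            for s, p, o, src in ASSERTIONS:
                if s == nodes[-1] and p == pred and src in trusted:
                    next_paths.append((nodes + [o], sources + [src]))
        paths = next_paths
    return paths


# Drug -> related proteins -> biological processes, trusting two sources only.
selective = follow("DrugX", ["relatedTo", "involvedIn"], {"source1", "source3"})

# The same query when the user trusts every source.
everything = follow(
    "DrugX", ["relatedTo", "involvedIn"], {"source1", "source2", "source3"}
)
```

With only `source1` and `source3` trusted, the hop through `ProteinB` is dropped because it rests on `source2`, so `selective` contains a single path; `everything` contains both. Keeping the source list alongside each path is what lets an interface show where every link came from.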