Okay, I guess I'll start. Before I start, who was at my talk at 12:30? Okay, there will be a tiny bit of overlap. I'm at a NoSQL conference and I'm explaining triples again and again, so I apologize to those hands. Next question: who in this audience is kind of familiar with RDF? Oh, that's good. Okay, that's more than half. All right, but for the other half we'll do a little bit of tutorial work to explain what triples are all about.
Okay, so my name is Jans Aasman. I work for a company called Franz Inc., and our main product is an RDF graph database. I will talk for about 20 seconds about my company, then I will go a little bit into what a graph database is, and then we'll start with an example. I will try to explain how a triple store is kind of a specialized form of graph database and what the difference is. And then most of my talk will be about some use cases of an RDF graph database. First we'll talk about supply chain management for a car manufacturer. We will talk about a reporting platform in the oil industry in Norway. And we'll talk about a telecom customer called Amdocs that is building a platform that knows literally everything about you as a customer. So those are the three main use cases. I'll talk a little bit about why people use graph databases, when you should use them, and when you shouldn't use them. And I guess that's about it. I like interactive presentations, so if you want to interrupt me, be my guest. I like that a lot.
Okay, so we're a company that started in '84 and has always been in the Lisp and AI business. For the last eight years we've been fully engaged in semantic technology. We started out of Berkeley and we're now in Oakland.
Then, about graph databases. I had prepared a lot more about graph databases, but I see that this NoSQL conference is actually a graph database conference, so I don't have to say too much about it. I'm not even going to explain it: you know, things linked by links.
That's about it. And there are a lot of graph databases. This morning there were some talks about them. So again, it's a big list of products that call themselves graph databases. It's kind of a mix of in-memory stores and non-memory stores, RDF triple stores, and graph databases, taken from Wikipedia.
Now, what is the difference between a relational database and a graph database? For years now I've been using this little example. If you want to represent a person in your database and you have a bunch of one-to-many relationships, like the fact that you can have multiple spouses in your life, multiple schools, multiple children, et cetera, then before you know it you have a whole bunch of tables and link tables. In a graph database you can represent the same information as nodes with links in between them. So it's not rows and columns; it's basically triples: nodes and the links combining them.
How is it different? Why is it more flexible? Well, in a graph database there's no schema. You can say whatever you want to say. There are no link tables, because one-to-many relationships are just native to these graph databases. There are no indexing choices, because everything gets indexed. That's not entirely true, but let's keep it at that for now. And it's a very low-level representation of your data, so you can take almost anything else and turn it into a graph representation.
And then the triple store is just a specialization. If you take a look at the nodes and the links in an RDF triple store, then the nodes are always URLs. So you can basically put them out somewhere on the web, dereference them, and get more information back. And the predicates are URLs too. These URLs are unique. You can have databases with RDF in them all over the world, and as long as you use unique names, you can combine them and create a knowledge graph.
So let me give you a little demo. I'll do a different demo for the people that saw my demo at 12:30.
Usually I give a great demo from the pharmaceutical domain, but let me do one about news right now. So who here knows about the linked open data cloud? All of you by now. So you probably also know that Facebook, Bing and Google are building their own proprietary knowledge graphs, where they take all the people they can find, the organizations they can find, places, important events, and link them all together so that you can make an incredibly big encyclopedia of almost everything in the world. You can use that to make your search better, to give better answers, et cetera. But there's also been a public version of that for a long time now, actually from before the knowledge graphs of the big guys. We called it linked open data. If you want a great explanation of linked open data and why it's so fantastic, go to TED Talks and search for Tim Berners-Lee. He gives this wonderful talk about why we should take all the data in the world, go to RDF, make sure that the names link up, and then we get this web of data that will give answers to all our questions. And again, as I already said, the most important thing is to make sure that all the nodes in your graph are unique, identifiable URLs.
So this movement started in 2007. We already had a whole bunch of databases. In the middle, DBpedia has more than 300 million triples for the English language. It's the triple version of Wikipedia. Who does not know about DBpedia? Okay, so you take Wikipedia, you take the info boxes and some extras, and you represent all the information just as triples. And then it gets republished. Now, you see links. For example, there's also a thing here called GeoNames, which is about 150 million triples. It's the triple version of geonames.org. GeoNames is an organization that has all the places in the world with their alternative names in other languages, with the latitude and the longitude, how many people live there, and the classification of that particular place.
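To make the idea of triples with unique URLs a bit more concrete, here is a minimal sketch in plain Python (not the speaker's product or any RDF library). The `http://example.org/` namespace, the predicate names and the people are all hypothetical, made up for illustration; the coordinates for Berlin are approximate. It shows two things from the talk: one-to-many relationships need no link tables, and two separately published datasets combine as soon as they use the same URL for the same thing.

```python
# Hypothetical namespace and predicate names, used only for illustration.
EX = "http://example.org/"

# Dataset A: a person with several one-to-many relationships.
# Each fact is one (subject, predicate, object) triple; no link tables needed.
people = {
    (EX + "person/alice", EX + "hasSpouse", EX + "person/bob"),
    (EX + "person/alice", EX + "hasSpouse", EX + "person/carol"),
    (EX + "person/alice", EX + "hasChild", EX + "person/dave"),
    (EX + "person/alice", EX + "bornIn", EX + "place/berlin"),
}

# Dataset B: published separately, but it reuses the same URL for Berlin.
places = {
    (EX + "place/berlin", EX + "latitude", 52.52),    # approximate
    (EX + "place/berlin", EX + "longitude", 13.405),  # approximate
}


def objects(graph, subject, predicate):
    """Return every object linked from subject through predicate, sorted."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


# Combining the two datasets is a set union: the shared URL lines them up.
knowledge_graph = people | places

spouses = objects(knowledge_graph, EX + "person/alice", EX + "hasSpouse")
birthplace = objects(knowledge_graph, EX + "person/alice", EX + "bornIn")[0]
latitude = objects(knowledge_graph, birthplace, EX + "latitude")[0]

print(spouses)
print(birthplace, latitude)
```

A real triple store adds indexing, persistence and a query language on top, but the data model is essentially this set of three-part statements.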
Yes, it's a huge database. And what you see is that there are links between all these things. For example, DBpedia will talk about Berlin, but one of the triples will say "Berlin has GeoNames ID" and then the number, and then the GeoNames database will have an entry with that number, with a lot of extra information. So all these databases kind of link together. So this was in 2007. This is now 2010. The 2012 one is too big; I can't show it on the screen anymore, at least, and this is not readable anymore. These are, by the way, also awful colors compared to my screen. So this is probably 20 billion triples, all from the pharmaceutical domain: all the genes, gene descriptions, protein descriptions, drugs, diseases. What else do we have? This is publication data, all the green stuff, which is not green for you, I guess. This is multimedia information. The government is putting a lot of effort into taking their data sets and making them available as RDF. And even stores are now putting all their inventory out as RDF. So it's a big movement to make data available.
So how can you use that stuff? This is my demo. I have here a tool called Gruff. It's a free tool. You can download it from our website. What it does is allow you to load data sources from the web, put them into a database, and then explore. So at some point I made a new triple store. At some point I loaded some files from the web. And what I did for this demo is this: we have a crawler that you can kind of intelligently steer. One of my hobbies is to take all the politicians in the United States, and every day I go to Google and Bing and get all the new news articles about these politicians. I collect them. Then I apply an entity extractor to get the place names, people names, organization names, and other important things out of there. And then, for example, for every place name, I link the place name to GeoNames. So now I have the latitude and the longitude.
And for every person that I find I go to DBpedia or to contactingthecongress.org, which has all the politicians' names. And so I enrich my data sets. I go from just a blob of text to more precise places, organizations and people as triples, and link that to other data sets. And suddenly I can ask very intelligent questions of these data sets.
And this is one day... This is not about politicians. This is just one day of Google News, and this was in 2010. So I can look for, say, Obama and health care. And this is for one day. I found three articles. So now you see three triples that have the words Obama and health care in them. I can look at one of them, and here you see some examples of triples. So, text 177. Look at the bottom, by the way, to see that it's a URL; this is just a readable label for you. It has a concept, COBRA. I guess all Americans know what that is. These are some triples that describe the categories of this particular text. At the bottom you see some people that I found in this particular text. And then I went to DBpedia to find more information about each of these people. So I got some linked data names back, people that I found from DBpedia. Say, Mike Maronna. And here I see more information about this particular person in my database. I also have a bunch of place names. So I can go to a particular place that I found, which I linked up with GeoNames, and I have more information about that. Does that make sense?
So I can go back to the graph view. I see these texts. And I can say, okay, I want to explore the graph, but I want to choose the way I go through my graph. So I say, well, I'm only interested in the people, the places and the organizations, both going out of and coming into a particular concept. So I choose those. And now I can click on this guy here. This is such a small screen. Let me just do this. So these texts are related through these things.
Yes, so we see these different texts. We see how the graph connects. I can take a completely different topic, like the CIA. Somehow there will be a few texts about the CIA. And I can say: so how does this text about the CIA ultimately link to the text about Obama and health care through these three predicates that I chose? Or I can just say, well, let's see how this guy relates to the State Department. Well, that's not very exciting. This gets really more exciting. You get the point. So I can ask the database to find connections between concepts and find a shortest path based on what I want to see in the shortest path. So this is the graph part of the data that I have.
Now let me show you how I can do some more intelligent queries. First I'll show you a SPARQL query. SPARQL is the query language of the web. Basically, if you know SQL, you can probably read it. You say: select every distinct text and X where the text has a name X, and this X has the WordNet type scientist. When I run that query, I find a number of scientists that were in the news that particular day. Now please realize that this is already powerful, because I just took a text and took the people names out. The text, of course, doesn't say for every person that he's a scientist. I figured out it was a scientist because I took the person's name, went to DBpedia, and found their role in some particular way. So this already shows the power of enriching your data with other data from somewhere else in the world.
I can go deeper. I can do this query. I can say: give me every text that has a place name that is within 100 miles of Tampa. That is the first line. Now for each of these texts, find out if there's a scientist in the data. And if there's a scientist in the data, then give me the title of this article. So this is a query where I now do two things.
One is that I look at the role of people by looking somewhere else, but I also look at the latitude and longitude of place names. So I can now do geospatial search too, in the same query. And I can run the query, and then there's only one guy left here, from text 540. I can look at this text. I'll actually do it with this. So this was the text about this particular guy. Anyway, you get the point, I hope. Any questions about this so far?
It's a good demo of the integration possibilities and discovery. You mentioned Tim Berners-Lee. One of the examples he talks about is the Deli example, where you've got these enormous numbers of columns, and you can filter on any selection of those columns. Is there a similar capability that Gruff or AllegroGraph might have to do that kind of filtering as well?
Yeah. I mean, yeah, it works straight out of the box.
Straight out of the box?
Yeah. We can talk offline about it. There's a lot to say about that. Sorry, yeah.
What about issues of data scrubbing? You've got all this data in the system. How do you gauge its accuracy? Aren't there ways to cross-check it?
Well, there's lots of data scrubbing going on. One thing is that if you get the text from HTML pages, there's so much junk around it. So it's already a hard task just to get the text out of it. There's actually a great new startup in the Bay Area. I forgot the name, DIG. What they do is apply visual recognition techniques to a web page. So it's like a human being looking at a web page and saying, this must be the text that people want to read. Because if you look inside, you see all these divs and God knows what. Anyway, that's one technique that we're looking at to make that more fine-grained. But the other thing is: how do you find Bill Smith in your text? How do you know which Bill Smith in DBpedia this is? So there are all kinds of techniques being developed.
You look at some of the words around Bill Smith in the newspaper article, then you go to DBpedia, to his description. And if you find the... what's the name of this technique? Anyway, you take the words around that person, and then you look at the...
You're talking about gaining a confidence level?
Yeah, it's just about confidence level. It must be this particular one: it must be the boxing guy, or the politician, or the entertainment guy. But there's definitely a lot of work that goes into making this more and more reliable.
So this is... Is this okay? So this is one little use case. And by the way, I have this great politics demo that I sometimes give. Politicians and lobby groups are really interested in this particular technology. But let me go back.
So who uses this in the enterprise? What companies use this in the enterprise? Well, about half our customers are in the DoD and the intelligence agencies. We're right now at version 4.8 of our product. We've had customers that bought version 1.0 and 2.0, when it was still almost like a toy thing. From version 3 we had the first production use, and now we're at 4. So we've had a lot of help from the DoD community, just keeping us alive and getting better and better at what we do. Lots of stories here, but let's not go in there. And then there are commercial customers all over the place. Hospitals are getting really interested in this, for things like medical device management or getting a 360-degree overview of your patients. Media companies have a lot of metadata about the products that they have. Pharmaceutical companies have by far the most complex data that you can imagine. They're all interested in this technology. But I'm going to talk about these three use cases. About Amdocs, a telco platform that knows almost everything about every customer in real time. About a car manufacturer; I'm under NDA and I can't say too much about it, but this is a great case.
And about EPIM, a reporting platform for about 31 oil companies in Norway.
So let me start with Amdocs. Telephone companies are almost the scariest companies in the world from a privacy point of view, because they know who you talk to. They have your location. If you take the location of people and who they talk with, and you have good GIS databases and know where things are, then basically you can figure out where someone lives. Of course they have your billing address, so they know where you live. They know how many people live in your household, and your gender, because they know the kind of stores you go to, your religion, your friends, all of that. But what they also want to know is: if you're going to call the call center, what are the things that you're going to call about? Because telephone companies lose a lot of money by just wasting time with the customer trying to figure out what's wrong. They can now figure out in great detail what you're going to call about even before you call.
So anyway, what we did in this project is take information from more than 40 different databases. Events happen there. As soon as an event happens you turn it into a few triples that describe that particular event. It goes into a staging area. And for each event you kind of recompute the state of your customer, and so we get up to 5,000 to 10,000 triples per customer that describe everything about you. And it's not just the simple things; it's also subjective information, like "you're a good payer" or "you're an angry customer", or patterns like "this guy always pays 5 days late, but at least he pays". You have trends, you have geospatial things, you have temporal things, probabilities, absence of occurrence, etc. So it's a marketer's dream. As for how it works, I'll just go very quickly through this, and I'll put this on the website too, so don't worry.
We start with all these datasets, structured data and unstructured data, but whenever something happens we instantly unify it by creating a few triples, and we put these in this particular staging area. Then, whenever something happens, we see: oh, this is an event for a particular person. We get all the triples we have about that particular person and load them into memory. We apply hundreds of business rules, and those business rules will create new triples or delete triples about that particular person. Then we also take the state of the customer and put it through a Bayesian belief network to do some predictions, like: is this guy going to switch from telephone company A to telephone company B? What are the three most likely things you are going to call about when you call the call center? Will you actually pay your bill? Et cetera.
When you say you apply hundreds of business rules, are these business rules that you have collected and have to identify yourself, or is it that you already put all of these business rules in the product?
No, for every industry you have to develop the business rules. So you start with the basic concepts, like...
So basically it sounds like you can define business rules?
Yeah, but you start out with concepts like a bill and a payment and a customer and a device. That's low level, and that's something we turn into rules, so that the people who write the business rules don't have to go all the way down to the level of triples every time. We create high-level concepts, and there is a mechanism where we can describe, in a fairly high-level language, policies of what to do. And when I say we apply hundreds of business rules, that's actually not true, because if A calls B, then we say: oh, we only have to call the rules that are about people calling. And then we update the frequencies in your social network, we apply the current balance, etc. Does this make sense? But all in all there are four or five hundred rules in the system. Okay, so that's the first use case. Any more questions about that?
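The event flow described above (an event arrives, the customer's triples are loaded into memory, and only the rules relevant to that event type add or delete triples) can be sketched roughly as follows. This is an illustrative toy in plain Python, not the production system: the event fields, predicate names such as `callCount:` and `hasPattern`, and the two rules are all assumptions made up for the example, and the Bayesian prediction step is left out.

```python
from collections import defaultdict

# Rules are registered per event type, so a "call" event only runs call rules.
RULES = defaultdict(list)


def rule(event_type):
    """Decorator that registers a rule function for one event type."""
    def register(fn):
        RULES[event_type].append(fn)
        return fn
    return register


@rule("call")
def update_call_frequency(state, event):
    """Replace the old call-count triple for this contact with count + 1."""
    customer, other = event["customer"], event["other_party"]
    predicate = "callCount:" + other
    old = [t for t in state if t[0] == customer and t[1] == predicate]
    count = old[0][2] if old else 0
    state.difference_update(old)                  # delete the stale triple
    state.add((customer, predicate, count + 1))   # assert the updated one


@rule("payment")
def flag_late_payer(state, event):
    """Add a subjective pattern triple when a payment arrives late."""
    if event["days_late"] > 0:
        state.add((event["customer"], "hasPattern", "paysLate"))


def process_event(store, event):
    """Load the customer's triples and apply only the matching rules."""
    state = store[event["customer"]]
    for fn in RULES[event["type"]]:
        fn(state, event)
    return state


store = defaultdict(set)  # customer id -> set of (subject, predicate, object)
events = [
    {"type": "call", "customer": "cust1", "other_party": "cust2"},
    {"type": "call", "customer": "cust1", "other_party": "cust2"},
    {"type": "payment", "customer": "cust1", "days_late": 5},
]
for event in events:
    process_event(store, event)

print(sorted(store["cust1"], key=str))
```

Dispatching on the event type is what keeps the per-event cost low even when the full rule base has several hundred rules.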
Okay, the second one is risk in supply chain management. We did this as a proof of concept for one car company, and we see a lot of industries that all have the same problem. So let me just describe it. You're a car company and you see that there is a flood somewhere in the world, and you ask your vendor for parts: is this going to affect me? And the vendor always says, oh no, don't worry, no problems. Until one day they say: sorry, that particular part is gone, because your competitor bought it all. Because you have this just-in-time logistics, for car companies but for a lot of other products too, you're really scared of anything that can disrupt your supply chain. So this is a list of questions that almost every manufacturer is interested in. Anyway, I'm not going to read that list.
Can you go through a hypothetical where the earthquake took place here, and how that would be affected? Those would be good questions to ask.
Yes, I'll get to that point. I think I'll get to the point; if not, then interrupt me.
So we wanted to be able to answer these questions, and for that we had to bring together three clouds of data. You start out with the bills of material for a particular car. These car companies have a bill of material; they actually have nice booklets that describe every part. It's like a hierarchy, a taxonomy of all the parts that they have. And then they know for each part where they buy it from. They call those the first-tier vendors. And you even have a parts inventory. So that's one cloud of data that you have to bring together and link up. Then you look at the supply chain for the first-tier vendors. Now, the problem is that your first-tier vendors won't tell you where they get their stuff from. So this is actually a really hard problem. If you're called Walmart, or you're called the United States Air Force, then you can ask that question and get an answer. For example, if you look at the nuclear engines in a submarine, then they actually know, down
to the last screw, which steel mill this particular screw came from, because they really want to know all the way down, including all the people in the middle. But for most of the world it's really hard to figure out. So that's a lot of work: who sells the sub-parts? And then, if you go all the way down from the sub-vendors to the producers, you also want to know where they are located. So that's the second cloud.
And then finally what you have to do is what I just described in my demo: spider the web every day for every producer, sub-vendor and vendor and get news about these vendors. But you also want to spider the countries. You want to spider Japan and Thailand and whatever, every country in the world, and you want to look for problems with commodities, you want to look for natural disasters, you want to look for political unrest. And you bring that together in the database. So this is just a picture of how this all ties together in the supply chain. It's nice, but let me not talk about it.
What I'm going to show you is how this all ties together. So let me just show you. We have a part from a vendor, who ultimately buys it from someone in Bangkok. There's a newspaper article that says that there are floods in Bangkok, and the vendor is in Bangkok, so we have a risk. I just showed you graphs, so you now recognize what's here on the screen. You go all the way down from the particular part to the in-between company (this is, by the way, made-up data in this case), to this particular part, to where it was made, to Thailand, where there was a particular newspaper article that talked about a flood in Bangkok. And then this is the kind of rule that we have. We say there is a natural disaster warning for a particular vendor and a part if there is a text for a particular country, where this country has a place, where this place has a vendor, and where the vendor produced a particular part. Does it make sense?
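The natural disaster rule just described can be written as a chain of triple-pattern matches. The sketch below uses plain Python over a small in-memory set of triples rather than a real triple store or rule language; all identifiers and predicate names (`hasDangerWord`, `aboutCountry`, `hasPlace`, `locatedIn`, `produces`) are hypothetical, and the data is made up, as it was in the talk.

```python
# Made-up data spanning the three "clouds": parts/vendors, locations, news.
TRIPLES = {
    ("text1", "aboutCountry", "Thailand"),
    ("text1", "hasDangerWord", "flood"),
    ("Thailand", "hasPlace", "Bangkok"),
    ("Japan", "hasPlace", "Osaka"),
    ("vendorX", "locatedIn", "Bangkok"),
    ("vendorX", "produces", "partA"),
    ("vendorY", "locatedIn", "Osaka"),
    ("vendorY", "produces", "partB"),
}


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def disaster_warnings(triples):
    """Warn about (vendor, part) if a text with a danger word is about a
    country that has a place where a vendor producing that part is located."""
    warnings = set()
    for text, _, danger in match(triples, p="hasDangerWord"):
        for _, _, country in match(triples, s=text, p="aboutCountry"):
            for _, _, place in match(triples, s=country, p="hasPlace"):
                for vendor, _, _ in match(triples, p="locatedIn", o=place):
                    for _, _, part in match(triples, s=vendor, p="produces"):
                        warnings.add((vendor, part, danger, text))
    return warnings


print(disaster_warnings(TRIPLES))
```

Only the Bangkok vendor is flagged here, because no text with a danger word is linked to Japan. In a rule language such as Prolog, the same rule would be a single clause joining these five patterns.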
So this ties all these things together. I now go with my query through all three clouds to figure out that there is a particular risk. "Has danger word" means it could be political unrest or a flood or anything. Does it answer your question? Yes? I haven't done it, but you also want to do simulations where you say, okay...
When you're speaking of clouds, does that mean you have different data sets that are maintained by different people, that you're querying and pulling together through this query?
When I say cloud, I basically mean there are three areas where you have to collect data. It's in the vendor part, it's in this big world in between, and it's in the world of news. That's what I mean: three domains where you have to do all kinds of different work.
In which case it could be hundreds of data sets?
Oh yeah, it could be hundreds of data sets.
So it's a scale-out methodology across hundreds of data sets that are on different sets of servers?
Well, the first two clouds are not that incredibly big. I mean, that's just a few tens of millions of triples. The text will get bigger and bigger, but fortunately with the text you're mostly only worried about the now. You might want to do historic analysis, but you're not interested in the flood in Bangkok three years ago; you want to know about the last few days. So just a few tens to a hundred million triples will cover this particular problem. Now, I've been thinking about whether you could offer this as a service to many industries, and then of course you get a huge cloud. Then you have to track many vendors, many products, many commodities, many disasters. Yeah?
Are they federated?
No, no, it's all in one huge database. Well, I'm sorry, when I gave you this demo it was federated. But if you federate on the same machine, then performance will be just the same. Federation over multiple machines is going to slow you down, but if you can keep your databases on the same server, then we actually keep all these databases separated, because it's much easier to manage. If you just put in one big massive amount of triples and then you want to make big changes, there's a lot of administrative work. If you keep it nicely separated, then it's easier to manage. How much more time do I have? 15 minutes. Okay.
So then another project that we did, and that's now in production, is for a consortium in Norway. In Norway, like in almost every country, you have a bunch of oil producers that have rigs in the sea or in the ocean. They produce something every day, and the government wants to know how much you produced at any point in time, and whether there were any problems on the oil rigs, etc. But of course the government doesn't tell you what data format you have to give them, so people give them a text or an email or an XML file or a database. So it's a mess. EPIM is a non-profit organization in Norway, sponsored by all the oil companies, that has the task of taking care of the IT of this platform to report to the government every day about the production. So here are the oil companies that all take part in this effort. And so you use semantic technology; it's a really nice way to integrate data. You start with multiple formats: XML in this case, Excel, relational databases. You create a mapping from the XML to the triples you want to have, so you get mapping rules and models. You put everything in a repository, and then we have all kinds of output templates to ultimately produce XML again, or Excel, HTML, JSON or whatever you want to have. And it's a very heavily ontology-driven project.
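As a rough illustration of the integration pattern just described (map an incoming XML report to triples, store them, and render an output format from the stored triples), here is a small sketch using only the Python standard library. The XML layout, the operator name, the figures and the predicate names are all invented for the example; the real reporting formats and ontologies are not shown in the talk.

```python
import json
import xml.etree.ElementTree as ET

# A made-up daily production report; the real reporting schema is not shown.
REPORT_XML = """
<report operator="ExampleOil" date="2012-01-01">
  <rig id="rig-1"><production unit="barrels">1200</production></rig>
  <rig id="rig-2"><production unit="barrels">800</production></rig>
</report>
"""


def xml_to_triples(xml_text):
    """Apply a fixed, hand-written mapping from report XML to triples."""
    root = ET.fromstring(xml_text.strip())
    operator = "operator/" + root.get("operator")
    triples = []
    for rig in root.findall("rig"):
        rig_node = "rig/" + rig.get("id")
        production = rig.find("production")
        triples.append((rig_node, "operatedBy", operator))
        triples.append((rig_node, "reportDate", root.get("date")))
        triples.append((rig_node, "producedVolume", int(production.text)))
        triples.append((rig_node, "volumeUnit", production.get("unit")))
    return triples


def to_json(triples):
    """One possible output template: render the stored triples as JSON."""
    return json.dumps([{"s": s, "p": p, "o": o} for s, p, o in triples])


repository = xml_to_triples(REPORT_XML)
print(to_json(repository))
```

In an ontology-driven setup the mapping would be declared as data (mapping rules tied to a model) rather than hard-coded in a function as it is here.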
Who here is familiar with ontologies? Oh great, I don't have to explain that. On this we work with a partner called TopQuadrant, also here in the Valley. They have some really great tools to create ontologies. We have this ultimate dream that you can build applications by just specifying all the ontologies on every layer of your application, and then it should turn into a runnable model. So we applied that methodology in this project, but it goes too deep to try to explain it all now. I hope this picture tells you some of how this all fits together.
Okay, so then the question: when do you want a graph database or triple store, and when do you use a SQL database or a NoSQL database? I get this question all the time, of course, and at my talk at 12:30 I talked about it in great detail, but let's redo it briefly here. I'm here at the NoSQL conference, so... Many times when I talk to potential customers and they describe the problem to me, my first question is: why don't you use a SQL database? Your ontology is not that complicated, it looks fairly regular, you don't make many changes over time, so why not stay with SQL? Because what if a customer that was happy with SQL goes to a graph database? Some of the things relational databases do are really good. They've worked for 30 years on making joins very fast. But if they can convince me that their application is really complex, and there are a lot of changes over time, and there's a graph in the data, then of course they still have a lot of choices. They can go to any kind of NoSQL database, or they can take one of the graph databases, or they can go to a triple store. So I tried to make this picture, which a lot of people seem to like, about when to use what. If you have billions of same-type objects and you need to retrieve them extremely fast, then NoSQL databases will work really well. If you have kind of a fixed-size, static data set and you need fast graph computations and pattern
matching, then you can use the Cray solution, Neo4j or AllegroGraph. But if you need all the features of a relational database, so you want to be ACID and transactional, but you also want to be ontology-driven and you need a lot of rule applications, then an RDF triple store makes a lot of sense. So here I'm trying to summarize when you use a graph database: where you need linkability; where you model knowledge and assets with hundreds of thousands of classes and different features, with new classes and new features every day; where you need ultimate linkability. When you need pattern recognition and network analysis, then you should probably consider looking at an RDF triple store.
And finally the last point: when you need event processing. We've spent a lot of time dealing with events where you have a geospatial component and network analysis. In our triple store we have a simple event ontology, so we can have events like a meeting, a telephone call, a financial transaction. Every event always has a list of actors. If it's a payment you have two actors, or even three if you go through someone else. A telephone call, whatever you can think of, there are always actors. There's always a place where something happens, most of the time it has a duration, and then there are a lot of other things that describe events. So if you live in a world of events, then we provide very detailed social network analysis. We implement almost a complete handbook of social network analysis. So the kinds of questions you can ask here are: how far is P1 from P2? To what groups does this person belong? How important is this person in the group? Does this group have a leader or not? We do geospatial. We're not as extensive as, say, Oracle geospatial. What we do very well is proximity search. So you say: I take one point, give me all the points that are within a certain radius. That is what we kind of specialize in. And then we do a little bit of polygons, but not all the things that Oracle can do. But what we do is really very fast. Then
we do temporal reasoning. You probably know about Allen's logic for time: if you have two intervals, there are 13 ways they can relate to each other, and if you add points then it's a little bit more. What we do is provide a whole bunch of functions that make it very easy to talk about both time and place in the same query. So we can do queries like this: find all the meetings that happened in November within five miles of Berkeley that were attended by the most important person among Jans's friends and friends of friends. So in one query I'm doing social network analysis: first I said, give me the ego group around Jans, two levels deep, then give me the most important people back. Actor centrality is one of the measures in social network theory; it says a person is more important in the group if he's more central to the group. So anyway, you find the most important person, and every event for this particular actor where the event is a meeting, and where this event happened at a particular time in the fall and within five miles of Berkeley. Try to do that in a NoSQL store or in almost any other technology. It works really well in a graph database. And I guess that was it. Thank you very much. So, any more questions?
I worked with SPARQL for a while and tried some rules with SWRL in the past, and performance was too tough for general rules. I may be out of date on this now, but you have some kind of rules support. What kind of rules support, and in practice what performance can you get out of it?
We provide full Prolog in our database. Prolog is the perfect match for triples, so I can do everything I can do in SPARQL plus a million things more with Prolog, if you're not afraid of Prolog. At some point we're probably going to have RIF or some other rule language. Right now we also support some SPIN; SPIN is another way to do simple rules. But if you want to do anything really complex, then Prolog is still the
best bet. And this afternoon I talked about a new in-memory architecture where... so let's, I'll talk to you later.
At this conference we talked a lot about scale-out and scale-up. Can you talk a little bit about the features that you have in the product that support that?
So that was what my talk at 12:30 was about. We're working on horizontal scaling that uses the principles of Hadoop to distribute the triples based on the hash of the first part of the triple, depending on how you index. And now we have a technology where we can take a SPARQL query and turn it automatically into some really fancy MapReduce. So you can write declarative statements; you don't have to write a bunch of Java code. You just specify your query and it turns into a pipeline. Now, this is something that's in research. We can show you demos if you want, but that's how we deal with that part. We also do federation, so you can have a bunch of triple stores and run queries against those triple stores. Of course that's not as fast as if you have a really distributed version.
They keep a list of the most important people around you, plus the frequency, so they can very quickly say who your most important people are. So it's just all in there, and the marketers can use it for whatever they want to do. Yeah, because one thing they found out is that if you can find out who the most important person is in a group of people, and the most important person buys something, there's a very high probability that at least two other people in the group will also buy the same product. So you want to be very nice to the most important people, for sure, because they take people with them. And if they churn, if they go away, then they take people with them. So it's very important that you know the social network around people. Plus you have these friends-and-family programs, and you can make plans based on the fact that if you get your friends into my network, then I will give you this discount. All right, well, thanks.