Okay, welcome. First of all, let me say that I'm alone here: although the talk was announced with two speakers, Jörg from Mesosphere and myself, he unfortunately couldn't make it, so you'll have to live with me.

Before I begin talking about Hadoop and big data, let me quickly say what I mean when I say graph. I don't mean the graph on the right-hand side, the plot of a function. I mean the mathematical structure on the left-hand side, where you have vertices, denoted by the capital letters and the circles, and you have edges, which point essentially from one vertex to another. As you can see, the vertices can have data associated with them (in this example just letters), and the edges can have data associated with them as well.

So where does data like this occur? Actually, in many places. Think about social networks, where the vertices are people and there is an edge between two people when they are friends. Or think about planning and dependency chains, where the vertices are jobs or tasks and there is an edge from one task to another if the first depends on the outcome of the second. Computer networks and their direct links can also be described by a graph. In science and research there are citations: a book or a scientific paper can be a vertex, and there is an edge whenever one cites another. But also hierarchies, trees, you name it, we have it. Essentially everything looks like a graph or can be described by graphs.

And actually, if you think about relations and relational data: a relation is nothing but a graph. If the relation is on a set, then the elements of the set can be seen as the vertices, and there is an edge from one vertex to another if they are in relation. So indeed, any relation is a graph, and therefore graphs really are everywhere. Not everything is a graph, but graphs are everywhere.

If you look at these examples, some of them have their edges directed, as in the example up there, while for others, for example friendship, you don't care about the direction; it's symmetric. So we call a graph undirected if the direction of the edges doesn't matter, and directed if it does. That's what I mean by graph.

And in many cases, you already have graph data, for example in Hadoop. You might have the vertices as one file in your HDFS and the edges in another one, in a suitable format. And then you can just use Spark or GraphFrames and run commands like these to ask for things in the graph. The one at the bottom, for example, looks for paths of a certain pattern in your graph. So I'm claiming that in many cases you are already sitting on a huge pile of graph data, and you can already analyze it.

So what am I here for? The idea is the following. In practice it turns out that, yes, sometimes you need global information about your graph and have to run huge analytic queries. But quite often you also want to run rather small ad hoc queries which only consider a certain part of the graph. I will show you examples in a second. And in these cases, you want to get your answer latency down from minutes to seconds, or sometimes even from seconds to milliseconds. You want to run many little queries, in an ad hoc fashion.
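To make the GraphFrames idea just mentioned a bit more concrete, here is a minimal sketch of such a query in PySpark. This is my illustration, not the slide: the file paths and column names are assumptions (GraphFrames expects an "id" column for vertices and "src"/"dst" columns for edges), and it presumes the graphframes package is available on the Spark cluster.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graph-demo").getOrCreate()

# Assumed layout: vertices.csv has an "id" column,
# edges.csv has "src" and "dst" columns.
vertices = spark.read.csv("hdfs:///data/vertices.csv", header=True)
edges = spark.read.csv("hdfs:///data/edges.csv", header=True)

g = GraphFrame(vertices, edges)

# Motif finding: look for paths of a certain pattern,
# here two hops a -> b -> c.
paths = g.find("(a)-[e1]->(b); (b)-[e2]->(c)")
paths.show()
```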
And for this type of application, for these types of questions, Spark and the huge engine it has to start for such a graph query are probably not the right answer. So let me give you examples of where you would want this. If you have a friendship graph, you probably want to find all the friends of one person. In a dependency chain, you might want the immediate dependencies of one item, or the reverse dependencies: what depends on that item? In citations, you want all citations of a paper, or you want to find which papers are cited by which other papers, and chains of these. Or in a hierarchy, you want to go downwards in a tree and find all the descendants, or all the ancestors, or something like that.

The unifying theme here is that you don't want something for all vertices; you want something in the vicinity of a single vertex. And usually, as in these examples, you have a system where you want to ask many such questions, and you want the answers to be quick. My point is that in this situation it is probably a good idea to use a graph database.

So let's talk about graph databases: what they are, what they can do, what's so special about them, and what opportunities we have. Well, graph databases store graphs in the sense of the word graph I just explained. But that's not the full picture. The crucial thing is that they can query graphs. Storing data can be done in files; what a database is good at is querying and finding interesting insights. And so it is with graph databases: they persist and store graphs, they can change them on the fly, but above all they can answer graph queries.

So what are graph queries, as opposed to other database queries? Typically, graph queries involve patterns for paths, or exploring the vicinity of a single vertex, or finding the shortest path between two vertices, or something like that. If you look at these, the first thing that strikes the eye is that they use only parts of the graph, so they look for local information. The second major ingredient is that you don't know beforehand how many steps you have to go in the graph. Think about shortest paths: you don't know whether there is a path of length one, say; you might need a path of length ten. So the two crucial things are: local exploration of the graph, and not knowing beforehand how deep you have to go.

So the question is: how does a graph database do this? I would like to explain the essential idea behind the execution of such queries, and that is graph traversals. And I can only reiterate that the crucial thing is that we don't know beforehand how many steps we have to take.

Good, let me show you a graph traversal. Here's a graph, directed in this case. The vertices are the circles denoted by the letters, and there are edges going all over the place. Now, if I'm interested in the vicinity of this vertex A, I can proceed as follows. I start at the vertex A, and down there where the little red arrow is, I collect a list of vertices I have discovered. At first I find myself at A, so I imagine that I am a little person sitting on this A, and I first look at my immediate neighbors. The graph structure, that is, the edges going out of A, tells me that the immediate neighbors of A are B and C.
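As an aside before the walkthrough continues: the procedure being built up here is plain breadth-first search. Here is a minimal Python sketch of it, assuming the graph is stored as an adjacency list of outgoing edges; the little example graph only approximates the one on the slide.

```python
from collections import deque

# Illustrative graph, roughly matching the slide: each vertex
# maps to the list of its outbound neighbors.
graph = {
    "A": ["B", "C"],
    "B": ["E"],
    "C": ["D"],
    "D": ["B", "F"],
    "E": [],
    "F": [],
}

def bfs(graph, start):
    discovered = [start]        # the growing list from the slide
    seen = {start}              # never visit a vertex twice
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbor in graph[vertex]:   # follow outgoing edges only
            if neighbor not in seen:
                seen.add(neighbor)
                discovered.append(neighbor)
                queue.append(neighbor)
    return discovered

print(bfs(graph, "A"))  # ['A', 'B', 'C', 'E', 'D', 'F']
```

In a real engine you would also track the depth (to cap the number of steps) and possibly the paths themselves, but the core loop is exactly this.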
So what I do is I take them in some order, for example alphabetical order, and I put them onto my list of discovered vertices. From A I have discovered B by now, so I put it next to the A, and then C, because those are the two neighbors. I might throw some out if I'm only interested in a certain pattern of edges, but here I just keep all of them. Once I have completed my work on vertex A, I simply go to the next one in the list, which is B here, and repeat the process. From B I again look for outgoing edges and see which other vertices I can reach with a single step. In this case it is just E, and I put it at the end of the list I already have. Now I'm finished with B. Note that the arrow between B and D points in the opposite direction, and I'm only going forwards this time, so my work on B is finished. I advance, go to C, and proceed as usual.

So you see, this is a so-called breadth-first search. Essentially, I start somewhere and look at all the possibilities for where I can go: first I deal with the start vertex, then with everything I can reach in one step, then with everything I can reach in two steps, and so on. If you watch this continue, you see I just move ahead like this and see where I get. Finally, I have exhausted whatever I want to find; maybe I have a limit on the distance, or I have enough results. Note that when I was at D, I did not put B on the list again: I don't want to visit vertices more than once. There are of course variations to this; if I'm looking for arbitrary paths of a certain pattern, I might have to revisit vertices. But the fundamental idea is always the same: start somewhere, go out to the neighbors, repeat. Essentially a recursive procedure. That's a graph traversal. And behind all the graph queries I mentioned earlier is essentially this kind of algorithm: a graph traversal, sometimes coming from both sides for shortest paths, sometimes going in one direction only, sometimes going against the arrows. There are a lot of variants (you can go depth-first or breadth-first, for instance), but the essential idea is the same.

Now, so far I've just talked about graph databases, but I have to admit that I am with the company ArangoDB, and we don't make just a graph database; rather, we make something we call a multi-model database. So let me deviate for a few minutes and say what I mean by that. A multi-model database, for me, is at the same time a document store for JSON documents, a graph database, and a key-value store. Now why is that interesting? Well, if you look at all the examples of graph data we have seen, then almost always you have some kind of data attached to the vertices, and sometimes data attached to the edges. Therefore it makes sense to store that data in the form of JSON documents: you store your graph as one collection of documents for the vertices and another collection for the edges, one JSON document per edge. That covers storing the data, but the crucial point, as in the graph database case, is that a multi-model database only makes sense, or rather only becomes more powerful, if you have a single query language which allows you to combine the document model, the graph model, and the key-value model. That is what I mean by the term multi-model database: one engine, one language, but three different data models, which can actually be mixed in a single query.
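To make the storage side concrete, here is a minimal sketch using the python-arango driver. The connection details, collection names, and document fields are my own illustrations, not from the talk.

```python
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("_system", username="root", password="")

# Vertices live in an ordinary document collection ...
patents = db.create_collection("patents")
# ... and edges in an edge collection, whose documents carry
# _from/_to references plus any other attributes you like.
cites = db.create_collection("cites", edge=True)

# Vertex data is just a JSON document (fields are illustrative).
patents.insert({"_key": "6009503", "title": "some patent", "year": 1999})
patents.insert({"_key": "5551212", "title": "another patent", "year": 1996})

# An edge is itself a JSON document pointing from one vertex to another.
cites.insert({"_from": "patents/6009503", "_to": "patents/5551212"})
```

The same documents are then reachable as key-value lookups by `_key`, as documents in ad hoc document queries, and as vertices and edges in graph traversals; that is the point of keeping the three models in one store.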
The selection of these three data models is deliberate. Other data models are not included (for example, time series or column stores), because these three fit together particularly well. And if you look at performance benchmarks, you can see that it is possible to build a multi-model database which can actually compete with specialized products. So there is no reason why such a multi-model database should be slower or less powerful than a specialized document store, and likewise it can compete with graph databases on their own turf. Therefore you can do the famous polyglot persistence, that is, use the right data model for the right part of your application, but you don't have to deploy multiple different data stores. You can still deploy different instances if you want, but you can use the same database, and if the data is in the same store, you can actually combine the different data models in the same query. In general you won't have one giant database in the middle of your project; you will have multiple deployments for the various corners and the various microservices. Anyway, it is actually surprising how many use cases really benefit from having multiple data models in the same store at the same time.

This is actually a trend: more and more data stores which used to be specialized with respect to their data model are broadening their view and adding data models. Many relational databases add JSON data and indexing for it. Graph databases add document facilities and secondary indexes. Document stores add graph facilities, and so on. So a lot is happening, and what essentially distinguishes ArangoDB is that we have done this combination of the three data models from the start, and therefore we have a single language which allows powerful combined queries. It's called AQL, and it's similar in spirit to SQL but tailored towards this combination of data models. JSON is very natural in it, and expressing mixed queries across the different data models is very comfortable. There are transactional semantics, at least on a single server. You can do joins: many document stores say joins don't scale, so we don't offer them at all. Okay, yes, some joins are difficult to scale, but many work well, and therefore our language supports joins. And obviously you can do graph queries, so the graph database aspect I started with is fully there; it is a graph database, just with additional capabilities. AQL is independent of which language driver you use, and it has some built-in protection against the infamous injection attacks, because queries and their bind parameters are kept separate (you will see bind parameters in the examples below). I should point out it is not SQL, because we figured that SQL is good for tables and relations but would have to be extended beyond recognition to properly support graphs, graph traversals, and these special queries.

So far I've talked about single servers and clusters. These days one should also mention cluster orchestration systems like DC/OS or Kubernetes, because the days when it was complicated to deploy a distributed application should really be over. What you want is an operating system for your cluster of machines or your cloud instances, so that you can deploy a distributed database or a distributed application as easily as you install an app on your phone.
And that is actually the case: these orchestration systems, be it DC/OS, Mesos, Kubernetes, or Docker Swarm, really allow this. They manage your resources for you, similar to how the operating system of a laptop manages memory and disk for multiple processes. You get significantly better resource utilization, because the system deploys stuff in whatever corner of your cluster is still free. And you get fault tolerance, self-healing, and automatic failover almost as a side effect. Obviously the application has to be built for that, but if you scale out and have a distributed system, there is no point in not making it fault tolerant and self-healing, because in a large installation things will go wrong essentially every night. Currently you can deploy ArangoDB on plain cloud instances or bare metal, you can easily deploy it on Mesosphere DC/OS, you can deploy it on Apache Mesos clusters and on Kubernetes, and we are continuously improving these integrations.

Now, I'm claiming that a system like DC/OS is exactly the right thing for the game I announced in the beginning. I claimed earlier that you already have graph data; now I'm saying you already have multi-model data. Analytic engines like Spark are good for the big analyses, for streaming and everything, but if you need these small ad hoc queries, be they graph queries or other multi-model queries, then it's probably a good idea to deploy a database for that and extract part of your data, or all of it, into that database, to be able to run those little ad hoc queries with very low latency. And such an orchestration system is the perfect environment for that. You don't only deploy your database; you deploy Hadoop, you deploy Spark, you deploy your other applications, and you get service discovery, resource management, good utilization, and so on, essentially for free, and you can plug things together. So, I claim, you can run HDFS, Spark, and ArangoDB on the same Mesos or Kubernetes cluster and start playing.

Just to make this point a bit more explicit, let me show you a few examples. This is not live, but these are simple commands. The first two lines are a way to get two files out of your HDFS as comma-separated values. In this example they describe patents, all patents since the 1950s, and the citations between them: there is an edge in the citations file whenever one patent cites another. So that gives you your data. The third line is an example of how you deploy a distributed ArangoDB cluster in the DC/OS environment: you just say "package install" and so on, and you get a minimal installation which you can later scale up to your needs via the user interface. And then there are just two commands to import the data. My point is really that in such an environment, with these tools, you deploy things with a single command, like a click on your phone. You plug them together, take the data here, move it over there, and five, maybe ten, minutes later you are in a position to execute ad hoc queries on your data. Now let's look at some of those, just to give you a feeling for what kind of graph queries you can then do in an ad hoc fashion.
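The first query on the slide, reconstructed as a sketch you could run through the python-arango driver. The patent number, the step range, and the graph name G are from the talk; the driver usage, connection details, and the patents collection name are my assumptions.

```python
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("_system", username="root", password="")

# Traversal: everything reachable from one patent in 1 to 3 steps
# along outbound citation edges of the named graph "G".
query = """
FOR v IN 1..3 OUTBOUND @start GRAPH "G"
  RETURN v
"""

# Bind parameters keep user input out of the query text; this is
# the injection protection mentioned earlier.
cursor = db.aql.execute(query, bind_vars={"start": "patents/6009503"})
for patent in cursor:
    print(patent["_key"])
```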
So this, for example, is the Arango query language. It essentially says: start at the patent with number 6009503, use the graph G which we just imported, follow the edges in the outbound direction, and give me everything I can reach in one, two, or three steps. So we start with one patent and tell the database to do a graph traversal, but to go only one to three steps, and to give me all the patents it finds. I get all the patents cited by that one, the ones cited in two steps, and in three steps. This, for example, gave me 500 results, and the execution time was 317 milliseconds. That's a bearable latency, and obviously you can execute several of these at the same time for some application. That is what I mean by decreased latency for ad hoc queries. Imagine an application that builds a web page where you can say: here is a patent, give me everything it cites, up to three steps out.

This next example is also interesting; it shows how to do pattern matching. Here I'm interested in other patents that cite the same patents this one cites. If I am looking for similar patents, I probably want to see what this one cites and then go backwards: what else cites those? So how do I do this? The first line down here says: go one step outwards, that is, along a citation, and from there one step backwards. Then I filter out the original patent, because going one step forwards and then backwards always returns the starting point as well, and I return the rest. This executes in 59 milliseconds. So that's quick.

Maybe two more. This one finds all the patents that cite this one, that is, it goes backwards, in up to two steps; it just uses the inbound direction and returns the keys of those patents. And in the last one, I might be interested in everything that is related to this patent, everything which over time cited it. So I take an old patent and see what somehow arose from it. In this case I say: up to 10 steps, go backwards, and just count the number of results.

I don't have many more examples, because I'm also nearly out of time, but the point I want to make towards multi-model here is: once you have imported your data into ArangoDB, you have this language, you can do your graph queries, but in the same query you can combine them with other queries. Say, for example, you look for patents and then you notice you need some data joined to each result. Then you just continue in the query and do your join operation right there, in the same query (see the sketch below). That is the strength of the multi-model idea when combined with graphs.

To finish off: I've made the point that your data is already sitting there, and maybe it doesn't change that much, so you can extract a subset, or maybe everything, into a database and then have the possibility to do ad hoc queries. Sometimes you can take this one step further. You can say: okay, I do want to do Spark analyses, but maybe I want to keep the primary copy of my data in a transactional system, where I can apply changes in a transactional way. Then you can just turn the arrows around: you have the graph database, or the multi-model database, as your primary data store, you keep your graph there and maintain it there, and you regularly dump it out to HDFS.
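Here is the kind of mixed query just described: a traversal combined with a join in one AQL statement. A sketch only; the assignees collection and the fields used in it are invented for illustration, and the setup lines are the same as before.

```python
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("_system", username="root", password="")

# Graph model and document model in one query: traverse the
# citation graph, then join each patent found against a plain
# document collection ("assignees" and its fields are invented).
query = """
FOR v IN 1..2 OUTBOUND @start GRAPH "G"
  FOR a IN assignees
    FILTER a._key == v.assignee
    RETURN { patent: v._key, assignee: a.name }
"""
cursor = db.aql.execute(query, bind_vars={"start": "patents/6009503"})
```

And with that, back to the setup where the database is the primary store and HDFS receives regular dumps.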
In this environment of a cluster operating system, you have this option as well. So you might dump to HDFS every two hours, or every night, and then immediately run your Spark job on that, and you have achieved the same result: you can do the big analyses and the quick, short ad hoc queries, and you have a lot of possibilities. ArangoDB also has a Spark connector, so you can even run Spark jobs directly on top of the data in ArangoDB.

That's essentially the end of my presentation. I've collected a few links in case you want to follow up, and thank you very much. I'm happy to answer some questions. And I should say I have put a few Arango t-shirts and stickers down here; please help yourself if you want. Here's a question. Do we have a microphone? Otherwise I'll just repeat it. No time for questions? There's no time for questions. Please see me right after the talk, I'm still here. Thanks.