My name is Ashwin. I am a co-founder of a small company called New Hygiene, and we work mostly in customer engagement. We are going to be talking about graphs, so I'll take a quick survey before we start. How many of you use a graph database in day-to-day programming, say Neo4j, Orient, InfiniteGraph, DEX or something similar? I see one hand. And how many of you use something like MySQL, or a caching layer such as Memcached, Ehcache or whatever? Almost everyone. So what usually happens in everyday programming is that not many people figure out that they need a graph model. That happens because, when you're building an application, say a web application of some sort, you don't really discover the graphs in your data. Now, how many times has it happened that you started off with a technology stack, say some kind of relational or NoSQL database with a caching layer and Ruby on Rails or whatever is in fashion, and then you realized there's a graph in your data? Has that ever happened? A show of hands, please. Again, very few people. My talk is about building, analyzing and visualizing large graphs. I have crossed out building and analyzing because I didn't know that this slot was going to be half an hour. Since this is a visualization track, I will talk mostly about visualization. But if you have any questions about things like storing graphs in regular databases and caching layers, maybe in Redis, you can always ask me. I'll be around afterwards and we can talk about it. So we'll talk about visualizing large graphs. How many of you attended the talk this morning about visualizing text? There was a visualization where the speaker would click around and things would pop up. That was an example of a graph visualization. Most people don't see it that way.
For example, in any large company you have organizational hierarchies. You have family trees. You have graphs of varying sizes, sometimes small, sometimes really big. Generally, the visualization you see on the poster or the sticker which Neo4j has given you looks something like this. Agreed? There are many ways of visualizing graphs and many techniques, so I'll start with the most basic things you need to know. This is not a Neo4j talk or a toolchain talk. It is geared more towards people who want to build this themselves, or want to start thinking about building it themselves, because most of the time the available toolkits are not what you need. Perhaps you want to embed it in your website. You want to do something with D3.js. You want to build something on Raphael. Or you want to build a graph on Neo4j using Python. So this is more of a talk on how to build these things. If you want to talk about some of the existing frameworks, we can always do that later. The first question you would ask is: why do you want to visualize a graph? For me, the most important thing is that it reinforces your cognition. You look at data in a certain way. If I tell you about your family tree, you picture it in a certain way. Unless you see the classic hierarchical structure of father, sons and grandsons, you don't really get what a family tree is like. If you want to know what the tree part of a family tree is, you have to look at that visualization. It is so familiar to us because families are always shown as trees. If you want to understand relationships better, it is always easier for us as humans to understand things visually. If I give you a matrix that says flight A connects to flight B, London to Calcutta, and it is a matrix of, say, a thousand flights, you would not make much sense out of it.
But the most popular visualization you would see is a globe with all the airline routes connecting across it. You want to reveal hidden attributes which are otherwise very difficult to discover. You can look at a matrix, a list, numbers, a database table, a NoSQL table, whatever you want, and you will not find the attributes which are easy to spot when the data is visualized. And of course it's easier to explore data. And of course you can create really cool-looking posters. You can take a printout of your awesome visualization, put it up at the front of your company and say, we work on this. So you can do some really good things. Some of this was covered by the speaker on Mayavi, who was talking about graphics primitives. When you take a graphics library, most of the time you have to draw circles, draw lines, set a background, set a scene. How many of you attended the talk on Processing this morning? The speaker must have covered the fact that you have to set up a background and scenes, set up circles and ellipses, and draw lines between them. Our job, when we look at a graph and try to visualize it, is to abstract that layer away. You will not think in terms of circles, spheres, ellipses, lines and Béziers. What you want is an algorithm which in turn can draw circles and lines on some graphics library. Which library does not matter. So, coming back, you can use whatever visual library you want. This talk is more about how you build the algorithms behind it. And you want to run the simulation in a reasonable amount of time. If you say that you want to visualize someone's graph and tell him to come back after six hours, he's not going to come back. He's going to go to sleep. He's going to go away.
So you want to do it in a reasonable amount of time. Maybe you want to do it on your laptop. You could run it on a cluster of machines, but you would still expect a result in reasonable time. Let's look at some of the applications. Graph visualizations are used in exploratory data analysis. The best case you saw this morning: the speaker was clicking on nodes and things were opening up. That is exploratory data analysis. Social network analysis: Facebook has 900 million users, many of them connected, and the average user has around 200 friends. How do you analyze something like this? Cartography: maps, geographical information, aircraft routes, things like that. Bioinformatics, of course. Organizational relationships, and many more. So graphs occur all the time in your data. If you look for them, you'll find them. Suppose I ask you to build a relational model for an organizational hierarchy where 10 people report to one person, who in turn reports to someone else, who probably reports to two more people, who in turn report to the CEO. It's a little difficult. It's difficult to think about, difficult to analyze, difficult to program and difficult to query. Whereas when you think about it as a graph or a tree, it makes life much easier. It's very easy to think about. You can actually draw it. You don't have to think much. The challenge comes when it becomes big enough that it's difficult to draw. So I'll go very quickly over some of the most basic concepts you need to know when you're talking about actually drawing a graph. These apply whether you're drawing by hand or by machine. A drawing is a pictorial representation of vertices and edges. Vertices and edges are usually shown as dots and lines. That is the most accepted way of showing them. There are other ways, but we won't go over them right now. The crossing number is the lowest number of edge crossings.
So if you are drawing a graph and I ask you to draw lines between four points, every place where two edges cross counts as a crossing. The lowest number of edge crossings over all possible drawings of a graph is known as its crossing number. These definitions are not very important on their own. They will become clear as we move on. The bounding box of a graph drawing is the smallest area in which all the points lie. So if you're looking at a 1024x768 screen and you want to draw a 100,000-node graph there, all your graph points lie within 1024x768. That is the bounding box. I'll show you some simulations in which the bounding box is very clear, with things going out of the bounding box and staying inside it, and how we deal with that. We'll talk about it a little later. A planar drawing is a drawing in which none of the edges intersect. You would want to draw things so that edges don't intersect. For example, you can draw a family tree where the grandfather is on top and the latest generation is at the bottom. You would try not to overlap. You could still draw it with overlaps, but it's more confusing to the user. So when you talk about drawing a graph, in any case, you want to reduce the number of edge intersections. You want fewer of them because otherwise it looks like a mess. In any drawing you see, the edges are usually separated and drawn wide apart. So how do we do that using a machine? A very important part of this, and the whole point of doing a graph drawing, is aesthetics. You want to show something pretty. You don't want to show a blob and say that this is your data. You want to show something nice which makes sense. So you have to look at symmetry: how does it look from all directions? And, let me remark once again, avoid edge crossings. Try to reduce edges crossing over each other as much as possible. Try to have straight-line edges, because in a graph drawing the edge usually conveys a meaning.
Edges don't have to convey a meaning, but they usually do. If I draw a graph of me, my friend, my brother and a distant relative, I would try to put that relationship into the length of the edge or the thickness of the line itself. So we avoid bends in edges, because a bend makes the edge longer and that adds an unintended meaning to it. You try to avoid bends and convey the relationship more directly, using the thickness or the length. Bends also confuse the user. You try to keep edge lengths uniform, so that if you are drawing a lot of nodes you don't have some of them flying away and some of them sitting close by. So you try to have edges of similar length. And you want to distribute your vertices uniformly. Suppose that in a large bounding box, say a 1024x768 screen, almost all your vertices lie in one corner and two of them are in another corner. That conveys false information. It suggests that there's a cluster. There might be one, there might not. It is not an accurate representation of what you want to draw. Unless you want to show a cluster, you don't want to draw nodes close together, because that's how we think: if I show you a graph and some things are close together, you would associate a degree of closeness with them and say maybe these are associated somehow. So try to avoid those things. Now, this is a generic graph model. If you are programming and you find that there's a graph in your data, you could use this model. It is the most widely accepted graph model among the graph databases currently in use, such as DEX or Neo4j. It is called the property graph model. I'm sure you have heard of it. It's rather simple. Your nodes and your edges are essentially collections of key-value pairs. It doesn't make an assumption about the directionality of the graph, and it doesn't impose a strict rule on how the graph should be laid out.
The only restriction is that the nodes are key-value pairs and they are connected by edges which are also key-value pairs. So I can say that my colleague and I are co-founders, and that we are connected by a link which is our common history of education, and I can add attributes to it. On the node I can add my name, my company, my previous history. I can do the same for the edges. I can build a fairly large graph using this, and then I can slice and dice it whichever way I like. Most of the graph databases you may have used, Neo4j, Orient and DEX, all follow this model. Again, this is a very generic model. So if you ever encounter a graph in your data, you could probably model it using this. If there are any questions about how you would do this in actual programming, I would like to take them. You want to know how you actually build this. What you can do is take a programming language of your choice, say Java or Ruby, and model these as objects. Agreed? You can have some kind of linked-list implementation of all the nodes, or whatever structure you choose. You can have an implementation of your edges. And then you can have the links between them. With this model you can hold fairly large graphs even on a machine with 6 GB or 8 GB of RAM. You can go up to 200,000 or 300,000 nodes, assuming that you're not storing too much data on each one. You can't say that Ashwin is one node and then attach his photo to it in binary format. I'm not talking about that. I'm talking about getting fairly large graphs which you can experiment with on your own system. If you want to actually persist it, having first built it in memory, you can look at things like Redis. It has sets and other data structure implementations, and you can persist using those.
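To make the "model these as objects" idea concrete, here is a minimal in-memory sketch of a property graph in Python. It is an illustration under assumptions, not the speaker's code: the names `Node`, `Edge`, `PropertyGraph`, `add_node`, `add_edge` and `neighbours` are all hypothetical, and dictionaries are used in place of the linked lists mentioned in the talk.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    node_id: int
    # Arbitrary key-value pairs describing the node.
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: int
    target: int
    # Edges carry key-value pairs too, exactly like nodes.
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical in-memory property graph; direction is not enforced."""

    def __init__(self) -> None:
        self.nodes: Dict[int, Node] = {}
        self.edges: List[Edge] = []
        self.adjacency: Dict[int, List[Edge]] = {}

    def add_node(self, node_id: int, **properties: Any) -> Node:
        node = Node(node_id, dict(properties))
        self.nodes[node_id] = node
        self.adjacency.setdefault(node_id, [])
        return node

    def add_edge(self, source: int, target: int, **properties: Any) -> Edge:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        edge = Edge(source, target, dict(properties))
        self.edges.append(edge)
        self.adjacency[source].append(edge)
        if target != source:  # avoid listing a self-loop twice
            self.adjacency[target].append(edge)
        return edge

    def neighbours(self, node_id: int) -> List[int]:
        # Treat every edge as traversable in both directions.
        return [e.target if e.source == node_id else e.source
                for e in self.adjacency[node_id]]


graph = PropertyGraph()
graph.add_node(1, name="Ashwin", role="co-founder")
graph.add_node(2, name="Colleague", role="co-founder")
graph.add_edge(1, 2, relation="co-founders", history="common education")
```

Keeping only small scalar properties on each node and edge, as the talk suggests, is what lets a structure like this hold a few hundred thousand nodes in memory.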
You can go another level and persist it on any kind of NoSQL system. I'd love to talk about it, but we'll do that later. So how do we start visualizing this? The most accepted model of graph visualization, where you see circles and lines connecting them in a pretty-looking picture, is usually drawn using one technique. It's called the force-directed technique. I'd like to know how many people have used it, know about it or have implemented it. So I'll just go over it. You essentially model your graph as a physical system, with your edges as springs which pull the connected nodes together and your nodes as electrically charged particles. So you associate a charge with each node and a spring with each edge, and you run the simulation. To model this you can use anything you want. I will show you some simulations I wrote using Processing. We can have a look and talk about when this model works. For those of you who have worked on this and know the force layout model: what do you think is the general running time of these algorithms? Or, how many nodes can I draw on my machine in a reasonable amount of time? So you did it in a browser; there's one more consideration there, and I'll come to it later. But if you look at the running time, the way you model this is that you take every single node in your graph and calculate the forces between it and every other node, and then you do the same for the springs. So if you're working with n nodes, you go over the nodes at least twice, one loop inside the other. The minimum you can do is n squared, and you have to repeat that many times. So if you have 100,000 nodes, you have to do it a very large number of times even to get a decent layout.
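The spring-and-charge model just described can be sketched as follows. This is a deliberately naive version written for illustration, not the speaker's Processing code: the function name `force_layout` and every parameter value (`repulsion`, `spring`, `rest_length`, `max_step`) are assumptions chosen only so the example settles. Repulsion follows an inverse-square (Coulomb-style) rule and edges follow a linear (Hooke-style) spring rule.

```python
import math
import random


def force_layout(nodes, edges, width=1024.0, height=768.0, iterations=200,
                 repulsion=2000.0, spring=0.05, rest_length=60.0,
                 max_step=10.0, seed=0):
    """Brute-force force-directed layout; returns {node: [x, y]}."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(0, width), rng.uniform(0, height)] for n in nodes}

    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}

        # Charged particles: every pair repels. This double loop is the
        # n-squared cost discussed in the talk.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = max(math.hypot(dx, dy), 0.01)
                f = repulsion / (dist * dist)
                force[a][0] += f * dx / dist
                force[a][1] += f * dy / dist
                force[b][0] -= f * dx / dist
                force[b][1] -= f * dy / dist

        # Springs: each edge pulls its endpoints towards rest_length.
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            dist = max(math.hypot(dx, dy), 0.01)
            f = spring * (dist - rest_length)
            force[a][0] += f * dx / dist
            force[a][1] += f * dy / dist
            force[b][0] -= f * dx / dist
            force[b][1] -= f * dy / dist

        # Move each node, limiting the step and staying in the bounding box.
        for n in nodes:
            fx, fy = force[n]
            mag = math.hypot(fx, fy)
            if mag > max_step:
                fx, fy = fx / mag * max_step, fy / mag * max_step
            pos[n][0] = min(width, max(0.0, pos[n][0] + fx))
            pos[n][1] = min(height, max(0.0, pos[n][1] + fy))
    return pos


layout = force_layout(list(range(5)), [(0, 1), (1, 2), (2, 3), (3, 4)])
```

Each iteration costs on the order of n squared force calculations for the repulsion plus one per edge for the springs, which is why this approach degrades quickly as the graph grows.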
That is partly why, when you try to do it in a restricted environment like a browser, it crashes. It's not just the amount of time it takes to run; there are other considerations in a browser, because you are not allowed to run long scripts there. Memory runs out. The browser suspects that something is malicious: you're trying to load, say, 50,000 points, and all your script does is go over them again and again, so the browser thinks there's something fishy going on and throws you out. That's why most of these drawings don't work. Now, there are a couple of implementations available. There's a very popular toolkit called D3.js. Have you used it? Most people have heard of it. It's fairly popular nowadays. It has a very beautiful force layout algorithm, but that will peak out at around 2000 nodes, beyond which it won't be able to give you a proper picture. So let me run through a couple of simulations, and we'll see how the performance and the aesthetics deteriorate as the number of nodes grows. What's good about force-based layouts? They are generally nice to look at. It looks like this. This one is from D3.js. It's fairly beautiful. It's simple to implement, so you can try it out. There are algorithms available everywhere. You can use any kind of forces. You can come up with your own force, saying that two bodies close together attract with a coefficient of 20. It doesn't matter. You can think of something and implement it. The most commonly used are Coulomb's law and Hooke's law. It's intuitive. When you look at it, you know that this thing is joined to a lot of other things, which are joined in turn to something else. It's commonly available because there are a lot of ways to implement it. It's easily parallelizable in principle, so there are a lot of ways we can parallelize it and then it can scale. But a parallel force-based layout is non-trivial to implement. It's kind of difficult to do.
The bad, of course, is the high running time. The running time depends entirely on the number of nodes you're trying to draw. You might start off with a small graph and lay it out, and then, when you explore and new nodes come in, the number of nodes increases and the performance deteriorates. So you start off with, say, 10 nodes, add another 10, add another 100, and you will see a fairly large degradation in performance because it's an n squared algorithm. I tried to do a simulation, which you can see here. This is a layout which has been computed. It's still running, actually. For a small graph like this it's okay. It seems pretty intuitive and it's easy to read. So what happens when you try to do it for a larger graph? I'll run one which has around 2000 nodes, and this is the dumbest algorithm I could come up with. This is what starts happening. It is trying to compute the forces so many times that it can't actually lay the graph out, and because of the large number of particles around, it tends to ripple a lot more. In the end it uses a large amount of memory. That's okay, but you will not be able to see anything legibly. Suppose you want to do an exploration. You want to see how some words occur with other words. What you could do is take all the words in a book, removing all the stop words of course, and put them onto a graph showing which word occurs with which other word. Then you could explore. Lakshman occurs a lot with Inderjit. Inderjit occurs a lot with Ravan. So there are a lot of places where these graphs can be used. And a visualization like this helps because you can see Ram, Lakshman, Inderjit and so on. There are hundreds of applications you can talk about. So what do you do when you have really large graphs and your force layout algorithms don't work any more? What you can do is look at other fields where such simulations are used.
One of them is n-body simulation. I'm not an expert at this, but it is used essentially for really large-scale computations like galaxy formation. There are thousands and thousands, actually millions, of objects, and you want to know how they interact in terms of gravity and things like that. It's orders of magnitude faster than the brute-force approach. So if you actually want to draw a large graph in a browser, you should use one of these simulations rather than a brute-force approach. This is an image from a project called the Millennium Run. It's a large n-body computation that keeps running, and then these emergent patterns appear. This looks like galaxies forming. It looks like a star, it looks like planetary bodies forming, and a bunch of other things. Now, how do you take an n-body algorithm which is meant to simulate galaxies and apply it on your computer to graphs of things like employees, friends and friends of friends? What you can do is consider every node as an object with a very large mass. You would usually start with something like: person A is an object of mass 10,000 kg. In the simulation, not in reality. You treat things which lie relatively close to each other as a single object. If you have a group of things really far away, you calculate their center of gravity, consider them as one large object, and then run the simulation. These are the calculations. This is how the center of mass of a set of objects is given: the x and y coordinates weighted by mass, divided by the total mass. I'll come to a simulation and you'll understand this better. Okay, so it's really complicated to build initially, but the performance gains are really big, so you should really look into it. You should use a quadtree or an octree to insert your nodes. This is a fairly standard algorithm, and you can find it online. A quadtree is a tree in which each internal node has four children.
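The quadtree and center-of-mass idea can be sketched as below. The talk does not name a specific algorithm; this sketch follows the approach commonly known as the Barnes-Hut approximation, which is an assumption on my part, and the names `QuadTree`, `insert`, `force_on`, `theta` and `MIN_SIZE` are hypothetical. Each cell keeps a total mass and a center of mass, where the center is the mass-weighted average of the coordinates, x_cm = sum(m_i * x_i) / sum(m_i), and likewise for y. Bodies are assumed to lie inside the root square.

```python
import math

MIN_SIZE = 1e-6  # cells smaller than this merge coincident bodies


class QuadTree:
    """Square cell with top-left corner (x, y) and side length `size`."""

    def __init__(self, x, y, size):
        self.x, self.y, self.size = x, y, size
        self.mass = 0.0          # total mass inside this cell
        self.cx = self.cy = 0.0  # center of mass of this cell
        self.children = None     # four sub-cells once the cell is split

    def _child_for(self, px, py):
        half = self.size / 2.0
        index = (1 if px >= self.x + half else 0) \
            + (2 if py >= self.y + half else 0)
        return self.children[index]

    def insert(self, px, py, m):
        if self.children is None and self.mass == 0.0:
            pass  # empty leaf: the body simply lives here
        elif self.size > MIN_SIZE:
            if self.children is None:
                half = self.size / 2.0
                self.children = [
                    QuadTree(self.x + dx * half, self.y + dy * half, half)
                    for dy in (0, 1) for dx in (0, 1)]
                # Push the single body this leaf held down one level.
                self._child_for(self.cx, self.cy).insert(
                    self.cx, self.cy, self.mass)
            self._child_for(px, py).insert(px, py, m)
        # Update the aggregate mass and the mass-weighted center.
        total = self.mass + m
        self.cx = (self.cx * self.mass + px * m) / total
        self.cy = (self.cy * self.mass + py * m) / total
        self.mass = total

    def force_on(self, px, py, theta=0.5, g=1.0):
        """Approximate pull per unit mass at (px, py) from this cell."""
        if self.mass == 0.0:
            return 0.0, 0.0
        dx, dy = self.cx - px, self.cy - py
        dist = math.hypot(dx, dy)
        far_enough = dist > 0.0 and self.size / dist < theta
        if self.children is None or far_enough:
            if dist == 0.0:
                return 0.0, 0.0  # the body itself exerts no force
            f = g * self.mass / (dist * dist)
            return f * dx / dist, f * dy / dist
        fx = fy = 0.0
        for child in self.children:
            cfx, cfy = child.force_on(px, py, theta, g)
            fx, fy = fx + cfx, fy + cfy
        return fx, fy


bodies = [(10.0, 10.0, 1.0), (90.0, 10.0, 1.0), (50.0, 90.0, 2.0)]
tree = QuadTree(0.0, 0.0, 100.0)
for body in bodies:
    tree.insert(*body)
```

A larger `theta` treats more distant groups as single objects, trading accuracy for speed; `theta=0` falls back to the exact pairwise sum. For a graph layout the sign of the returned force would be flipped so that nodes repel rather than attract.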
The goal is to divide your space, in the sense of the physical space in your browser or your layout engine, into squares, into regions where each region holds only one node. I'll show a simulation. I will show what an n-body simulation looks like and how you make a graph simulation out of it. This is what a general n-body simulation looks like. You see all these things flying around, and these are the boxes I was talking about, where you want to put each node. This is my bounding box, and you can see things getting out of it. In the end you would have one large box with one point in it, and everything else would already have escaped. So you run the simulation. Now, this doesn't look like a graph simulation. It looks like things just bubbling around, dots moving about. So you want to make this into a graph simulation. To do that you do something very simple: you remove the boxes and put lines between the points. Then it becomes a graph simulation. This is a fairly large graph, the exact same one which the force layout algorithm took such a long time to draw. Using this method, it is able to draw it. Things are running out of the bounding box, which is okay. If you zoom into it, you will see a fairly large spread in this graph. These are the individual nodes. The only difference between this and the last simulation is that the boxes have been removed and the lines have been added. Now you can annotate the nodes. You can say that this is a friend, this is a person, and if you look at it from a distance you will see a fairly nicely laid-out graph. So my time is over. One last thing: you don't need to do most of these things yourself if you are doing it for experimentation. You can look at Gephi. It's a fairly powerful tool for doing all these things.