[unintelligible] It is a dodecahedron. And I'll help you: the name dodecahedron comes from dodeka, which means 12. So it has 12 regular pentagons, which build its faces. [unintelligible] What remains is going to be the space diagonals. So I have 20. I know from graph theory that 20 edges, sorry, 20 nodes, 20 vertices can be connected with each other by n times n minus 1 divided by 2 edges, so that is 190 connections in total between the vertices.
We have 12 faces and every one is a pentagon, so 5 times 12 is 60, but every vertex and every edge is counted on more than one face. Each vertex belongs to 3 faces, so 60 divided by 3 is 20 vertices, and each edge belongs to 2 faces, so 60 divided by 2 is 30 edges. 190 minus 30 is 160. A pentagon has 1, 2, 3, 4, 5 vertices, so with the same n times n minus 1 divided by 2 equation, minus its 5 sides, it has 5 diagonals, and 5 times 12 is 60 face diagonals. 160 minus 60 is 100, and we got the right answer. It is really very fast if we use graph theory, but you don't necessarily have to use graph theory to answer this question.
There are lots of problems in mathematics, in business, and in physics, and mathematics is the language we use to formalize them. Graph theory is one such language. When I was speaking it, the words I used were vertices, edges, and fully connected or complete graph. But you can use other languages and solve things with other methods. Probably the most popular is calculus: if you want to solve something with machine learning, the first thing you have to learn is calculus and how to do optimization. But graph theory is also very useful. We just saw that with its help we solved this problem that fast, while answering the same question with geometry, set theory, or algebra would probably be a little more complicated and take a little longer. By the way, did you use the same idea? Not quite the same, but the same item.
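The counting argument for the dodecahedron can be written down in a few lines of Python. This is only an illustrative sketch: the function name and its parameters are assumptions made here, not something shown in the talk, and it only applies to solids whose faces are all the same polygon.

```python
from math import comb


def count_space_diagonals(faces, sides_per_face, faces_per_vertex):
    """Count the space diagonals of a solid whose faces are identical polygons.

    All names are hypothetical, chosen for this sketch only.
    """
    incidences = faces * sides_per_face           # 12 * 5 = 60 for the dodecahedron
    vertices = incidences // faces_per_vertex     # each vertex is shared by 3 faces: 20
    edges = incidences // 2                       # each edge is shared by 2 faces: 30
    all_pairs = comb(vertices, 2)                 # complete graph, n * (n - 1) / 2: 190
    diagonals_per_face = comb(sides_per_face, 2) - sides_per_face  # pentagon: 10 - 5 = 5
    face_diagonals = faces * diagonals_per_face   # 12 * 5 = 60
    return all_pairs - edges - face_diagonals     # 190 - 30 - 60 = 100


dodecahedron = count_space_diagonals(faces=12, sides_per_face=5, faces_per_vertex=3)
print(dodecahedron)  # 100
```

The same function gives 4 for a cube (6 square faces, 3 faces meeting at each vertex), which is a quick sanity check on the counting.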
OK, and then a long, long presentation about graph theory should come: the history, the definitions, who invented what, how it was used, and the different applications where graph theory is very useful. But we don't have the time for that. So if you are really interested, I can suggest an amazingly good book, Network Science by Albert-László Barabási. By the way, he is also Hungarian, like myself, and there are a couple of other Hungarians in graph theory, and probably this is because of the language.
When I go to a client and say that I do graph theory, that I solve problems with the help of graphs, everybody thinks that I'm an expert in Excel and can do beautiful charts, and that with those beautiful charts it is possible to solve all the problems in the world. So "graph" is a synonym for "chart", but it has another meaning: a mathematical graph is a structure built up from nodes and edges. In Hungarian we have two words. If I say graph, nobody thinks about a chart; everybody, well, let's say 90%, knows that I'm talking about the mathematical concept from graph theory. If I want to say chart, I use a different word. So at a very early age Hungarians learn that graphs are not charts, they ask what graphs are, then they start to learn about graphs, and then they become scientists.
So I really recommend this book. It's an amazing book, with a lot of nice videos and the whole story of how the whole thing started. Maybe I'm going to show you this video. Let's just check the network speed. OK, so I think we are going to have a small problem with the network. I'm testing this because we are also going to access our tool through the network, so if things are slow, that might be because of the network. Let's just stop it.
And let's go back here. So I'm not going to tell you the graph story, and I'm not going to give you all the definitions, but I will still tell you a couple of things that are necessary for moving toward our goals. So what are our goals today? What do I want to achieve with you? My main goal, and maybe yours is the same, is to introduce you to LynxKite, a big graph analytics tool that my company built for data scientists and data engineers to solve the kind of problems that can be formulated in this graph language. Typically that is for the telecom sector (mobile or fixed line), advertising, banks, and insurance, but you can use the same concept and the same tool for biology, space research, and other problems.
What we wanted to achieve with LynxKite was to build something super scalable, because very large graphs can be problematic to work with. So we built our tool on top of Hadoop with the help of Apache Spark, and the programming language was Scala; all the algorithms were written in Scala. But we will not see anything, or not too much, of those things today. I hope I will not make any mistakes. You are just going to see a nice web UI through which you access the tool. We are going to log in to, I think, AWS, where we have about 12 nodes serving you, and hopefully we are not going to have too much latency. I have never tried the tool with so many people at the same time.
So that is goal number one: to find out how nice it is to work with a graph analytics tool, how easy it is, and how we can do some research with it. I hope everybody has their laptop. If you have a mouse, I recommend using it, because it's easier with a mouse. And you need an internet connection, obviously; the connection details were shared earlier.
You all have an internet connection, right? OK. My second goal is, let me put the password up there. OK, so my second objective is to teach you a little about how nice graphs are and from how many different data sources you can build them. The data doesn't have to be something that is obviously a graph, like LinkedIn, for example. It can be something quite different, and it might still be useful to analyze it as a graph. My third objective is to teach you a couple of descriptive analytics for graphs: what the most important properties of a graph are, and what you should check if you have the time and ability, because if you know those properties, you know much more about the problem you are trying to solve. This is also going to help us understand why big graph problems are big data problems. Any questions? OK. When you hear about typical big data problems, you never hear this graph story, right? You hear about natural language processing, image processing, video, text-to-speech, translation, and self-driving cars. Those are the typical big data problems. Nobody talks about graphs and graph theory. So is it really a big data problem or not? I would like to talk a little about that.
We are also going to do a little prediction with the help of graphs. I would like to show you that if your problem can be formulated as a graph, then sometimes you can build better predictions than without the graph setup. And what Lynx Analytics is researching right now is completely changing the way neural network machine learning is done. Actually it's not our idea; there are a couple of articles in the literature, but it's not very popular yet.
So the idea is that we put an independent neural network into every single vertex, which is able to learn from the neighbors, from the neighbors' neighbors, and so on. We are not going to do that today. We don't have an implementation yet in LynxKite, but maybe if we meet next year we will be able to do something like that too.
OK, so I still have to define a little what a graph is. The most important elements of a graph are the vertices and the edges. Both are sets. The vertex set is the set of nodes. They can be numbers, or labels like v and w, and all of these are called vertices. I am going to call them vertices, nodes, or just points; these are all synonyms for me. A pair of vertices defines an edge: (v, w) is an edge if v and w are both in the vertex set. If I list the set of vertices and the set of edges, I have the full graph; that is how I can represent it.
I have a couple of names for its measures. I count how many vertices I have altogether, which is a finite number in most cases, and I denote it with a small n. I also need the number of edges, which I usually denote with m. There are different names for these. I will probably forget that I initially wanted to call them rank and size, but we will always understand what we are talking about. There is one more thing, which I call the degree. For every single vertex you can count how many edges are connected to it, and the average of that number is the degree of the graph. I am going to define it properly a little later, but typically the average degree is calculated by the formula 2m divided by n, and you will realize that I used the same formula when I was solving the dodecahedron problem. It's not always true, but let's start with that formula.
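To make the definitions concrete, here is a minimal Python sketch of a graph stored as a vertex set and an edge set, with n, m, the per-vertex degrees, and the average degree 2m divided by n. The small example graph and the function names are assumptions made for illustration; the sketch assumes an undirected graph without loops or repeated edges.

```python
def degree_of_each_vertex(vertices, edges):
    """Return {vertex: degree} for an undirected graph without loops."""
    degree = {v: 0 for v in vertices}
    for v, w in edges:
        # Every edge touches two vertices, so it adds one to each endpoint.
        degree[v] += 1
        degree[w] += 1
    return degree


def average_degree(n, m):
    """Average degree of an undirected graph with n vertices and m edges."""
    return 2 * m / n


# A small made-up graph: the vertex set and the edge set are all we need.
vertices = {1, 2, 3, 4, 5}
edges = {(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)}

n = len(vertices)   # number of vertices
m = len(edges)      # number of edges
degree = degree_of_each_vertex(vertices, edges)

print(sum(degree.values()) / n)  # 2.0, averaging the individual degrees
print(average_degree(n, m))      # 2.0, the same value from 2m / n
print(average_degree(20, 30))    # 3.0, the dodecahedron: 20 vertices, 30 edges
```

The two ways of computing the average agree because each edge is counted once at each of its two endpoints, which is where the factor 2 comes from.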
If a graph is so simple that you need nothing but vertices and edges, then why are graphs interesting? The reason is that if you tell me just these two numbers, n and m, and they are large enough, then I can draw millions or billions, an almost unlimited number, of different topologies with them.
Let's take two very simple examples. This graph I would call a complete graph or full graph. It is the graph we used for solving the dodecahedron problem. If you tell me that a graph is a full graph, there is only one way to be one. Again, that is not 100% true, and I am going to define it later, but this is the way to create a complete graph: every node is connected with every other node, so the degree of each node is 1, 2, 3, 4, which makes sense, because it is n minus 1. And basically that's it.
But we can also build a graph like this one, which is a star, or I can also call it a tree, and which contains the minimum number of edges. With only four edges we can connect these five vertices, so it seems that the minimum number of edges needed to still have a connected graph is n minus 1, the number of vertices minus 1. The thing is that between these two extreme structures there are lots and lots of other topologies. With only five vertices there are not so many, but with 50 there would be a lot, like thousands, and with 1 million vertices and, say, 10 million edges, the number of possible topologies would be super high, and they would be structurally very, very different. When I say structurally different, I am really talking only about the graph sets, the list of edges and the list of vertices, because how I'm going to
for example visualize it does not matter: for me these two graphs are the same. I have the same vertices, but I visualized the same graph in two different ways. I call the two graphs isomorphic, because if I list the vertex set and the edge set, there is no difference. Again I'm cheating a little: isomorphism in graphs is more general than that, but let's not go into that detail.
I have already used a lot of simplifications, because there are many different types of graphs, and there are five properties that I think are important to mention. First, directed versus undirected. A graph is directed if in the edge set it matters that (v, w), the edge from vertex v to vertex w, is a different edge than (w, v). In an undirected graph I assume by definition that the direction doesn't matter: if (v, w) is there, then (w, v) is also there, and I am not going to count them separately. A graph is weighted if the edges can have attributes, which can be scalars or vectors. I have to mention that LynxKite, the tool you are going to use, cannot treat undirected graphs. For LynxKite every graph is a directed graph, which can sometimes cause small difficulties, but don't worry about that now. LynxKite can easily handle weighted graphs, that is, attributes on the edges. For example, in a telephone network, if I call someone, that is an edge between us, and the duration of the call could be the weight of the edge.
We call a graph simple if there are no loops, so I'm not calling myself, and there are no multiple edges. If I call my wife three times, that gives multiple edges, but when a graph is simple I merge these three parallel edges and say that there is just one edge between us. Of course I can weight it and say that the weight of this edge is three, because we called each other
three times.
Typically we are analyzing sparse graphs, and that is going to be the really interesting problem here. A graph is sparse if the number of edges is much, much less than the potential total number of edges. When I mention the potential number of edges, I usually think about an undirected simple graph, because if the graph is not simple, there can be multiple edges between two vertices and there is basically no upper limit on the number of edges. So the sparse property is usually connected with undirected, simple graphs.
There is one more thing we have to learn, the connectedness of the graph. I call a graph connected if any vertex can be reached from any other vertex. So this is a connected graph. And when I say minimum number of edges, of course I mean connected graphs only, because if the graph need not be connected, I can easily say that there is no connection between three and one, and then poor three is going to be a component on its own, and one, two, four, five is going to be another component. The number of components, or the connectedness, is something a lot of scientists are researching, and a very interesting question is where the phase transition is, when a graph switches from non-connected to connected. We might think that if a graph is sparse, then it is not going to be connected. But think again about this example: the minimum number of edges that is enough is n minus 1, which is linear in n, while the maximum is quadratic in n. For a very large n the difference between the two is huge, so a graph can be sparse and still connected, for example because you build a tree or a star. But of course a tree or a star is a very, very, very
special topology, so we know a lot of things about such a graph.
Other KPIs that we are going to use today: first, the degree distribution. I told you what the degree is: for every vertex you can count how many edges it has. So I can build a distribution on that: how many vertices have zero edges, how many have one edge, how many have two, and so on, and I can build a beautiful histogram. What that distribution looks like is super exciting, and we are going to check it in my example.
Diameter and characteristic length are another very interesting thing. The diameter is the longest shortest path in the graph. In the complete graph every vertex can be reached from every other in just one step, so the diameter there is one. In the star, the path from 2 to 4 takes two steps, so the diameter there is 2. Of course, in the complete graph you can also go from 2 to 4 in two steps through 1, but that is not the shortest path; the shortest is the direct one. So that is the diameter. A little more complex is the characteristic length: we calculate the shortest path from every vertex to every other vertex and we average them.
For example, if you meet a stranger and start to talk with him and realize that you have a common friend, then you shout, wow, the world is small. Basically we are saying that the world has a very small characteristic length. I don't know if I mention six degrees of separation here, but it is the same story: any random person in the world can be reached from any other random person in just six handshakes between friends. That is the idea of six degrees of separation, and you can learn more about it in the book.
Connected components: I was already talking about connectedness, but if the graph is not connected, then basically the diameter and the
characteristic length cannot be calculated. They would be infinite, because not every vertex can be reached from every other vertex. In that case I can calculate these parameters separately for every connected component of the graph.
The clustering, or concentration, coefficient is about how likely it is that my friends are also friends of each other. We usually say that the friends of my friends are also my friends. It means that if I know two people separately, it is very likely that they also know each other. Our world is like that: our friends are usually also friends of each other. It seems that our world is very strongly clustered, and there is some evidence of that.
OK, so let me define three super exciting types of graphs. It is worth defining them because, as I told you, the interesting thing is the topology. The way I create graphs, with different rules, different mathematics, or different algorithms behind them, creates completely different topologies, and these behave very differently if we put them into some kind of business or mathematical problem.
The first type of graph I am going to define is the lattice graph. We can define n-dimensional lattice graphs. In this example we have a one-dimensional lattice graph: the dimension is one, the number of vertices is 8, and the k parameter is 4, because every single vertex is connected with 4 other vertices. It's very regular, it's beautiful. It could be much larger: if n were not 8 but 16 or 256, the chain would look the same. Or if I defined
a two-dimensional lattice graph with, 1, 2, 3, 4, the same degree of 4 and a lot of vertices, it would look just like a grid, a beautiful two-dimensional grid. Three-dimensional would be a cube or something like that. Because it's so regular, it is very easy to calculate its parameters. For example, for a one-dimensional lattice graph the characteristic length looks like this equation, and the clustering coefficient looks like this equation.
We have another type of graph, the random graphs. Random graphs were studied very heavily in the 1960s and 1970s, and a lot of beautiful theorems were found. For a random graph, again, you don't tell anything else, just how many vertices and how many edges you want, and then the algorithm randomly throws the edges across all these vertices. The beauty is that if it's really random, then every time it throws out these edges the topology, the behavior of the graph, is somehow going to be the same. One beautiful theorem, which you will not find in the book, is really shocking: a relatively small number of edges is enough to make a random graph connected. We can see that the degree needed scales only logarithmically with the number of vertices. For a random graph we can also calculate these numbers very easily: the characteristic length would be logarithm n divided by logarithm k, and the clustering coefficient would be k divided by n. Oh, this is beautiful.
So let's do a lot of experiments. When we do, we realize that in these very regular lattice graphs the characteristic length is typically large. Imagine a huge chain or a huge grid and take just two random guys: the probability that you have
to go a long, long way is relatively high, so your path length is quite likely large. Our world is not like that. Our world is a small world, where we can reach people relatively closely, in six steps. In this chain, if it is really long, you might have to take a lot of steps to reach another random place. But the clustering coefficient is also quite large, and in that sense these graphs are similar to our world, because our friends also know each other. You can see a small cluster here where almost everyone knows all the friends. So these small communities appear, and people don't know those whom their friends also don't know. So the clusteredness of this very regular graph is high, and on that side it looks like our world.
For the random graph the characteristic length is typically very small. Because of the randomness, any two randomly selected points are not too far from each other; you are going to find some way between them. Of course the graph has to be connected, but we already saw that it is almost always connected if k is larger than logarithm n. So in that sense our world looks more like a random graph than a lattice graph. The problem with the random graph is the clustering, k divided by n. Remember sparseness: we want our graphs to be sparse, so k is relatively small compared to n, and k divided by n is small. So if you take a random graph and select one point, its friends are not going to be friends of each other; the probability is diminishing.
And then came the great idea of these two guys, Watts and Strogatz. They asked whether it is possible to construct a graph which has the clustering property of the lattice graph and the short characteristic length of the random graph. What they did was very, very, I think,
straightforward: they took a lattice graph and started to change the edges randomly, and they tuned the parameters until they reached a graph with both properties. They called it a small-world graph, the Watts-Strogatz small-world graph, and they said that our world is like that. Every example of graphs, like the calling circles of a network operator, the internet, where pages are connected to each other with hyperlinks, or Facebook and LinkedIn, which are social networks, all these topologies are Watts-Strogatz type small-world graphs, because the characteristic length is small but the clustering coefficient is high. They did a lot of measurements. Do I have that slide? No, I don't have that slide, but in the book you will find a lot of measurements showing that our world, Facebook, and the co-authorship graph have this small-world property.
And then came the guy who wrote the book, Barabási. There is one more important property, the scale-freeness of the graph, and unfortunately the Watts-Strogatz graphs are not necessarily scale-free. So he used another algorithm, but that's too long a story and I'm not going to go into it. If you want, you can read the book.
OK, so that is enough theory, and now let's do the practical thing. I would like all of you to open the website edukite.lynxanalytics.com. I really hope it works. You should get this screen. It's edukite.lynxanalytics.com: the name of the tool is LynxKite, the company name is Lynx Analytics, and the education version of LynxKite is EduKite. Works?
Yeah, cool. So I created, let me see, yeah, I created logins for you, and I have to share them with you. Let's see if we have that. OK. So you are going to be A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V. Your login name is going to be student, dash, capital, your letter. I'm really afraid that if more users log in, it is going to completely crash the system, so I'm not going to say the password on the live stream, but you can check it from that thing: it's the first word there, four letters. It should work. Let's go back to the slides.
OK, so what are we going to do? We are going to analyze a social network, Facebook data. Not a very big data set, a relatively small one, but still big enough to understand the main concepts. The way I generated this data is that I took myself, my Facebook ID, and downloaded the profile of everyone who is my friend, and I asked permission for that. Then I also downloaded the edges between them, so whether they are friends with each other. We call that type of sample an ego network. This is my first-degree ego network on Facebook. Of course this means that my degree, my number of friends, is complete, because I have everyone there, but my friends' degrees are not complete, because they might have other friends who are not my friends. So that is the network we are going to analyze, and we have a couple of nice questions to answer.
But the first question is why I use Facebook. I really like Facebook for showing graph problems, but why is it a good example? The first reason is that Facebook is the most beautiful place to learn graph theory. There are lots of vertices, lots of edges, and lots of attributes on both the edges and the vertices. So if you work for Facebook as a graph scientist, you are going to do the most beautiful thing a graph scientist
could imagine for him or herself.
This is a very good article, which tries to identify romantic partnerships on Facebook using just the topology, just friends and friends of friends who are connected to each other. It turns out that with this topology method you can identify who your romantic partner is with a higher likelihood than with text analysis or image analysis on Facebook. There are a lot of other articles about how different nations are connected to other nations on Facebook, how groups are connected, how communities form, and how we can identify fake profiles. So there are lots of interesting articles about graph theory and Facebook, and that's why I think it's a lovely place to learn and to get new algorithms and ideas from.
The other reason is that I strongly believe that almost every other industry can build up its own social network data. Think about a telco company, think about Singtel. All your data is there at Singtel: if you call anyone, it is recorded, so they can build their own graphs. Or think about DBS. DBS sees what transactions you make, whom you have lunch with when you split the bill. They know all of this, so they can also build their own social network. Of course this network can be better or worse, bigger or smaller than Facebook, depending on how much data you have. But the beauty is, and I measured it many times when I had the chance, that there is a very nice overlap between them. If you take Singapore's Facebook as one graph and the calling circles from Singtel as another and put the two on top of each other, you see a huge overlap between the structures. So if Facebook is able to find romantic partnerships in Facebook data, then quite likely DBS is also able to find romantic partnerships in DBS data, with a
different probability and a different number of training samples, but this is basically the concept.
So this is what we are going to do. We have Facebook data and we have these six tasks. I think these six tasks would take about 12 hours. Do we have the time? But I'm not going to take the login credentials away from you, so if you want to continue, or if anyone on the web would like to do some analysis by themselves, I can provide more access and more time for experimenting. These are the tasks we are going to try step by step. I think we are going to get to the romantic partnership today, and the rest can be homework. But all these things are possible: it is possible to find homogeneous groups, and it's possible to build models for predicting who speaks which language. Finding my romantic partner, my father, or my sister is relatively easy, because I definitely have all of them in the network. But my friend's wife is not necessarily in the network, and they definitely have far fewer edges between them there than I have with my wife. So that is much more difficult, but it is still possible with the ego network. So yeah, all of these things are possible, and I think that's it.
So let's go and really do the thing. Everybody should be able to go to the graph meetup folder and access this project. In LynxKite we have data and we have projects. The data I already uploaded, and I created a project where we can check it. So who doesn't have this? I know that you don't have it. And you are W, you are E. OK, so what you have to do is this: you can read this one but you cannot write it, so you come here and say that you want to save it in another folder, and instead of the graph meetup folder you have to put it into, and I forgot, that's great, I forgot the syntax. It's users and then the thing. OK, users. So
you come here and save it to users, then student E, right? Student E. All the others save it to their own folder. You can keep the rest of the name or give it a different name, whatever you wish, and just click yes, and it should now be in your folder. It's not in your folder? Just choose the right folder. Anyone else? Go out, go to student E. Yes, you have it. Cool. You can access it with just a click. Everyone has it? OK, so now you can already edit the file. And yes, I have to go into my folder; actually I have to save it to my folder too, so I also save it for myself.

OK, so I had already prepared this data, but if I hadn't, let me show you (and I'm going to sit down now) how to load any other new data source here. If you want to load new data, you just go here to the import icon, and there are different types of databases that you can import from. Most likely in this case you would like to import something from your computer, which can be a CSV file. You press it, and immediately this icon appears in your working folder. If you double-click, as I told you, a communication bar is going to appear, and here you can select what you would like to upload from your computer. I have already uploaded this thing, so I'm not going to upload anything now. But later on, if you want to do some experiments in your own folder, you can upload your Facebook data if you have it, or LinkedIn, or any text if you want: basically whatever can be uploaded and is separated by some delimiter, and then you have to set those things here. After that you press the import button and you are done; you have imported the data.

The API is closed: you will not be able to download this data from Facebook anymore. It was closed a couple of years ago, but you can still do it manually.

OK, so yes, you should be able to... let me check. You should be able to load any data. It's because you haven't moved, so you are still in the graph meetup folder. You haven't copied your stream, so you will not be able to do anything with
that. "But if I create my own folder?" You have to put it into your own folder first. Have you already put it into your own folder? "No, I haven't created my own folder." No, your own folder has already been created. So now you have to close it, go to save, and here you put users and then student whatever, I don't know which, and then it is in your own folder. You logged yourself out? The folder is there, but the file is not here. OK, so don't log out. Student H. OK, so once more: if you forgot to save your file in the right folder, you come here, you put users there, and after users you put your own username. You are student H, and it should be there.

OK, a lot of folders have been created in the meantime. Guys, just follow what I ask and don't do other things. This is a beta version; if you do other things, you are going to crash the whole thing. OK? No, no, later on everybody can do some experiments. If you get an error message, you just click and delete the error message.

OK, so let's see what kind of data I have here, what kind of data I have already uploaded. To take a look, you just go to this green circle and press it, and it gives you a small table that you can access immediately. This is quite straightforward. You can set the limit to 20, and then you get 20 lines. It is quite nice, because I took care not to import any garbage. What you see is a Facebook ID, a first name, gender, a location setting, and a Facebook rank. Facebook rank orders all my friends based on when they joined Facebook, so one, or an early number like six (Agnes), is someone who joined very, very early, and the others joined later. There is a dummy variable for male, there is a dummy variable for Hungarian, and there is a bin variable for age. You can see that all the variables I have imported are in string format. LynxKite reads all the variables in string format, and later on you have to convert them to integers or
reals if you want to use them as numbers.

So this is basically the vertex set, the set of vertices, and I can already start to play with it, because I promised that I am going to show you a lot of different ways. I can just say: OK, I want to use this thing, so build a graph for me. What I have is a table and I want to use it as vertices, so I am going to select this one: use table as vertices. By the way, if you are lost, you can type "use table as vertices" and you are going to find it in the menu. I always forget how to close it. Then what you have to do is connect the data with the graph builder, and this is so easy that you just do this. If you are a little bit of a control freak like me, you can press shift, and then you can arrange things in a nice grid with equal spacing.

OK, so let's see what I have to specify when I want to use a table as vertices. Basically I don't have to say anything. I can give an internal ID name if I don't want it to be just "ID". The ID itself is necessary, because this is the moment when the graph database is created in our system, and it always needs an ID, but for us it's not going to be important anymore. And now, if you click on the green thing, you don't have the table format anymore; you have something which is called a project. So we have already got a project, and the project is very kind: it tells us that we have 403 vertices (that would be the n, if you remember) and no edges. It is no wonder that we have no edges. For the vertices we have attributes: the age bin, the Facebook ID, the Facebook rank, the first name, the location settings and so on. And it's so cool that you can just go in and, for example, check what kind of location settings I have.

Yes? So can you just click in? Yeah. "Use table as graph, but I wanted to use table as vertices." Use table as vertices, yes. It's because, again, you didn't do what I asked earlier: you're still in the graph meetup folder, and you don't have write access;
you only have access in your own folder, so you first have to copy it. Yes: users, student. Have you copied it? No, you didn't. So first you have to go back. Back, back. Go to graph meetup, open it, go to save, and save it to users slash your folder. Anyone else need some help? You just connect it. You cannot connect from here to here; just drag, just drag. Easy, not like Python; just click. So, guys, if you are still in the graph meetup folder, one more time: you don't have write access there. You should have users, student, whatever; if you don't have it, you go and save it into your own folder.

Now let's see. We have this beautiful distribution of the location setting here, and you can already see that out of my 403 friends there are 225 who are Hungarians, or rather whose language setting is Hungarian, 137 whose language setting is US, 35 with a third setting, and two Russians, two Croatians and two German ladies. Or do we not know that they are ladies? We can find out, because we also have the gender here. OK, so we have boys and girls. For example, if you want to check whether those two German friends are girls, I just write "female" into the filter, and it immediately applies the filter: I only have girls. The filter is still active, so I go to the location settings, and yes, I have two girls. So they are girls. That's how it works. And if I'm a little bit confused, I can look at the totals: I have 403 friends, 178 females; then I did
the selection, so the total is 178, and if I go here the total is 178 out of the 403. I just click here on the small visual box and the two German ladies appear. Actually the Croatian friends are also girls, but among the Russians there is only one, because earlier there were two here. To discard the filter we just delete it and click out.

OK, so now let's do something funny. You are all waiting for me to create a graph, right? We are going to create a graph. You all think that I'm going to load in the edges between my friends, but I'm not going to do that. I'm going to build a graph by connecting those people who have the same first name. You can ask me later why on earth I am doing that, but let's try it just for fun.

For that I have to search for the right command. It should definitely be something like "connect". "Connect vertices on attribute": I think that looks good, so let's select connect vertices on attribute. If you want to do another trick now, and you don't like this way of connecting the building blocks, you can just bring them close to each other and then use shift; this is the real control-freak version of creating your project. You double-click here. OK, so what do you want to connect with what? You could set age bin as the source and Facebook ID as the destination, and in that case if one vertex's age bin matched another vertex's Facebook ID it would connect the two. But of course this is quite a stupid thing; it's not going to match. What I would like to try to match is the first name against the other's first name. Then let's click the blue button, and I connect the people, and at the end of the day it comes out that I have 1310 edges between my friends.

I think it is the right time to do some visualizations. You can do visualization from here, but I really like to use the proper visualization box, so I am going to click here. This
is the visualization, the graph visualization. You are going to have this nice eye, and if you click in, it gives you a panel. When I click the eye it gives me the communication panel, and if I click the green thing it gives me the output panel. Let's try to understand what's happening here. You should get something like this, maybe not exactly the same thing; it depends on the random generator. Probably the same. OK.

So this is the visualization panel where you can communicate with this visualization thing. Here you can do a lot of settings. First of all, for example, you can visualize different attributes on the chart. Now it tells me that they are connected, so they must have the same first name, so I am going to put the first name there. Yes, both of them are Tamash; Tamash is Thomas in Hungarian. For their settings, I can take the location setting and use it as color. Sorry, maybe I was a bit too fast. For first name I pressed label, and when I pressed label it gave me the name labels. Does everybody have Tamash and Tamash, or does somebody have something different? You also have Tamash, good. I will explain it later. So let's use the other one, the location settings, as color, and then it shows me that, interestingly, although I have two Tamashes here in this chart, one's language setting is Hungarian and the other's language setting is English. I think now you are getting closer to what I am trying to do. I am also going to put the gender there, and let's use icon. The two icons here are boy icons; you can see clearly that these two guys are boys.

And now a very good question: why do I see only two? I should see 403. Unfortunately it is not practical to visualize 403; you wouldn't be able to see anything. But it is possible to visualize a little bit more. So if you are done with that, and everybody has the two connected Tamashes with different colors on the chart, go to the visual settings part, and you are saying that
centers, which is set now to auto (automatic): instead of auto, pick me one; and instead of picking one, pick me two guys. Pick. Then one ID and another ID appear, and the Tamashes remain, but if you have a different image setting you should see Gary as well. Who sees Gary?

The way I am zooming in and zooming out, and that's why I wanted to suggest using a mouse, is with the help of the middle... what is its name? The scroller, the scroll bar, the scroll thing. If you don't have a mouse and you wonder how it is possible to do this, I have no idea; maybe you can search for it in this help thing.

But I think you will ask again: you selected two and now you see three. It seems that the computer is doing some completely stupid thing, but that's not the case. It is because there is also a radius setting, and the radius was set to one. So if you set the radius to two you are going to see exactly two; sorry, if you set the radius to zero you are going to see exactly two. If you go here and say that you want five, press pick, and keep the radius at zero, then you should see five. OK?

If it's really annoying to do this and you don't have the scroll wheel, then what you can do is go here to network layout, ask for the centralized layout, and press play. Does it work? I can even make it larger. If you are interested in how I did that: I again used the scroll wheel, pressing shift together with the scroll wheel, and I can make them large or small.

And now we understand why we have exactly five vertices: because we set the radius to zero. If we set the radius to one, we are going to have eight vertices. If you keep it on play, it is still going to move nicely a little bit. We see that we have two Katas and two Tamashes and two Martons, and if you go here and pick another five, we will see that we have three Latis and one, two, three, four, five, six Jánoses. These are graphs; these are connected components. These connected components are always complete graphs. This
is just because I generated it like that. It is not possible to have a component that is not a complete graph, because if two people share a name, and one of them shares that name with a third person, then all of them must be connected. So this way of connecting my friends has generated these name graphs with this topology.

If you want to play a little bit more, I will leave you a little time to do different visualizations. Why don't you try out the three-dimensional visualization, and why don't you try it by visualizing all of my friends? To visualize all of my friends, I think if you write star here then you have all of them. And I am not telling quite the truth, because those who have a single unique name you won't see in this three-dimensional chart. OK, I'll let you play a little bit with visualization, and I'll also give you some time to ask questions if you have any.

So, the way I visualized all of them: I put star here. Or you can also pick a large number; click in, and instead of 5 you can write 50 or 500, and it is going to show all of them. The star you have to put here, yes; then it is going to visualize all your edges. So clear those ones and just put star instead.

And I think the reason why it is not showing... OK, the reason why it is not showing is that, instead of the built-in visualization, you should use the visualizer. That is a better way of doing graph visualization. If I were you, I would just delete this one, because it is not necessary, and do the proper visualization. Works?
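The name-graph construction described above can be sketched outside the tool in a few lines of Python. This is only an illustrative sketch, not LynxKite's implementation: the function name `connect_on_attribute`, the toy `friends` table and its attribute names are all assumptions made for the example. It connects every pair of vertices that share a first name and checks that each name group of size k contributes k(k-1)/2 edges, which is why every connected component is a complete graph.

```python
from collections import Counter, defaultdict
from itertools import combinations


def connect_on_attribute(vertices, key):
    """Return one undirected edge for every pair of vertices sharing `key`.

    `vertices` maps a vertex ID to a dict of attributes (hypothetical layout).
    """
    groups = defaultdict(list)
    for vertex_id, attributes in vertices.items():
        groups[attributes[key]].append(vertex_id)
    edges = []
    for members in groups.values():
        # All pairs inside one group: the group becomes a complete graph.
        edges.extend(combinations(sorted(members), 2))
    return edges


# Toy data, invented for illustration only.
friends = {
    1: {"first_name": "Tamash", "language": "Hungarian"},
    2: {"first_name": "Tamash", "language": "US"},
    3: {"first_name": "Kata", "language": "Hungarian"},
    4: {"first_name": "Kata", "language": "Hungarian"},
    5: {"first_name": "Janos", "language": "Hungarian"},
    6: {"first_name": "Janos", "language": "Hungarian"},
    7: {"first_name": "Janos", "language": "US"},
    8: {"first_name": "Gary", "language": "US"},
}

name_edges = connect_on_attribute(friends, "first_name")

# A group of k same-named friends yields k * (k - 1) / 2 edges.
group_sizes = Counter(attrs["first_name"] for attrs in friends.values())
expected_edge_count = sum(k * (k - 1) // 2 for k in group_sizes.values())
```

Gary has a unique name here, so he gets no edge at all, which matches the remark that friends with a unique name do not show up when only edges are visualized. A tool may store each of these connections as directed edges in both directions, in which case its edge count would differ from this undirected sketch.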
Questions? Anyone want to ask questions? OK, this is beautiful, I really like it. Awesome. "I don't know how I did it." You did the same, but instead of visualizing in three dimensions you visualized it in two dimensions. If you want to visualize it in three dimensions, you just click here and say 3D, and of course you get a nicer chart. This was good too; you just clicked 2D. It's OK, beautiful.

Is it still trying to run? It's trying to run, but if I were you I would just get out of here and try to get in once more. So I delete this guy; maybe you delete it too. Is it working? Here you have something. OK, so just delete the visualizer and try to run another one. The thing is that the visualizer is using your own resources, so it can crash, because this is an Apple.

OK, shall we continue? What I want to suggest here is that we never thought that with a simple flat file, with just names in it, we would already be able to create a graph. Of course you will ask why this is good. I can give you an answer. For example, if we have one, two, three, four Krisztas who are Hungarians, whose language setting is Hungarian, then the fifth one, whose language setting is Great Britain, probably also speaks Hungarian, because the name is the same. Kriszta can be a British name as well, but then not with "sz", which is a typical Hungarian spelling. The same goes for the Istvans. Of course there are other names where you cannot be sure: Robert, for example, is spelled the same in different languages.

So if I want to build a model of who speaks Hungarian, this type of graph representation would be good, or at least not bad. But it would also be really good to know who the friends are, because it might happen that this Robert's friends are
non-Hungarians. If all of his friends are non-Hungarians, then this Robert here probably does not speak Hungarian. But if some of them are his real friends, not only people called Robert (because those are not necessarily his friends), and also have a language setting of Hungarian, then he probably speaks Hungarian. So this is how you can make use of different ways of creating graphs.

I'm just going to close it, and I'm going to go back to the second stream, to the second part, and see what I have in the other database. In the other database I have vertex A connected to vertex B, and I have IDs; these are Facebook IDs. So basically it gives me a list of which profile is connected to which profile. So what I'm going to do is again... any questions?

An error again? You crashed it? It's fine, no problem. Can you go back? Can you redo it? Ah, OK, no problem, it has already happened. Delete the chart and put in a new one. That is not a serious crash. The serious crash is if you crash all of our resources. That is also possible, but I think it's difficult.

OK, so let's go to the second stream, the second data set, and I'm going to load it. These are the edges. This is the vertex set and this is the edge set, and I have to combine the two. To combine them, I go here to build graph and say that I want to use a table as edges. If you click on that, you see a hammer which has two inputs and one output. One of the inputs is a project, so I'm going to use the project from this point. You can also connect it from here, but please don't; connect it from here. This means that I'm using the bare vertex set without any edges, and I connect this one as the edges, which comes in as a table. I don't think it is possible to connect them the other way round, so I think you cannot mix them up. If you try to connect it to the table... no, you can, you can do it, but it would give you an error message. So don't do that. Project to project, and
the table to the table. OK. And here you have to set three things. Are you following me? First of all, you have to tell the software which attribute of the vertex set is the ID, the unique ID of the vertex itself, and here it is going to be the Facebook ID. This key, this primary key, has to match the "from" and "to" keys of the edge set. So here the vertex A values are a subset of the Facebook IDs, and the vertex B values are again a subset of the Facebook IDs. These are the settings. You can do it the other way round, like this; there is going to be no difference, but I would still recommend doing it like that. When you are done, just press the green button and enjoy the super high speed.

There are 3,500 edges in my Facebook graph, and now I also see that the edge attributes have appeared: vertex A and vertex B. That is not too interesting. Otherwise our data is the same, but of course I no longer have the connections based on the first names.

So let's go and visualize this one as well. Now everybody can decide what kind of visualization they want to use, but I'll tell you something. If you visualize any point, then, because this is my ego network, I could be among those points, and if you go to distance two from that point you are going to visualize everyone, because everyone is my friend. So if you randomly select me, you are going to visualize everyone, and that is going to be a little bit painful for the processor. So let's not do that, but let's do some nice visualization.

OK, this is what comes out randomly for me. As icons let's use gender, and let's put the location settings there as colors. I think this is already pretty nice. And interestingly, as I told you before about the language settings, it seems that there is some homogeneity in these groups if you look at the language settings. It's also nice that if you select just one center, they appear only with their own friends, and if you want to make it a little bit bigger, you can randomly select three guys.
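The edge-table step and the radius remark above can be sketched in Python. This is a minimal sketch under stated assumptions, not the tool's actual behavior: `build_graph` and `within_radius` are hypothetical helper names, the IDs are invented strings (the tool reads everything as strings), and the graph is treated as undirected. It checks that every vertex A and vertex B value matches a known Facebook ID, and then shows why, in an ego network, a radius of two around any friend already covers everyone.

```python
from collections import deque


def build_graph(vertex_ids, edge_rows):
    """Join an edge table onto a vertex set, keyed by vertex ID."""
    adjacency = {vertex_id: set() for vertex_id in vertex_ids}
    for a, b in edge_rows:
        # Both endpoints must exist in the vertex set (the primary key).
        if a not in adjacency or b not in adjacency:
            raise KeyError(f"edge ({a}, {b}) refers to an unknown vertex ID")
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def within_radius(adjacency, center, radius):
    """Return all vertices at most `radius` steps from `center` (BFS)."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == radius:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency[vertex]:
            if neighbor not in distance:
                distance[neighbor] = distance[vertex] + 1
                queue.append(neighbor)
    return set(distance)


# Toy ego network: "100" plays the ego and is connected to everyone.
vertex_ids = ["100", "101", "102", "103", "104"]
edge_rows = [
    ("100", "101"), ("100", "102"), ("100", "103"), ("100", "104"),
    ("101", "102"),
]
graph = build_graph(vertex_ids, edge_rows)

only_center = within_radius(graph, "103", 0)
center_and_friends = within_radius(graph, "103", 1)
everyone = within_radius(graph, "103", 2)
```

Vertex "103" only knows the ego, so radius one shows two vertices, while radius two reaches the whole network through the ego.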
And the interesting thing is that, I think, we managed to select two guys and one girl, and you can see their friends here. This one, for example, is a quite interesting lady: all of her friends are also ladies.

I would also suggest that you look at the visualization of the whole graph, and I am going to visualize the whole graph: just raise the radius value. Take a randomly selected person: at distance one from them I am there, and from me everybody is there. So whoever the random person is, everyone is going to be in the picture, which is already the full graph.

So let's take a look at this graph. Would you think it is a small-world graph? Just look at this graph here. Do you remember what the definition of a small-world graph was? A small characteristic path length. What do you think? Absolutely: this is the most beautiful example of a small-world graph, because everybody is at distance one or two from everybody else. A very small characteristic length.

And you can see this nice clustering. These clusters, these communities, are those of my friends who are also friends of each other. Because this is visible, it is quite likely that one of these groups is my family, one of these groups is my classmates from school, one of these groups is the friends I play soccer with, and one of the groups is the guys I met here in Singapore, the meetup group members. And of course it is possible that some of my family members also play with me, or that some of my school classmates are also colleagues today and are here in Singapore. So it is possible that you, as an entity, are a member of multiple communities. This property of graphs makes the thing super exciting; it is really this community, and overlapping community, property that does it.

OK, now, while we are playing with this, let's also show the labels to see who is who in this network. In the three-dimensional visualization you cannot do that, so if you are interested in the names you have to go
back to the two-dimensional one. But before going back, I would recommend setting the radius to one, to avoid visualizing too many points. Then on this chart you can put the names, and you will see how they are connected to each other.

OK, so do we remember what tasks we have? The first task: find me. I should have brought chocolate; I cannot give any gifts. Next time. The first one who finds who I am in the graph is going to get a beer next week. In the meantime I can drink a Coke. It is very hot. Are those cold? Cold or warm? OK, relatively. That one is super warm.

Who am I? What's my first name? No. Who? No, no, that's not me. Guys, this is my ego network. I cannot be a guy who is not connected to everyone on the chart, right? No, because he's not connected to all of these guys, so I cannot be that one. Do you think that I was so stupid as to use my own name when asking this question? Of course I changed my name. The name is not Gabor; you can filter for Gabor and it won't help.

OK, guys, I'll help. You can do any kind of visualization, even a visualization with just one point. In this case the starting point is Tamash. You can see it because Tamash has a nice, beautiful white halo here; that is how you know he is the selected point. The others are at distance one from him. I can show this with the radius: if I set the radius to zero, it is only this Tamash; if I set it to radius one, it is this Tamash and his friends. So Tamash is connected to everyone here. I have to search for someone who is also connected to everyone, because that is going to be me, since this is my ego network. And that is going to be this guy here. That's me: I'm connected to everyone. So if I go here and select radius two (actually, go and try it), it doesn't matter which point you start from: everybody is going to be in the picture, because this is my ego network. I'm going to stop the animation, because it is using my resources.

OK, but there is a cleverer way to identify people. This looks like really not a professional way,
and not a very easily automatable way, to find myself. The easy way to find myself is to calculate everyone's degree. The degree shows how many friends someone has, and because it says here that I have 403 members altogether, the person whose degree is 402 is me. If there are two persons with 402, then that is probably me and a fake copy of myself.

How do we calculate the degree? The easiest way is to just type "compute degree" or "degree" here, and it is automatically going to find it for you. Or you can use the quite logical menu structure here. This would be a new vertex attribute, since the degree is a property of the vertex: every vertex has its own degree, its own number of friends. So here we are going to use compute... what was the name of it? Calculate degree? Where is degree? Ah, I'm wrong, it's not here. It's in the graph computations, because degree is such an important graph calculation, like the clustering coefficient for example, that it is here. So from the graph computations I take compute degree. It's a beautiful snowflake, and I'm going to connect it with the project.

If I go into the communication bar, it asks me what the name of that variable should be. "Degree": yes, I like it. And then it offers me a lot of different options: incoming edges, outgoing edges, all edges, symmetric edges, in-neighbors, out-neighbors, all neighbors, symmetric neighbors. So it seems that under degree there are two different definitions, one by edge and one by neighbor, and each can be incoming, outgoing, all or symmetric.

Now what does edge mean? Edge means how many edges I have. But as I said, in LynxKite edges can go in both directions, so if I have an incoming edge and an outgoing edge, that is basically two edges. If I have a loop onto myself, that's already one edge. If I have multiple edges between two entities, then every one of those overlapping edges is counted as an edge. So edge and neighbor are different definitions: neighbor would
be the distinct number of friends who are connected with me through edges. In this case, the Facebook case, it doesn't matter, because we don't have loops and we don't have multiple edges; edges and neighbors are going to be exactly the same.

Incoming and outgoing would be different, because as you saw earlier on the visualization, there are some directions there. Oh my, that was a mistake; let's go back to radius one. So you see that there are different directions here. The way these directions were created is that we link from the smaller Facebook ID to the higher Facebook ID. But it doesn't make any difference; it could be the other way around, or I could have two edges here, symmetric edges, in both directions. Basically this is an undirected graph, but we are still representing it with arrows. This is because LynxKite works like that, unfortunately.

So here, for the degree, I would suggest selecting all edges, but you would get exactly the same result if you selected all neighbors. Then let's click here, and you will see that a new variable called degree has appeared here. We can click on the histogram of the degree (where are you?), and this is the degree distribution. I can see that there are 300 friends whose degree is between 1 and 21, there are 91 friends whose degree is between 20 and 40, then 40 and 60, blah blah blah, and I have one guy, exactly one (I can zoom in), who has 402. That will be me.

So if I want to see who that is, I can go to the visualization. Let's go to the visualization, let's connect it, and I can say to the visualizer: OK, I'm interested in Gabor. So I want to select one center, and I want to use a custom restriction. Add restriction: degree equals 402. You don't have to type the equals sign; it will know that this is equality. And of course I should be male, so you can add other restrictions, but you don't have to; you are still going to find me. So let's see.
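The difference between counting edges and counting neighbors, and the trick of finding the ego as the vertex with degree n - 1, can be sketched as follows. This is an illustrative sketch only: `compute_degrees` is a hypothetical helper, the edge list is invented, and two conventions are assumptions for the example (a loop is counted once as an edge, as described in the talk, and a vertex is not treated as its own neighbor).

```python
from collections import Counter, defaultdict


def compute_degrees(edges):
    """Return (all-edges degree, all-neighbors degree) per vertex.

    `edges` is a list of directed (source, destination) pairs; direction is
    ignored, which corresponds to the "all edges" / "all neighbors" options.
    """
    edge_degree = Counter()
    neighbor_sets = defaultdict(set)
    for source, destination in edges:
        edge_degree[source] += 1
        if destination != source:
            edge_degree[destination] += 1
            neighbor_sets[source].add(destination)
            neighbor_sets[destination].add(source)
    neighbor_degree = {v: len(neighbor_sets[v]) for v in edge_degree}
    return dict(edge_degree), neighbor_degree


# Toy data: vertex "1" is the ego; the edge between "2" and "3" is duplicated
# on purpose to show where the two definitions disagree.
edges = [("1", "2"), ("1", "3"), ("1", "4"), ("2", "3"), ("2", "3")]
edge_degree, neighbor_degree = compute_degrees(edges)

# In an ego network with n vertices, the ego is the vertex with n - 1 neighbors.
vertex_count = len(neighbor_degree)
ego_candidates = [v for v, d in neighbor_degree.items() if d == vertex_count - 1]
```

With the duplicated edge, vertex "2" has an edge degree of 3 but only 2 distinct neighbors. Without loops or multiple edges, as in the Facebook data, the two numbers coincide for every vertex.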
Yeah, and you have to go to pick. OK, of course. I'm not interested in the radius, and I want to see the first name, and it's me. OK? The degree was 402, just all minus one. And you have to connect the visualization after the computation, otherwise you cannot refer to that attribute.

"Vertex A, vertex B: what kind of data is that, what does it mean?" Both are Facebook IDs. Yes. Not random: they are the friendships between my friends. That is what was downloaded from Facebook. That is the relationship between them. Questions? Yes? The thing is that you pressed next. That's OK, because it has already calculated who that is, but then click out of here, because now you are visualizing. Click out and set the radius to zero. You don't want to see everyone; you just want to see that one.

OK, let's find my wife. Before we search for my wife, we are going to do another interesting piece of data engineering: we are going to filter out the outliers. Who are the outliers here? Me. I am the outlier, because I have such a high degree that I am distorting the whole picture. Anyway, everybody knows that everybody is connected to me. So let's filter me out of the network and see how the degree looks after that. Let's close every window (we can keep that visualization here). For that we need a filter, so let's see how we can filter.

OK, so we are going to do a filter by attribute, because we are going to use an attribute to filter out records. I have connected the filter after computing the degree, and I am going to filter out myself. You can do it in different ways. You can say that the first name is Benneu, or, even more reliably, you can say that you are not interested in the person whose degree is 402. To filter it out you have to use the exclamation mark: exclamation mark 402. Or less than 400, that also works. Or even a range works: between 1 and 400 would work too. So there are a lot of different syntaxes that you can use here.
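The outlier filter described above can be sketched in Python as well. Again this is a hedged sketch, not the tool's filter syntax: `filter_vertices` and `count_edges` are hypothetical helpers and the tiny graph is invented. The point it illustrates is that the degree is computed first, the predicate is applied to that stored degree, and removing a vertex of degree d from a graph without loops or multiple edges removes exactly d edges.

```python
def filter_vertices(adjacency, keep):
    """Keep only vertices for which keep(vertex, degree) is true.

    The degree passed to `keep` is the degree in the ORIGINAL graph, which
    mirrors filtering on an attribute that was computed before the filter.
    """
    kept = {v for v, neighbors in adjacency.items() if keep(v, len(neighbors))}
    # Drop every edge that pointed at a removed vertex.
    return {v: adjacency[v] & kept for v in kept}


def count_edges(adjacency):
    """Count undirected edges; each one appears in two neighbor sets."""
    return sum(len(neighbors) for neighbors in adjacency.values()) // 2


# Toy ego network: "me" is connected to all four friends.
graph = {
    "me": {"a", "b", "c", "d"},
    "a": {"me", "b"},
    "b": {"me", "a"},
    "c": {"me"},
    "d": {"me"},
}

ego_degree = len(graph) - 1
edges_before = count_edges(graph)

# Equivalent in spirit to the "not equal to 402" filter in the talk.
without_ego = filter_vertices(graph, lambda vertex, degree: degree != ego_degree)
edges_after = count_edges(without_ego)
```

Here one vertex and four edges disappear. By the same arithmetic, dropping the ego from a 403-vertex network with 3,500 edges should leave 402 vertices and 3,500 - 402 = 3,098 edges.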
You just click out, close the window, and it is going to work. Yes, thank you very much. If you click the green circle, you are going to see that it is calculating, calculating heavily. What it says is: OK, we lost one vertex, so the new number of vertices is 402, and we lost 402 edges, so the new number of edges is 3,098. It all makes sense; this is exactly how it should look.

That's a good guess, and Alexandra is very nice, but none of those are right. But we are going to go exactly that way. What you probably did was click on degree and say: OK, the next highest degree is 102. Let's zoom in, and there are a couple of them. Actually you can also use the logarithm. The logarithm is so nice: it shows you the logarithmic distribution. Then you can say: OK, it seems that now the extremes are those above, let's say, 40. How many of my friends are above 40? 13. So now, if I visualize all these 13 friends, and I check the gender, and maybe I show the degree as size, then my wife must be among them.

So I am going to do exactly that. I am going to use a new visualization, and I am going to say (again I have to press here) that I am searching for all the friends whose degree is higher than... what did I say, 40? 40. And of course I am just going to set the radius to zero; I don't want more. OK. And I want not one but all of them. I think I have 13 or 12 or something, so I will put the count at 20. I have all 13. Let's put their names there, and let's put their gender: I am going to use color for the gender, and let's use size for the degree.

So that's why you said Alexandra: because Alexandra has the highest degree and she is a girl. And she is not my wife. "Your sister?" She is not even my sister. But let's see what we know here. No, not very far from the truth. Alexandra is really, really nice, and I will tell the whole story later, but let's observe
this chart. So what do we see here on this chart? These friends all have a very high degree, and we see that Alexandra is connected to almost every one of them. The second largest would be Kinga, connected to Alexandra and to many friends of Alexandra, but not to Sabina, not to Zofia. And Sabina here looks really very alone: she only knows Yuri, but none of the other high-degree friends of mine. How is it possible that someone in my network has a very high degree but doesn't know all the other very-high-degree friends? The only way, or the very likely way, that it can happen is that in my Facebook graph, as we saw earlier, there is one very large component, and Sandra and Kinga and Julia and Zofia are part of that very large component, but there are a lot of other smaller components, and Sabina is collecting friends from multiple different components. The big difference between Sandra and Sabina is that it's very likely that Alexandra's clustering coefficient is really very strong: she knows everyone. But Sabina is not like that. And if you think a little bit about what happens during your relationship with your wife: you are going to introduce her to all of your friends, well, not all of the friends, but the friends whom you meet for different purposes. You introduce her to the family, sometimes she is going to cheer for you at your soccer game, then the colleagues or the friends. So she is going to be part of multiple communities, but she will not know too many people in each community, while Alexandra might be there in the one largest community, which is probably from the university, where you collect most of your friends, and maybe she was very popular at the university. So to get an answer for that we should do a community search. You remember, earlier we saw these communities. Let's go back to that: this was the chart, the beautiful three-dimensional chart. So let's try to identify all those segments, all those small topologies separately, and
see in which ones Alexandra is participating, in which ones Sabina, and in which ones the other girls are participating. For that we have to run a community search algorithm, and the community search algorithm is a segmentation algorithm. So here, from the segmentation builders, we are going to search for "infocom": the Find infocom communities algorithm. This is my configuration panel, and if you click to find out more, it is going to tell the whole story of the community algorithm, how it is created, and I think it should even give you the article. It does: it refers to the article on how this community search algorithm was implemented. This is the article. OK, that's not so interesting now. Let's see the settings. Name for maximal cliques segmentation. What? Why do I need a name for a maximal cliques segmentation? Because the way the community algorithm works is that it first searches for all the maximal cliques, and then from the maximal cliques it creates the communities. So basically it is going to generate two types of segmentation, and for us the communities are going to be the important one. Now, this must be set to false, because this one says that edges are required in cliques in both directions. In a telecommunication network, both directions means that you call her and she calls you: it's bi-directional. For creating telecommunication cliques and communities that might be a very important structure or feature. Here in Facebook it would also be important if we were not using an undirected version of the graph, but we are. So this "false" means that basically it's true, but the data is not represented that way. We can give a minimum clique size, and we can keep it on default; zero and 0.6, these are going to be different settings for the algorithm, but let's keep them on default. It is already going to give us, I think, a quite nice outcome. So when we click here, we already see that now we have two segmentations, as promised. We have a
maximal cliques segmentation and the communities segmentation. And if you now click on communities and say open, it is going to create the communities. So nothing happened: we are in a lazy mode. Spark is running the query in a lazy mode, so it is only going to run when somebody requests some results from the run. But now we are requesting, because we want to find out how many non-empty segments we have and how big they are. Let's try to understand what is here. Remember, we had 402 profiles, friends, in this network. From these 402 vertices we have created 92 non-empty segments. We have 92 communities of different sizes, from 3 to anything. The total size of these 94 non-empty segments is 706, so it means that an average segment contains about 7 members. But then many people must be seen here multiple times, because I don't have 700 friends, I only have 400. So many, many of them are members of multiple communities, and that's why we call them overlapping communities, because the communities overlap each other. That's why Sabina or Alexandra can participate in multiple communities. And we see that we cover 373 base vertices, so there are about 30 vertices which are not covered by this segmentation. So 30 vertices are not members of any community. They are just like satellite friends of mine. Maybe they are my secret lovers. No, 30 would be too much. So let's see the size distribution. It says that in total I have 94 communities. The sizes of the small communities are between... actually, let's do the logarithmic one, because it's nicer. So I have 44 communities with exactly 3 members, I have 19 communities with exactly 4 members, I have 8 communities with exactly 5 members, and you can see I have a very large community with 198 friends, and my second largest community has about 30 friends, 30-something friends. So now I'm going to do again something which is a nice visualization. I'm going to use this
visualization here. So I'm going to connect it with the new data, with the community search, and I'm going to do something really cool here: I'm going to split the screen into two parts, and for that I'm going to load the communities. So it's already created, it's cool. I'm going to close this visualization, this part is closed, and in this visualization, oh, I see one segment out of the 93, but I want to see the largest segments. So do I remember the size distribution? Let's use the logarithm. OK, so 30, 20. OK, let's see everything which is above 10. Let's see if the size is above 10. OK, I have 11 communities which have more than 10 members. Let's see what happens if I include the equal sign, maybe I have even more. No, I still have 11, it doesn't matter. So let's visualize all of them. I click here in the center, I want to see all of the 11 ones, and here are all the big communities of my friends. So this is the very, very large one, and the next largest one was somewhere here, this is the next, it's 34, and all the others are 20 or so: 23, 20, less than 20. And now I'm going to put back this visualization, and basically what happens is that the communities and the people are connected. So Alexandra is a member of the large community, but of only one community among the largest communities. Julia is a member of this large community. Janos is also a member of this large community. Zofia is a member of this community. There are some who are also part of other communities. But if you take Sabina, Sabina is part of three large communities, and with this she is the winner: she is the winner in knowing the most people across the large communities. So the answer would be that it is Sabina who is my wife. Sabina is participating in most of my communities, and although she doesn't have the top degree (she is behind Alexandra, not much behind, but she is behind), her knowledge about my friends, and therefore about myself, is way higher than
that of Alexandra or anyone in that community. And of course you are data engineers, so you would ask me: OK, but here again I had to do segmentation and this and that. If I have a huge database, I want to immediately select these pairs for everyone, not only for Gabor and for Beno and Sabina and for Alexandra's husband, and I want to find Zhewets and Gabor's wife. So how can we do that? And then comes the article, this Facebook article. There is an algorithm which automatically calculates this value and projects this value, what we calculated here, onto the edges. So this is going to be an edge property, and it is called dispersion. For calculating normalized dispersion I need the graph where I'm still in, because I want my edges there. So I'm going to come here and say, OK, where is this dispersion algorithm? Dispersion: compute dispersion. After the degree I can compute the dispersion (let me change the graphics a little bit), and the good thing is that I don't have to tell the machine anything. It is going to calculate dispersion with these automatic settings. It is creating dispersion and normalized dispersion as edge attributes, and then I'm going to ask it to calculate. So remember, I'm still in the database, because there are 402 vertices, and I can see that there is someone with 1.2 dispersion. So what do I have to do? I have to visualize myself and everyone whose dispersion is high enough. How am I going to do that? OK, what I'm going to do is use a filter, and I'm going to drop those whose dispersion is less than, let's say, 0.5. Maybe this is not a good idea, but let's do that. So I'm again going to use the filter, filter by attributes. The good thing is that I can filter not only on vertex attributes but also on edge attributes. So I'm going to say that I want normalized dispersion to be larger than 0.5. Click here. Yes, it's calculating, calculating. Only 20 edges remained, only 20 connections remained. Now, if I want to see all these 20
connections, I cannot visualize all the 403 customers. I should visualize only those where this dispersion is high. So what I'm going to do is calculate the degree again, but this is the new degree, and then I'm going to do a filter on that. So I'm going to calculate the degree again and give it a different name: this is the high-dispersion degree, this degree based on all neighbors. And then I'm going to use a filter. Everyone whose high-dispersion degree is 0 is not interesting for me. There are 382 friends of mine whose high-dispersion degree is 0, so they are not connected with a high-dispersion edge to anyone. I can filter them out: the high-dispersion degree must be higher than 0. I have 20 edges and 21 vertices, my friends and myself, and that is going to be easy to visualize. So I want to see everyone. I want to see the names. What else do I want to see? I want to see the genders as color, because that would be important, and I want to see the normalized dispersion with width and with color. That is going to be a beautiful chart. Why don't I see everyone? And why László? Who is László? I'm lost. I am going to use this pick all. Yeah, it seems that everybody here is my friend, and the highest is Sabina. And yeah, basically that's it. That's what this algorithm found out: that she is the highest one. I was hoping that we were going to get a few more connections, so it would be really interesting to see dispersion not between me and my friends but among my friends themselves. To see that we have to lower the threshold a little bit, but I also have to exclude myself from the network again, because this is going to show... actually, we can do that, I think it's not a big deal. We just come here and say that, OK, 0.5 is too high, let's see 0.25. Can we see some connections which are not between me and my friends but between friends and friends? No. Yeah, they are connected only to me. So yeah, basically if I want to go
for more, if you want to analyze connections between my friends, then I have to filter out myself again. But OK, now we have an algorithm which we can automate and which identifies Sabina as the one with the highest dispersion. It's very easy to find the highest one. And if we want to go further, then we can say, OK, so here we calculate the dispersion, the normalized dispersion. Let's look at the logarithm. OK, almost everyone is 0. So what we are going to do is say that we want all the calculated dispersions, we are not interested in the high-dispersion one, and we are not interested in me. So we are also going to (where is the degree?) discard me, and it should work. Oh yeah, I think I have one more mistake here. I think I should use the star. There are no edges. Why? Edges, edges, why don't I have edges? Zero edges. Oh my. OK, I think I know what the problem is: I'm still in the graph, and I should filter myself out of the graph. So I filter out myself, and then the normalized dispersion should work. Yeah, now we have something. Of course it's not too nice, so let's try to make it a little bit nicer. Let's use animations, and, yeah, I'm trying to use different visualizations. OK, Alexandra has a lot of big ones. Let's see who still has big dispersions, which is the most red. These are really nice matches, so all of these connections are very meaningful ones. I'm not going to go into each of these one by one, but basically here we find the romantic relations or friendships, I mean strong friendships. This is husband and wife. This is, yeah, these are... and Monica, very strong friends, I think also family. I don't find my favorite example, but I don't know, this is mother-in-law. And so, basically, I think I will stop here, unless you are interested in the rest. Quickly, let me summarize. Where do I have my slide? OK. So why do we need this tool? Because working with graphs is super difficult. If you have these structures as tables and you want to use SQL, you
would be completely lost. Probably you were completely lost like this as well, but if you had to go down to the scripts and the syntax with only tables, that would be very difficult. So what I think is that LynxKite is a great tool, I mean the UI is a great tool to start learning to work with graphs, but after that you might not need the UI, you might only need some really good Python libraries. I'm not sure which version is here, but the newest version (yeah, this one doesn't have it) has a little button here: you press it and the whole thing is converted to Python code with the Python library. So the whole thing can be easily automated, still using the power of Apache Spark. I think we learned how to create graphs, and if my next question is whether we can do viral models, then the answer would be yes: we can identify who is speaking Hungarian among those whose language setting is not Hungarian, because I can combine the name graph and the social graph, and those who have similar names and the same friends are probably speaking Hungarian. The whole thing I can train on the existing Hungarian speakers, so that should probably work. What are the basic descriptive analytics of graphs? We learned that there is the degree, there is the dispersion, there is the connectedness, the connected components. But there is still this question: why are graphs big data? I think the answer is that you have a lot of big-data type of problems which you can solve with small-data resources. Every time you are able to do sampling from big data, and the sample is going to give you a relatively good machine learning model, then you don't really need big data, it doesn't matter. And it is the same vice versa: if you use non-big-data methods on big data, you are going to get exactly the same result whether you have the whole database or you sample 1% or 2% of the data. You are going to have exactly... am I right?
I hope I'm going to have almost exactly the same values. For logistic regression it doesn't matter if I'm training the algorithm on 1 billion lines or just 100,000 lines: the coefficients are going to be almost the same. It depends on a couple of things, but if there are not too many input variables, then it is going to be the same. So can I sample graphs? That would be the question: is it possible to sample graphs? You just saw these connected components, and then everybody appears if they are a friend of mine; just go to the 2nd or 3rd degree. It seems that it's not possible to sample graphs efficiently. If you do a random sample of Facebook, if you randomly select 100 profiles from Facebook, nobody is going to be a friend of any of the other randomly selected 100. So that is a bias, a strong bias. If you do ego sampling, then for the egos you are going to have high degrees, but for the friends you are not going to have high degrees. If you say, OK, let's start with egos and go not one step from the egos but two steps or three steps: if you go more than three steps, you are probably going to have all the Facebook data, because everybody is so close to each other. So the big data problem here with graphs is that you cannot really sample well. But if you cannot really sample well, you have to analyze the whole graph, and if you have to analyze the whole graph, then you need a scalable tool, and LynxKite is a scalable tool. And the last thing is: can we use it for machine learning? Yes, because it was possible to teach the graph who my romantic partner is, who my wife is, and it would be possible to do a lot of other things. But is it possible to do supervised learning? Because this was not supervised learning. I was saying that these are the features which are possibly true for my wife, so we used rules and rules and rules and arrived at a conclusion. This is not supervised learning. Is it possible to do supervised learning on graphs? And I think that that is
going to lead us to the next question, which is going to be answered at the next meetup. Any questions? Ladies first. Yes. Yes, but how would you build a graph of an image? So it is not programmed yet, for example loading an image into this system, because the image is not going to be a nice CSV file. You have to do a lot of filtering and transformation with that file, but I don't think that is ever going to be built in, because you can do all those things in other Python libraries. So first create some semi-structured data, or some usable data, from your images. For example, let's run a convolutional network on them, and instead of putting in the picture itself, let's put in one layer's data as the input data. But then you use it as a vector, and your representation is going to be a convolutional network representation, and the inputs are going to be that. Then the question would be how you connect them, because these are going to be your vertices: this is a cat, this is a dog, this is a face, this is something. How would you connect them to each other, based on what parameter? Maybe you have some data for the pictures, like this one is from a film. Yes, so it is possible, it can do it, but I don't see why it should. Why not use something different? What is the benefit of treating images as graphs? And if I think about Facebook: Facebook has a lot of images, and Facebook can connect these images to the profiles, and the profiles are connected anyway. So Facebook can connect images to profiles and then connect based on that. This would be a bipartite graph, and from a bipartite graph of images and profiles you can build a homogeneous graph with only connected pictures, where two pictures are connected if those who uploaded the two pictures are friends. Then you can do something with that. Then you can try to identify, let's imagine you have a picture with two persons, whether those are the two profiles or they are different ones. That can be a
good question, that's possible to do. Actually, Facebook did a similar piece of research. They were checking pictures with one person, two persons, three, four, five, six, seven, eight, nine, ten, and then they were checking the genders. It turned out that, yeah, if there is only one person in the picture it's about 50-50, 30, 30, 30, whatever, it doesn't matter, but if there are more people in the picture, let's say 7, 8, 9, 10, 12, 20, then the probability that all of them are boys is much higher than that all of them are girls. Much, much, much higher. So the same-gender pictures are very much biased towards boys. Yeah. Yes, and they do. So that's possible, and that would also be possible with our tool, but you need all those other libraries which are responsible for transforming your image data into something more structured. You can load text and you can do text graphs, that's very meaningful. For example, we tried to use PageRank on just articles, and the interesting thing is that the PageRank gives more or less the content of the article. So you can do those things. Yeah? So is this your proprietary library? Yep. Yeah, no, no: all of these algorithms, the ones calculating the degree, the connected components, the clustering coefficient, these are written in Scala and these are our own code. I wouldn't say proprietary, because all of the codes have their algorithms, but I think that the community algorithm is using this algorithm. This is implemented in Scala on Spark, so that's the way it is implemented. And yeah, at the time when we developed the algorithms, Spark was also trying to build its own graph library, called GraphX, and we did a very thorough comparison of whether we should use GraphX for these algorithms or our own Scala algorithms. It turned out that GraphX was not good enough: it was not scalable enough, so it failed on large queries, and that's why we built everything from scratch. But we still use Spark, just not GraphX. Any
other questions? Yes? There is a question about how graph X energy was in some technical data, and then there is a certain component, and each component seems to be useful, so is there any way to do that, or is there some way to actually create an intercontinental model? Yeah, so this text mining and knowledge graph building is a very interesting topic and a very wide research topic. I'm really not an expert on it. There are thousands, or hundreds of thousands, of researchers, and I did only very, very few of these things. Those who are doing real hardcore text analysis and text mining are really good analysts. But I can still tell you a couple of examples of what we did with LynxKite. One typical problem is to match: match people, match addresses, match names, match IDs, match things with each other. And if the data is textual, like an address, then matching is pretty difficult. Even if the matching is based on names it's pretty difficult. Let's imagine that someone is called John Smith: with that name you are probably going to match a lot of people, so then you shouldn't use the name as a match. But if your name is very unique in that environment, you might be able to use it. These are very simple graphs. The other would be the address. If the address is not as nice as in Singapore, where with the zip code basically everything is coded in, but it's something like in Hong Kong, where typically every street has at least two or three different names (one Chinese, one English, one old, one new, one low rise, one high rise, whatever the code is), then creating a good database of addresses would be very difficult. It's free text, and it's very difficult. But what you see is that an address is still a flow, a graph of vertices. Every vertex is a part of the address, like the number, the road name, the type (this is a road or a square or whatever). So then there would be a lot of vertices connected, and depending on which position you are in, what comes
after what, how many times you appear in the database (because the street is going to appear a lot of times, but maybe the name of the condo would appear fewer times), based on these things it was possible to create a good algorithm to correct and match addresses. Are these use cases something like what you are looking for, or is this still too basic? OK, that's different. This is already a knowledge map, this address tokenization. But about the knowledge map, the other thing that is very typical is this: you have a big text, like an article or something, and you tokenize it first, and then every word is a vertex, and every time a word comes after another word (of course after tokenization) there is a connection. That can help to build up some kind of knowledge about the text. I think most of the natural language, NLP, algorithms are using similar techniques with graphs, and things like that. It's very easy with LynxKite to do that, but what you are going to do after that, I don't really know. Yes? So gigabyte-wise or terabyte-wise it is still not as big as video streams, obviously, but in vertex and edge size the largest one had 600 million vertices and, I think, 2 billion edges, something like that. We didn't do this 3D visualization of that network, but you can visualize really big ones. By the way, if you are interested in how big the graphs were that these guys were visualizing, I think I have something on that. That was a very nice table. So for example, that was a quite big one, but still not as big as what we analyzed with LynxKite. This actor network database contains 700,000 actors, and every two actors are connected with each other if they played in the same film. So there are about 30 million connections between them, which means that one randomly selected actor has, on average, about 80 other actors who they played together with. This is the
famous Kevin Bacon graph, but this is still small for LynxKite. You can easily put two zeros behind these numbers and we are still going to be able to analyze it, with a powerful Hadoop cluster behind it, but still it would be possible. I never completely understand the technology, but the way we typically work is that we are solving problems. So if the problem is identified and it is a graph problem, or if it's a match for us, we go and run a project, and then we can decide if it needs to be automated. Because the thing is that when you go in, you don't know whether the accuracy is going to be good enough or whether the implementation is going to be necessary. So you are probably going to do a first run, or a couple of runs, and after that you do the automation. And then we can either just leave LynxKite there and they can play with it, or we can be responsible for fine-tuning and maintaining it every month. There are separate, different engagements with different companies. Not necessarily. When we started five years ago, nobody had Hadoop clusters, so we were just going in with our three or four machines and we connected them together. Today, one thing they can do is just use Amazon or Google Cloud or Azure, or if they don't want to, they have their internal ones, so we just go and install. You don't prepare those clusters for them?
Not really, but we can, and at the very beginning we did, so we had to learn all these things. But today I think most of the companies realize that there are very good players in the cloud game: there are AWS and Google, so let them do the thing. For example Vodafone: Vodafone decided two years ago to put everything on their own big data clusters, so they had to buy machines, and today they realize that it's not going to work, so they agree that they are going to use a cloud service. Yes? Yes, this is typical. The name of these graphs is bipartite, or tripartite, or multipartite graphs. We have experience, for example, with devices, phones, and phone numbers, because SIM cards and devices are in a many-to-many type of connection. One SIM card can be put into many devices, and into one device you just put one SIM card, then another SIM card, and another SIM card. So it's a many-to-many connection, and you can learn a lot of things from that. For example, you can identify that these two SIM cards are typically used in the same device: maybe they are friends, or maybe they are family members, and they use their SIM cards in the same device. So it's a very cool topic. You can do a lot of things with bipartite graphs with LynxKite. Maybe let me ask one question: how many of you are data scientists doing machine learning? All the others are data engineers? No? So who are data engineers? So who are you guys who come to be data engineers and data scientists? They are software engineers. Who are software engineers? Who are shy? Cool. Thank you very much, guys. I also learned a lot, so you were really great. I usually do this much, much longer, 6-7 hours. You were very fast. It's very late now, it's very warm, so even the dodecahedron was complicated stuff, but you did great. Thank you very much. If anyone is interested in the future in graphs, or in LynxKite, or in Lynx, just reach out to me. You will find me on the meetup. All the companies are hiring data scientists, data engineers, software
engineers, so I don't have to tell you that.