Okay, can everybody hear me? Yes, seems to work. Cool. So thanks, everybody, for sticking around for the last session; I hope I can provide some interesting information and maybe some entertainment. I'm David Bartolini from Oracle Labs Zurich, and I'm going to talk about notebooks as enablers for graph-empowered machine learning. What I will present is work shared with several colleagues at our lab, not just my own.

Let's start from the question: what is a machine learning system? Here's the high-level view. If you're the boss of a company, you hire a bunch of data scientists, and this is what they're trying to do: there is some magic pile of linear algebra, you throw some data in the funnel, and magically you get some answers. And it's even worse, right? Data comes from a bunch of different places: it can be a relational database, some JSON files, some CSV files, different types of data sources. You need to build connectors for those, put the data in the funnel, and then somehow use the pile of linear algebra, which is really a whole bunch of different tools: you have Spark, you have pandas, DL4J, maybe some R, all different tools. And then you're trying to make some sense out of this, and you have this pole that you use to stir the pile so that you get some answers.

So what is the tool that is emerging as that pole? It's the notebook. I don't have killing machines in my presentation, but I have some notebooks. The idea is that the notebook is the tool you use to stir your pile of linear algebra and get some answers. What's nice about the notebook is that, through the concept of interpreters for different languages, you can have one interface where you control what happens to your data, what systems it goes through, and how you build your answers. For example, here (this is clearly a cartoon, but we have implementations that actually work like this) you can have some SQL code that runs a SQL query and gets some data from a relational database; then you pass that data into a Python paragraph that does some pre-processing and generates some entities, and you can model your data, for example, as a graph, which is the main topic of my talk: using graphs for machine learning through notebooks. Once you have a graph, you can load it into a graph processing system, for example PGX, which is what we develop, and then you can query the graph. At the top right there is a PGQL paragraph; PGQL is a query language that we're developing, and it's kind of like SQL, but for graphs, so you can express graph patterns: for example, that there is some vertex v connected to some vertex w through an edge, and then run that as a query.
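To make that concrete, here is a minimal sketch of what such a PGQL paragraph could look like. The session object and its query_pgql() call are assumed stand-ins for whatever client API the notebook interpreter exposes; the PGQL syntax follows the published PGQL spec.

    # Sketch: a PGQL query matching "some vertex v connected to some
    # vertex w through an edge". 'session' is an assumed PGX session handle.
    query = """
        SELECT v, w
        FROM MATCH (v) -[e]-> (w)
        LIMIT 10
    """
    result = session.query_pgql(query)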
The notebook is also nice for graphs because you can visualize them: you can run a query and then get a visual representation of the result, which makes it really easy to inspect. And then you can do some machine learning on the graph and, for example, find graphs that are similar to a given one.

That said, notebooks are a very nice environment, but this talk is not really about notebooks; the notebook is the pretext to talk about machine learning using graphs. Again, the nice thing about the notebook is that, via the interpreters, it makes it really easy to integrate all the data sources, build a graph, and then do some learning on the graph and extract some valuable information. The rest of the talk gives a few examples of use cases where we can use graphs to learn some insight from data: I have a couple of examples about anomaly detection, doing some prediction, and doing similarity search on graphs using machine learning techniques. And of course, I'm not trying to sell any products; I have to put this slide up to make legal happy. I'm just giving some ideas.

So let's go back to the data. Machine learning models, but in general any type of data processing, are only as good as the data. So where is the insight in your data? Especially with big data, everybody is telling you: great, you have a lot of data in different formats; but where is the insight? Where is the valuable information? With a graph, it's in the connections. Let's look at an example: if you want to do product recommendation, what you're looking for is similar people who bought similar products, and then you recommend me those products. But what does "similar" mean? How do you define the similarity? The question is really: how do you look at all your purchase data and find the products that you should recommend? How can you represent this? One answer, which has proved to be pretty good, is using graphs.

The graph is a data model: you can translate from a relational model to a graph model, or from a NoSQL model to a graph model. The point of the graph is that it highlights the connections in the data: in a graph you have entities, which are the vertices, and you have relationships between the vertices, which are your edges. Then, once you have a graph, you can run queries that are quite hard to express in SQL. For example, if you want to find a connection between two entities with any number of edges in between, in SQL you would have to do an indefinite number of joins; with a graph query language you can just express that you're looking for this connection and get your answer.
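As an illustration of that last point, a sketch of a variable-length connection in PGQL, written as a reachability pattern where SQL would need an unbounded number of joins. The 'transfer' edge label, the names, and the session API are assumptions, not queries from the talk.

    # Sketch: can account 'a' reach account 'b' through any number of
    # 'transfer' edges? One pattern instead of an indefinite chain of joins.
    reachability = """
        SELECT a.name, b.name
        FROM MATCH (a) -/:transfer*/-> (b)
        WHERE a.name = 'Alice' AND b.name = 'Bob'
    """
    result = session.query_pgql(reachability)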
There are basically two different ways to exploit graphs for learning something from your data. One is to use the information from the graph directly, by running some graph algorithm to extract knowledge from the structure of the graph, as well as from the properties of the entities, and then use that information directly to answer some questions. The other is to map the graph into something that a machine learning algorithm can exploit, learn from the graph in a more traditional machine learning way, and then get your answers from the model. In the rest of the talk I'm going to show a few examples of both of these ways of exploiting graphs to get value from your data.

The first example is anomaly detection in healthcare billing. This is an example of the first category: we're going to use some analysis on graphs to extract anomalies from a dataset. This dataset is a public dataset from US Medicare; it's basically a billing dataset from hospitals from 2012. It's quite large, although not huge: about nine million records with 29 variables, and it includes doctors and the visits or other services that the doctors performed on patients; it's used to keep track of the costs of healthcare.

So what we're looking for in this graph is anomalies. The graph has, as I said, doctors that provide services, or treatments, or prescriptions, and so on, and each doctor has a specialty: there's going to be a dentist, an eye doctor, all kinds of doctors, and we can expect that each doctor will be offering services of their own specialty. The anomalies we're going to look for are doctors that perform operations which are typical of other specialties. Those are anomalies in the data because they're unexpected, and you may want to look into them and analyze why this is happening: maybe somebody is trying to bill more treatments than they actually provided, to get money from the tax system, or something like that. So how do we find these cases? If you think about anomalies, anomalies are rare just by definition.
We have many of these entries in the dataset, and we're looking for the very few which are anomalous compared to the overall set. And it really depends on the definition of what is normal: we're looking for treatments that are done by doctors who are not supposed to do those treatments.

So let's see how we can do this with a graph. We model the problem as a bipartite graph, with the doctors on one side and the health services, prescriptions, and so on, on the other side. There's a connection between the two sides if a certain doctor provided a certain service at some time, and we have additional information in the graph, like who the doctor is, the type of service they offered, and so on. Based on this graph, what we want to do is find doctors that are close to each other: that's going to define the expected kind of services they provide. Basically, all the doctors from, say, internal medicine will be providing a similar set of services, and another specialty will be providing another set of services. We're looking for somebody who is from one specialty but also provides quite a lot of services typical of another one. For example, here we have a plastic surgeon who is providing internal medicine services. That's the kind of anomaly we're looking for.

How can we define this from the graph? The idea is to use the personalized PageRank score. Personalized PageRank is a slight modification of the classic PageRank algorithm. PageRank tells you how important a certain vertex is in the graph, based on how many connections it has with the other nodes; it's basically the original algorithm that Google started using for web search. Personalized PageRank is a bit different because it starts from a subset of the nodes: for example, we start from all the eye doctors, that's going to be our starting set, and then we simulate random walks through the graph to all the other vertices, and each vertex gets a score which is proportional to the probability of ending up there with a random walk starting from the starting set. So the vertices in the graph will have a higher PPR (personalized PageRank) score if they're closer in the graph, meaning more connected, to the original starting set.

Using PPR is more robust than using something like shortest path, which is the straightforward way to measure the distance between two vertices that you might think of first. If you think about it, shortest path treats the cases you see in the slide in the same way: A and B get the same distance as C and D, because there's just one vertex in between. But intuitively they're quite different: in one case there are many more connections between the two vertices, while in the other everything goes through a single middle node. So the notion of closeness is better captured by PPR than by shortest path. The idea, then, is to use the PPR score to say which are the anomalies in the dataset.
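To make the modeling and the PPR computation concrete, here is a small illustrative sketch using networkx rather than PGX. All doctor and procedure names are invented; networkx's pagerank accepts a personalization vector, which restricts the restart distribution to the seed set.

    import networkx as nx

    # Toy bipartite graph: doctors on one side, procedures on the other.
    G = nx.Graph()
    G.add_edges_from([
        ("dr_adams", "eye_exam"), ("dr_adams", "retina_scan"),   # optometrist
        ("dr_baker", "eye_exam"), ("dr_baker", "lens_fitting"),  # optometrist
        ("dr_chen",  "colonoscopy"), ("dr_chen", "eye_exam"),    # gastroenterologist
    ])

    # Personalized PageRank with the optometrists as the starting set:
    # restarts happen only at seed vertices, so scores reflect closeness to them.
    seed = {"dr_adams", "dr_baker"}
    ppr = nx.pagerank(G, alpha=0.85,
                      personalization={n: (1.0 if n in seed else 0.0) for n in G})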
So we select a specialty, for example optometrists, and then we find the set of other doctors that have a high PPR score starting from the optometrists: those are the anomalies. The issue with this approach is that it produces quite a lot of false positives. Why? Because some procedures are just common, and all doctors perform some of them, so the random walks go through those all the time. If you just look at the PPR score, you're going to get many false positives, and we need a way to filter those out and only get the actual anomalies in the dataset.

The way you can do that is to notice a property of PageRank. As I said, PPR is a special case of PageRank where you start from a subset of the nodes, while PageRank starts from any of the vertices at random. So if you think about PageRank, the prescriptions or operations that are very common, that are done by many, many doctors, are going to have a very high PageRank score, just because they're connected to many vertices in the graph. The idea here is, instead of just using the PPR score to detect the anomalies, to use the difference between the PPR score and the PageRank score of a given vertex. The PageRank score tells you, in general, how connected this vertex is to everything else in the whole graph; the PPR score tells you how connected it is to the starting set. So the difference tells you: if some vertex has a higher PPR score than PR score, it means it's more connected to the starting set than it is, in general, to any vertex in the graph. This is the metric we're going to use to highlight the anomalies.

For example, this is an excerpt of what we get if we compute the PPR minus PR score for all the vertices in the graph and look at the procedures, using the optometrists as the starting set for PPR. The highest-scoring ones are very specific to optometrists, eye-related operations, while the ones with even a negative score, so a higher PageRank than PPR, are very generic operations. This is a quick way to show that the score metric we use makes sense.

Going back to the whole idea, what we do is compute, for each specialty, the personalized PageRank of all the vertices, but also the PageRank, and then mark the procedures that have PPR minus PR greater than a certain threshold as specialty procedures for that specific kind of doctor. So if we start from eye doctors, all the procedures that have a higher PPR score than PR score are going to be marked as specialistic procedures for that type of doctor, because it means they're more connected to those doctors than to anything else in the graph.
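Continuing the networkx sketch from above (the threshold value is illustrative, not from the talk), the scoring step is just the difference of the two scores:

    # Plain PageRank: high for procedures that are common across all doctors.
    pr = nx.pagerank(G, alpha=0.85)

    # Anomaly score: positive when a vertex is more connected to the seed
    # specialty than to the graph at large.
    score = {n: ppr[n] - pr[n] for n in G}

    procedures = {"eye_exam", "retina_scan", "lens_fitting", "colonoscopy"}
    threshold = 0.0   # illustrative; in practice tuned per dataset
    specialty_procedures = {p for p in procedures if score[p] > threshold}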
At this point we still have one problem: the categories of doctors are not a partition, they overlap. For example, we have optometrists and ophthalmologists, which do kind of the same things, because the specialties are related. So what we want to do is not mark as anomalous the doctors in a category where most of the doctors satisfy the PPR-higher-than-PR criterion: in that case we say, okay, that category is just too similar. We're interested in categories of doctors that are not very similar to the one we're looking at, but where just some of the doctors end up being very close.

Let's look at the chart; maybe it becomes clearer. In this chart, the colors indicate the different specialties of doctors, the x-axis is the difference between PPR and PR, the score we're using, and the y-axis is the count: how many vertices exist for each value of PPR minus PR. If you look at the yellow specialties, those have very low scores, even lower than zero. Those are not interesting for us, they're not anomalous: we're looking at the optometry doctors here, and those specialties are very far from optometry, because they have a PageRank higher than their PPR. Then we have the ophthalmologists, which are the second histogram, with a pretty high PPR minus PR score; but the whole category has it, so again, that's probably not an anomaly, it's just a similar category of doctors. What we're really interested in are the short spikes next to the zero line: just a few doctors from those categories have a higher personalized PageRank than PageRank.

So let's look at an example, again starting from the eye doctors. What we find by sorting by the score is that a few specialties are anomalous: we have one radiologist and one gastroenterologist doing operations that are typical of eye doctors. If we look at the gastroenterology case, we have this doctor who is doing removal of eye fluid and all sorts of other eye-specific operations. So with this kind of analysis, you get some results and you can then inspect them and see whether they actually make sense or not. This is a very important property, as was mentioned before: being able to explain what the algorithm tells you is quite important.
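A sketch of that category-level filtering step, continuing the example above. The 0.5 majority cutoff and the doctor-to-specialty mapping are assumptions for illustration; the demo below uses an "anomaly level" parameter instead.

    from collections import defaultdict

    doctor_specialty = {"dr_adams": "optometry", "dr_baker": "optometry",
                        "dr_chen": "gastroenterology"}   # assumed mapping

    # Fraction of doctors per category that clear the threshold.
    cleared = defaultdict(list)
    for doc, spec in doctor_specialty.items():
        cleared[spec].append(score[doc] > threshold)

    # A doctor is anomalous only if it clears the threshold while most of
    # its own category does not (otherwise the whole category is just similar).
    anomalous = {
        doc for doc, spec in doctor_specialty.items()
        if score[doc] > threshold
        and sum(cleared[spec]) / len(cleared[spec]) < 0.5
    }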
Now I have a short demo that shows this implemented in a notebook which connects to PGX, the graph processing system. Let's see if I can play this. Okay, let's see what happens. Obviously not... yes, okay. I hope it's big enough.

What's happening here is: first, it connects to the graph server, which is the graph analysis server, and it loads the graph into memory, the graph that we modeled with the doctors and the specialties. Then we define the specialty of interest, which is optometry, and we define the parameters: the anomaly level is 5%. Then we run PageRank and personalized PageRank on the graph, so those metrics get computed. At this point we run a couple of those PGQL queries I showed before, to select and show basically what I was showing in the slides. This one selects the vertices with the specialty we're looking for, and then it selects the kind of operations or prescriptions that they do. So here we get the list of what eye doctors do, with the score of personalized PageRank minus PR: the top ones are the procedures that are typical of eye doctors. Then we run another query which selects the generic procedures instead. This is again what I showed before: these are the procedures provided by most doctors, the ones we're going to filter out from the anomalous cases, for example initial hospital care and so on. And then we run this other query to identify the anomalous cases, as I explained before, and we get that the most similar category to the eye doctors is ophthalmology, which has very high similarity, and so on. Finally, this is basically the last result I showed in the slides: we get some gastroenterologist who was doing a lot of these specialty procedures, and we can inspect what those procedures are and see the results. We do this both for the gastroenterologist (this is exactly the same result I showed before) and for the other one, which I think was a cardiologist.

What this is trying to show is that you can exploit the connections in the graph to extract some information that may be really non-obvious if you don't consider the connections in the graph. And this is going to do the same for the cardiologist.
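For flavor, the anomaly-selection query in the demo might look roughly like this in PGQL. This is a sketch only: the property names (ppr, pr, specialty, kind) and the session API are assumptions, not the demo's actual code.

    # Sketch: rank non-optometry doctors by PPR - PR, highest first.
    anomaly_query = """
        SELECT d.name, d.specialty, (d.ppr - d.pr) AS score
        FROM MATCH (d)
        WHERE d.kind = 'doctor' AND d.specialty <> 'optometry'
        ORDER BY score DESC
        LIMIT 20
    """
    result = session.query_pgql(anomaly_query)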
This first example, as I said, was meant to motivate the fact that looking at the graph is important and can provide very relevant information. Now the question is: how can we do this in a more automated way? Because here we were going in, defining metrics by hand, and then using the results from the graph directly; there was basically no learning algorithm. So the question is: how do you connect graphs and machine learning? A graph encodes sparse relationships, connections between entities, but machine learning normally works on dense feature vectors. How do you connect those two things?

This is a pretty hot topic in research; there have been several proposals over the past four or five years. The core idea is from 2014, and it's one idea we build on: it's called DeepWalk. The idea is that you can use tools that already exist in natural language processing to translate the connections in a graph into something a machine learning algorithm can understand. You take your graph and do random walks on it to extract strings, where each string is the sequence of visits in a random walk. For example, you start from vertex one, go to two, then to four, then to eight: that's one random walk; then you do another one, and so on, and you generate all these random walks. Then you treat the walks as if they were sentences, and the individual vertex visits as if they were words. You take this representation and feed it into a natural language processing model, for example word2vec or similar techniques, which, using a special kind of neural network, turn these sequences of words (in our case, vertices) into vectors. So you get a vector for each vertex, and the nice property is that the distance in the multi-dimensional space between the vectors of two given vertices will be small if the two vertices are close in the graph. This property is very important, because you can map the graph onto a multi-dimensional space where the distance preserves the vicinity in the graph.
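A compact sketch of the DeepWalk idea, using networkx for the graph and gensim (version 4 or later) for word2vec; the walk length and counts are arbitrary choices.

    import random
    import networkx as nx
    from gensim.models import Word2Vec

    def random_walk(G, start, length=10):
        # One random walk: the sequence of visited vertices, as strings ("words").
        walk = [start]
        for _ in range(length - 1):
            neighbors = list(G.neighbors(walk[-1]))
            if not neighbors:
                break
            walk.append(random.choice(neighbors))
        return [str(v) for v in walk]

    G = nx.karate_club_graph()                                  # stand-in graph
    walks = [random_walk(G, v) for v in G for _ in range(10)]   # the "sentences"

    # word2vec (skip-gram) over the walks: one dense vector per vertex, with
    # vertices that are close in the graph ending up close in vector space.
    model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1)
    vec_of_vertex_0 = model.wv["0"]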
Let's look at an example application. Say we have a dataset with students and courses at a university; for some students we know the department, for others we don't, and we want to predict the department for the students we don't know it for, by leveraging the graph. If you think about it, this is the same as a customer segmentation problem, where the student is a customer, taking a course is like purchasing an item, and the department is the segment: we're trying to predict the segment of the customers whose segment we don't know.

So what you can do is use the technique I just explained, and I'm going to show results for four different cases. The first case is a traditional convolutional neural network trained on just the features of the vertices, like the age of the student, the courses they took, and so on. Then we try using just PPR, the same as before, with the PPR score as the prediction metric: for the target student, we take the student vertex with the highest PPR score, and its known department is predicted as the department of the target student. Then we mix the two things and train a convolutional neural network on the embeddings generated using DeepWalk. And finally, we add the categorical features to the same network.

The first result, which is probably quite surprising, is that just using PPR is better than using the convolutional neural network on the features. It's surprising but maybe also obvious, because what you're doing here is using better information to do your prediction: you're using the connections from the students to the courses, the structure of the graph. And the nice thing about PPR is that it's completely unsupervised: you don't need to train anything, so this is an example of the same technique we discussed before. Then, when we learn the embeddings and train the network on them, we get the green line. By the way, the x-axis is the training iterations of the neural network, and the y-axis is the accuracy, which obviously tends to improve with more training. The green line approaches the limit of PPR with more iterations, because you're basically using the same information, just translated into the vectors. But if you also use the other features, you can get better results than both, because you're simply using more information: in some cases the structure of the graph was not enough for the prediction, but if you also use the categorical features, you can do better.
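As a sketch of those last two cases: the talk trains a convolutional neural network, but this uses scikit-learn's logistic regression just to keep the example short, and all the array names are assumed inputs.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Assumed inputs: per-student DeepWalk embeddings, one-hot categorical
    # features, department labels, and a boolean mask of labeled students.
    # embeddings: (n, d), categorical: (n, k), departments: (n,), known: (n,) bool
    X = np.hstack([embeddings, categorical])   # embeddings alone = the third case

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[known], departments[known])      # train on students with a known department
    predicted = clf.predict(X[~known])         # predict the rest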
Let's go one step beyond. In the students case, I was trying to predict a property of a certain vertex in the graph. Another problem people have looked at is classifying entire graphs. Say I have a large graph made up of smaller graphs, and I want some way to compare them: for example, I have this one graph which is, for whatever reason, interesting, and I want to look for similar graphs in my space of graphs. How can I do that? There is, again, a technique from research, and we have our own extension that enables you to do this.

First, an example to make it a bit more concrete. Think about financial transactions, where you want to identify money laundering in your graph: circles of money that somebody is trying to hide. What normally happens is that there is a department that looks at these transactions and tries to identify patterns that correspond to money laundering; there's some system that identifies these patterns, and you end up with a lot of false positives, and many specialists need to go and try to understand whether each case is actually money laundering or not. So the question is: can we train a machine learning model to even learn new patterns from the known ones? If I have a library of patterns, defined by subgraphs, that are labeled as money laundering cases, can I train a machine learning model on all the patterns in my graph and find similar ones, which can raise an alert and be investigated further? This is the case we're talking about now.

The idea is quite similar to before, but now you don't have just one graph, you have several graphs. For each of them you do the random walks we did before, and you again get all those sentences for the walks; but now you split them up into paragraphs, where each paragraph corresponds to one of the subgraphs. Then, again, you turn to natural language processing and look for a technique you can use for this, and it turns out there is one: it's called paragraph2vec. Word2vec maps words to vectors; this maps paragraphs to vectors. It's a bit different, but it gives a similar result. Now what you get is a vector for each of the graphs: the first graph has a vector, the second graph has a vector, and so on, and again the distance between these vectors in the vector space (specifically, the cosine distance is normally used) will be small if the two graphs are similar. This is the technique you can use to identify similar graphs in your space.

What we've been working on is this thing called PG2Vec. Graph2vec is, intuitively, mapping graphs to vectors, but it only looks at the label of each vertex: it just records that I go from vertex one to two to three and so on; it doesn't look at the properties. But in real graphs, the properties, the attributes of each vertex or even of the edges, can carry a lot of information. So what we do here is consider not only the ids of the vertices in the random walk, but also the properties. Then, if you have two different subgraphs where the random walks are similar and they also have similar properties, those two will be more similar than two other graphs where the topology is similar but the properties are different. Basically, we're adding more information for the algorithm, to improve the accuracy. Additionally, instead of considering just the vertices in the walk as the words that we feed to the natural language processing algorithm, we consider edges, that is, the pairs of vertices that you visit. This also improves accuracy, because it provides more information to the neural network, which knows not only that I went from here to there, but that I connected these two specific vertices through an edge. Additionally, we also put global properties of the subgraph into the random walks: for example, in the financial transaction use case, you may know that in this specific pattern two entities are similar, or you may know the size of the subgraph; if you insert that information as well, you can get even better accuracy.
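A sketch of the paragraph2vec step with gensim's Doc2Vec (gensim 4 or later). The walks_of mapping from subgraph id to its walks is an assumed input; PG2Vec would additionally mix vertex properties, edge pairs, and global subgraph properties into the token stream, which is not shown here.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # One "paragraph" per subgraph: all its walks flattened into one word list,
    # tagged with that subgraph's id.
    documents = [
        TaggedDocument(words=[w for walk in walks_of[g] for w in walk],
                       tags=[str(g)])
        for g in subgraph_ids
    ]

    model = Doc2Vec(documents, vector_size=64, window=5, min_count=0, epochs=40)

    # Usage: the four subgraphs most similar (by cosine distance) to a query one.
    similar = model.dv.most_similar(str(subgraph_ids[0]), topn=4)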
With this algorithm, we ran an evaluation on a dataset about cancer. It's about chemical compounds (not proteins, if I remember correctly) that may or may not be related to cancer. Some of them are labeled, so you know that this compound is related to some kind of cancer, and some of them are not, and you want to find out, for some unknown compound, whether it can be related to cancer or not. This is an example of a visualization of these graphlets, two of these compounds: each is represented by vertices that have a certain label attached, connected into some kind of molecule. In the evaluation, PG2Vec gets very high accuracy on predicting whether a compound is related to cancer or not, much better than graph2vec, because we're using the additional information.

I have another short demo about this, if I can also run it, which shows a notebook environment that we're developing. So this is getting the graphlets similar to a starting one, and then you can get the properties, or the count of the labels, in all the graphs that the algorithm found similar, and you can see that they're quite close. This is, again, a way to inspect the results of the algorithm. These are the four closest ones, and then we can do visualization, of course: we can just run it and get the query graph, the one we were looking for similar graphs to, and the ones that were found. You can see there is this circular pattern, with the vertices labeled two and three, that recurs in the other ones: you can actually look at the data and see that what the algorithm told you is similar is indeed visually similar.

So this is all I had. This is actually implemented: we're going to have a beta version that includes these machine learning libraries for graphs. It's not available yet, but we're going to publish it. PGX, which is the system I've been using for the graph analysis, is part of the spatial and graph options for big data and for the Oracle Database. With this feature, which is going to be in beta for now, you can load the graph, create a graph model, compute it, and then export it and use it for whatever machine learning task with your favorite ML framework. We're not building an ML framework; we're just building the part that extracts the data from the graph into a form that a neural network or some other machine learning algorithm can understand, which is kind of the missing link to exploit this information. I think the QR code should point to the Oracle Technology Network page for PGX; this is just for the graph engine. The version that's out there now does not have this feature yet, we're still working to publish it, but pretty soon we should publish 3.2, and you can just download it and, for trial purposes, run it and see if it works for your use case. That's all I had. Thanks.
I think we have a bit of time for questions before getting a beer. Any questions? We have a question here.

"You have shown in the demonstration how you were loading the data. How long does it take to load the nine million records?"

I don't know; that's a very hard question, and the demonstration was a bit compressed, so I don't remember: that's the real answer. But the comment is: it obviously depends on where you're loading the data from. For example, if your data is in a database, what we can do is run multiple connections to load in parallel, so you can speed it up; but eventually it depends on the bandwidth you have to your data source. Part of the task of loading is reading the information about the vertices, and part of it is also creating some indices on top, which allow the graph engine to answer queries fast. So part of it is creating the index, but that should hopefully be fast, because it's parallelized. The general answer is: I don't know, but it should be fast. Thank you.

So thanks again, and I'll be around if anybody has questions.