Hello, and welcome. My name is Shannon Kemp, and I'm the Chief Digital Officer of DATAVERSITY. We would like to thank you for joining the latest installment of the monthly DATAVERSITY webinar series, Advanced Analytics with William McKnight. Today William will be discussing Understanding the Modern Applications of Graph Databases. Just a couple of points to get us started: due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A section, or if you'd like to chat with us or with each other, we certainly encourage you to do so. Just to note, the Zoom chat defaults to send to just the panelists, but you may absolutely change this to network with everyone. To find and open both the Q&A and the chat sections, you can find those icons in the bottom middle of your screen. And as always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and any additional information requested throughout the webinar.

Now let me introduce to you our speaker for the series, William McKnight. William has advised many of the world's best-known organizations. His strategies form the information management plan for leading companies in numerous industries. He is a prolific author and a popular keynote speaker and trainer. He has performed dozens of benchmarks on leading database, data lake, streaming, and data integration products. William is a leading global influencer in data warehousing and master data management, and he leads McKnight Consulting Group, which has twice placed on the Inc. 5000 list. And with that, I will give the floor to William to get today's webinar started. William, hello and welcome.

Hello, and thank you, Shannon. Coming to you hot today, in at least one way: I'm in Texas. That's how it is here. And I welcome you all to Understanding the Modern Applications of Graph Databases.
This is always an interesting topic for everybody. I feel like graph databases have yet to have their big day in the sun in the form of being present in every enterprise. I'm a huge advocate, though, and that may come through. I am not a full-on graph guy, meaning that's not how I spend my day, every day, all day, like some people. But I think that's an okay position to present this from, because I see a lot of different architectures and different ways to do things, and I can kind of put things together in a good way, I think, and that's what I'm going to do for you today for graph databases. I'm going to give you lots of case studies, use cases if you will, and hopefully you can find your way into one or more of the use cases if you're not using graph already. There will be a little bit of graph 101 in here, just so that everybody can come along with me into the use cases. And at the end I'm going to talk a little bit about an emerging technology that I'm being asked a lot about, that I think has some overlap with what we're talking about here today, and that's vector databases, so stay tuned for that.

Now, let's start with this: what technology do most graph databases get their bounce from? It would have to be relational databases. Graph databases form the foundation for a lot of greenfield, or new, applications that enterprises are taking on for the first time, of course. But at the same time, there are a lot of relational databases out there trying to do graph things, maybe the hard way, and maybe there's an easier way. So we need to make it a priority to get those workloads over here, and I'll make the case a little bit here today for why that would be. The idea of everything in columns and rows is really a thing of the past; SQL becomes less important as technology evolves. The technology of data will certainly evolve. We're certainly not sitting at any place in time where there's one size fits all, and there's one place for all data, and that's it.
There are many places for data, and you have to get it right and get the data flowing in the right way as well. This does not mean that all data is in one and only one place. Certainly not. As a matter of fact, a lot of data that ends up in a graph database is somewhere else as well, but not for the workloads that the graph database supports.

So graph is really the connecting topic out there. It's for these networks and relationships. As will be shown here, we can make better predictions by utilizing relationships, and a lot of large companies are using this right now. And besides companies as well: if you're using any Spotify list, that uses graph machine learning in the background. If you've used Google Maps, you've used graph machine learning. There are also the companies that give us medication; any drugs out there are researched with graph databases, and a lot of the modern research of anything, really, is actually driven by graph research. What it does is extend the reasoning from a single entity to reasoning about an entity in context, in the context of all of its relationships.

Relational database systems are very, very necessary inside of enterprises, of course, and they're really great when you care about individual records. For example, if you want to insert and retrieve information about individual products, relational databases are great for that. On the other hand, a graph database is great at capturing the dependencies between entities and extending the reasoning from a single entity to reasoning about an entity in the context of its relationships.

I'm going to spend a little bit more time on relational databases. One of the big things that graph does over and above relational databases is that it helps you avoid complicated joins. We've all written them, right? I certainly have: joins that go on for pages and pages, that involve dozens and dozens of tables. Graph databases are great for avoiding that, if you've ever written anything like that, or anything like a self
referencing, self-referential join, you know, a table that is joined to itself to simulate something like a graph traversal. You might think of an org chart, where there's the CEO, there are people who report to him or her, people who report to them, people who report to them, and on and on. For some organizations that goes on for quite a while. You turn this from lots and lots of complicated code into a line or two in graph. In previous incarnations of my graph talk I've proven this out and measured some complex joins and so on. I won't do that here, but that is very much a true statement. Relational database systems are great if you care about individual records; if you want to insert and retrieve information about individual products, they're great. But with graph databases, again, we're going to capture the dependencies, and in a graph you can simply add new nodes and evolve your application.

Now I'm going to preempt the question that you may have, that I usually get around this time, and that is: what about what the relational databases are doing for graph? Well, what they're doing, some of them anyway, is they allow you to point to certain tables and say, there are my vertices and there are my edges, and I have a few graph algorithms built in here, so off you go and do it. There's the data, but the data is not organized in a graph way, so the performance tends to be very degraded compared to a graph database. What I've found is that some organizations that have gone that way have wanted to go further, but the performance has been a limitation. So in general I'd like to say that if it's sort of a low-level, wet-your-toes-in-graph kind of application, sure, try it out with your relational database and what it provides for graph. But otherwise the graph database market is very robust, and again, it's not like we have one size fits all anyway, so graph databases make an excellent addition to most
organizations.

So there are two types of graph. The first one I'm going to talk about is a property graph, and this is what's called a domain model. This is your modeling aspect of a graph database. You may want to model your data as something that looks like this before you go plowing it into a graph database, to see what those entity, or domain, relationships are going to look like. This is Northwind, for anybody that's familiar with that. Essentially, the vertices are the major nouns of the business, the things of the business, and we see a lot of them here in this graph: suppliers, products, shippers, regions, etc. So they're the major nouns. Now, the attributes are on the edges, so there is that too: the edges being the relationships, you have attributes on the edges, and this is one thing that is going to differentiate this type of model from the next one that I'm going to show you.

So the property graph, again, has entities, which are called vertices or nodes, and links between them, called relationships or edges. Nodes and relationships can also contain properties and attributes. I would not use the property graph for things that have billions of rows, or billions of nodes, I should say. Essentially what we're talking about there are events. There are many graph databases that capture events, but most of them fall into the second category, which is going to perform a little bit better at that level. But the property graph that we're looking at here is very accessible, very understandable. It's close to what programmers are used to, what you all are probably used to in the database world. It's closer to a relational database, it's easy to get started, and you get some really cool graphics right out of the box. For this model we're talking about Neo4j, we're talking about TigerGraph, and some others, and the languages that they use are called TinkerPop and Gremlin, Cypher, and AQL. I used to show examples of those. I won't do that here, but hopefully you get the idea. Now the
other type of graph, for which there are a lot more graph databases in the market because it's based on a published paper, is called semantic or RDF graphs. A semantic representation has what's called triples, and a triple is a subject, predicate, object. Here we see an example of three triples. In triple number one, the subject is John Peterson, and the predicate is "knows", because he knows, apparently, Frank T. Smith. So that's your first triple, simply put. Now, in an RDF graph you cannot have attributes on the edges, on the relationships, so for that you have to fork off another relationship from that relationship. So in triple two in this case, the subject is actually triple one, the relationship between the vertices of triple one; the predicate is, let's say, confidence percent; and the object is 70. And so on and on. These are great for scalability; these are great for events. You might also hear about quad stores, which store a fourth property of the graph, but essentially it's the same thing as what we're looking at here. This is a graph database using the W3C standard stack, including RDF, the Resource Description Framework, as well as many other standards that will be described as we go on.

Some examples of things that you need this type of graph for, that are really going to scale over a billion vertices and edges, are world air traffic control, global financial transactions, world internet connectivity, world airline and airport connectivity, large social networks, web traffic, global trade, global telecommunications networks, and national energy grids, stuff like that. The language used here is SPARQL.

Now I'll mention, for the first and maybe last time today, this concept called a knowledge graph. It used to be that people would refer to these RDF graphs as knowledge graphs no matter what you put into them, but the terminology has kind of changed out there. And again, I don't like to establish terminology; I like to
help you with the terminology that you're going to encounter out there. You're going to encounter knowledge graphs, which are essentially graph databases, and I would say of either type, that contain knowledge about the corporation. I'll just kind of leave it at that. You'll see the term used and abused out there, so it lacks a great, true definition, and I won't go further with that.

Now, visualization. This is one of the big benefits of a graph database. It may look like a jumbled mess to you right now, but if you knew what you were looking for here, and you knew what these things were... and as I scrunch my eyes into this thing a little bit, it looks like these are people. Yeah, these are people connected in some sort of social network. Think about LinkedIn or Twitter, something like that. Everybody's connected in here. Some people are connected to more people than other people, obviously, and everybody's connected to somebody different. We're probably connected on LinkedIn, me and you, by one or two orders, maybe even three, but we're connected, and the graph shows this. You're able to navigate around, you're able to pinch and squeeze in there and get more information that way, in a visual way, and a lot of people just really like this.

Now notice the coloring as well. There's blue, purple, green, yellow, etc. What the graph database has done here is it's seen that there are some commonalities, some clustered-up groups of people. For my LinkedIn graph, for example, I might have a color for my Teradata contacts, a color for my IBM contacts, a color for this, a color for that, and so on. So it's able to intelligently cluster up your contacts as well, which is a great thing.

Now the algorithms. They're so exciting, and this is where I say that even a relational database might allow you to do some form of these algorithms, but you're doing it on relational data, which tends to underperform. I'm just showing you some of my favorites. There are a few more, but these will
help you get a real good grasp on the possibilities. Again, anything that has to do with a relationship should be in a graph database and taking advantage of these algorithms.

The first one I have, in the upper left, is called PageRank. Some of you may have heard of PageRank. Very exciting. I'll use "page" as in web page, but it could be anything. So what web page should Google point you at if you're looking up something like, I don't know, data warehousing? Well, it has to rank pages by that term, so it looks at pages with that term, and it looks at the clout that each one has. A page has more clout based upon, simply put, what other websites point to it. Well, it can't just be any website that's really going to give you clout. If you have high-value websites pointing at your page, that's better. But how do we know how high-value those websites are? Well, same thing: what websites are pointing to them? And on and on and on, recursively, for about 20 levels, and you end up with numbers like what you see here. That helps Google, for example, determine what page to show you, but it just helps prioritize anything and everything that you're putting into your vertices, so it's great for that.

Other algorithms I want to share are closeness, which helps you understand the closeness of any two nodes on the network, and also, here in the lower right, I'm showing you betweenness. So what is the betweenness value of any node? Well, that has to do with how much it is a connector node between large clusters of other nodes on the network. The graph database can find them. We can kind of find them visually here, but the graph database can highlight them and assign numeric values and so on, so it's a wonderful thing.

Here are a couple of others: cascading coefficient, or cascading churn. My example here is going to be telecommunications. There may be a group of, I don't know, five people that I call
on a regular basis. When friends are in on an opportunity, they tend to share that opportunity. So it's likely that if a friend of mine in my close network is going to churn off of, let's say, AT&T or T-Mobile, what have you, I might get pulled along with that churn. And AT&T or T-Mobile, whoever, would not want that, probably, so they would have some outreach to do to me. But they wouldn't know to do that unless they knew that I was part of the inner circle of somebody who was already making waves about churning, or demonstrating through their lack of calling, or their use of the call center, etc., that they are about to churn. So cascading coefficient is really great for churn and preventing churn.

Eigenvector centrality, let me say that right, is another one that I want to share with you. It's a measure of the importance of a node in the network. Take, for example, the graph piece that I'm showing you here on the right side of the slide, where you see some people that are apparently important. They're sitting in the middle of a cluster of nodes; they're connected to everybody. But this poor person out here in the middle is only connected to three people, so that must be an unimportant person, right? Well, no. It has to do with who they're connected to. This might be, for example, the sales manager. The sales manager only talks to the account reps, and the account reps talk to the junior account reps and so on, of which there are many. But the important person here, if you take a 10,000-foot view, is that one in the middle. Eigenvector centrality helps you see that.

Now, there are many other algorithms. I'll kind of stop here for now, but some of the others that are really cool are minimal spanning tree, shortest path, cycle detection, and maximum flow. So there's an algorithm out there for you no matter what your graph workload is.

So how do you find your way into a graph database, into one of these use cases? What's the common measure of the use
cases I'm going to show you? Well, these are them. If you have these kinds of questions, a graph database may be for that workload:

In what order did a specific set of events happen? Are there patterns of events in our data that seem to be related by time? Are multiple things happening at around the same time? How far apart are two nodes, and how strong is their relationship? What are the identifiable social groups (remember the colors that I showed you before on that graph), and what are the general patterns of such groups? What do the groups do, and what does that mean that every individual in that group might end up doing? Do we like that? Do we not like that? How can we prevent it? How can we encourage it? How important is any given actor in any given network or event? Hopefully that came through a little bit in the eigenvector centrality example I showed you. What type of messages emanate from a specific area? Here again we're trying to show some commonality inside of groups, maybe social groups. What is that group up to? And for that we need to know who's connected and how strongly they're connected. I'll have an example of how graph databases show "how strong" a little bit later.

But I couldn't help but notice, as I look back on our graph implementations, on the ones that I've been exposed to and read about, that a lot of them have to do with playing defense. Yeah, playing defense for the company, and by that I mean preventing or managing risk, preventing churn, preventing shrinkage, things of this nature that you're just trying to make sure do not happen. Now, there are plenty of offensive moves that you can have here also, in terms of, for example, targeted marketing and so on; there will be some examples. I would also say preventive maintenance, maintenance of any kind really, is part of defense. So graph databases are going to play a lot of defense for you inside of your enterprise football game.

So how do you identify a graph
workload? Well, when I hear these words, network, hierarchy, tree, ancestry, or structure, then my ears tend to perk up, and I tend to think maybe we're talking about something that is good for a graph database, if somebody's describing their business issue in these ways. Or if you are planning to use relational performance tricks. Or if your queries are going to be about pathing: what's the best path, how do I get there from here most efficiently? Or if you are limiting queries because they're so complex. Maybe you're doing the workload in a relational database, but it's gotten really complex when it comes to your queries, and when you look at your queries you can see it has to do with all these relationship types of things. And if you're looking for non-obvious patterns in the data, and you want the database to expose these patterns.

A quick POC with a graph database can be very impressive for all of these, by the way. I do a lot of POCs or MVPs of all data technology, and sometimes the graph database ones are the most impressive, and I would say some of the quickest to spin up for us. So keep that in mind if you're thinking, oh no, it's a big mountain to climb to actually even show this. Maybe not.

These are some of the major graph databases out there. Again, I'm just trying to help you find your way into the presentation. I'm not trying to advocate for any of these, and the relative positioning of the logos means nothing, by the way, in case there are any vendors out there. But let me just share with you some things about some of them.

Neo4j (the "j" is for Java; I don't know why we care about that) is the most widely used graph database. It provides ACID-compliant transactions, it has its own language called Cypher, and it's very much a property graph database; that's the first model that I showed you. ArangoDB is an open source multi-model database, so that means it does other things: it does document stores and key-value pairs. I had a presentation, by the way, in November on
multi-model databases, so if you care about that, you might want to go back and look that one up on YouTube or dataversity.net. Azure Cosmos DB, again a multi-model database, offers global distribution, auto scaling, automatic partitioning, low-latency reads, and transactional consistency. Good stuff. TigerGraph is another document store, and it's an open source graph database with great native parallel graph computation, automatic indexing, real-time analytics, and horizontal scalability, and it has its own language as well, called GSQL. I really like some of the things that TigerGraph is doing; it's very much leading out there in terms of some of the innovation. Amazon Neptune is a fully managed graph database service offered by Amazon. And there are plenty of others too. You might be familiar with OrientDB, Titan, FlockDB, AllegroGraph, and on and on. Most of them are RDF graphs.

So, your maturity. As we'll see in the use cases, perhaps, the maturity model for graph databases looks something like this. At the very lowest level, you might just care about the visualization. There might be users that want to pinch, they want to drill in, they want to scroll around that graph, and that's great. That's a great start. There's plenty more that you can get into, though, like those graph algorithms that I was showing you. This is more complex functionality, for example finding the shortest path between two cities, so shortest path, and showing a relative priority of the vertices. I can't overstate that enough. People look at a graph and say, oh, I see the relationships. Yeah, that's important, but the relative value of each node in that graph is equally or more important, depending upon the workload. And finally you have graph AI and ML. This is where you can really get some insights out of graph, where it can find and identify sub-communities within the graph, find little ecosystems that are in a big old graph, and identify the players in that
ecosystem, and what's going on there, and what's going to happen. Again, it's a lot of analytics here. We're trying to predict the future; we're trying to understand what's going to happen and understand whether we like that or not. Graph AI and ML is also great for entity resolution, which we'll see soon. And, for example, it might identify nodes that, if they're going to fail, put the entire network at risk. That is very difficult to figure out otherwise, without graph AI and ML identifying those choke points. So graph machine learning is really training probabilistic machine learning models based upon the graph, and graph can provide lots of great input to machine learning algorithms elsewhere in your enterprise.

So let's talk about some of the use cases now. I'm going to start out by just showing you some categories of use that can be applied to multiple verticals, and then I'll get into some vertical examples.

The first one here is fraud detection. Like I said, we're going to play a lot of defense with graph databases. Fraudulent activity frequently involves a number of parties well beyond the victim. There are middlemen galore, potentially, and fraudsters all over the place. A graph database can recognize and connect, for example, multiple email addresses that are used by a fraudster to create fake accounts. You must be able to spot fraud as it happens, so those queries have to be really fast. Because the graph can process large amounts of data in real time, it makes itself ideal for fraud detection. So storing all historical transactions, thinking about which ones were fraud, looking at the patterns, bringing those patterns to bear on transactions happening right now, and determining what you want to do about it: that is fraud detection, and a graph database has all of those characteristics.

Entity resolution. This is another category, a very important category for graph. It's the ability to identify and resolve different nodes to the same
entity. For example, occasionally my last name of McKnight means that companies get confused: sometimes with, sometimes without a space after the "c", sometimes capitalization is important, and so on. This really shouldn't be a problem anymore, but it still can be, and entity resolution can help with something as minor as that, but also something as major as contributing to the fraud detection that we just saw. It's actually quite a challenge to turn all this information into which two nodes, or 20 nodes, are the same entity, if you want to figure out, hey, are those entities here actually the same, and there's no single identifier that we can use to do that. So it's really a wide range, which is why it feels so fundamental.

With entity resolution, you have multiple choices of how to build that. In the end, you can write very complex SQL inside a relational database to give you that kind of context. That might be pretty cumbersome and hard, and let's face it, if something's hard, it either doesn't get done or it's very hard to maintain. And so whenever it does finally get done, we tend to walk away from it and say, okay, it got done, I don't want to see that again forever. On the other hand, we can also use graph analytics, and we have some customers who actually use that to do entity resolution.

Now, the Louvain method, which is not on the slide, is an algorithm to detect communities in large networks and resolve nodes to a community. It creates a modularity score for each community, where the modularity quantifies the quality of an assignment of nodes to communities. This means evaluating how much more densely connected the nodes within a community are compared to how connected they would be in some sort of random network. So it really helps to identify communities and the participants in each community. The Louvain algorithm is a hierarchical clustering algorithm that recursively merges communities into a single node and executes
the modularity clustering on the condensed graph. I did not do Louvain justice with the French pronunciation there, but hopefully you can identify it when you see it. Entity resolution is great for cyber fraud; for financial services, offering the right next best product; for the supply chain, making sure you've got your product set straight; and for customer identification, making sure you're identifying the customers correctly. So a lot of enterprises are using graph for this.

Now here's another general one, my last general one actually: network attack. This applies to anybody that has a network. So you've identified an attack on your network, and now you want to find similar patterns in your graph. This would look like a similarity algorithm being applied to the data (we saw some of those earlier), where we don't just look at the node itself but at the node's context. Using a graph database to support a network attack prevention use case would involve designing a graph schema to represent the different nodes and relationships amongst the entities in the network (we're talking about switches, routers, etc.), storing the data regarding the traffic flow, using analytics and machine learning techniques to identify malicious activity, and so on. So again, you can map your network into a graph, and then you can map the activity on that network into the graph, and that can help you identify what's the good activity and what's the bad activity here, and most importantly, what activity we want to allow and what we want to do something else about as we see it.

Now, healthcare fraud. I've had the pleasure to be involved in one of the implementations of this with graph, and it's actually quite exciting. There are a lot of different types of nodes on one of these graphs. They're not homogeneous at all; they're very disparate. So you've got your drugs and treatments, you have prescribers and consumers, doctors, pharmacies, medications, and so on. And so what we see here is an example of
you see the lines between the nodes, and some of them are thicker than others, and it's called out here as excessive relationships. Yes, too much merits further investigation. This is the type of graph that we use to identify what the pill mills are. I'm sure you're all familiar with those: pill mills will dispense certain controlled drugs like candy, and that can be a problem on many levels, as you can imagine. So with a graph monitoring all the activity in the network, we can see where the line is getting way too thick and do something about that. That's just one example. There are actually a few cases of healthcare fraud that graph is great for.

Online shopping. We all do it, and a lot of fast context is brought to your shopping experience: "you might like this," or "this goes with that," that kind of thing. That comes from a graph database. In order to do this, it needs to be able to recall past similar interactions, not just by you but by people that are like you. And who is like you? Well, that depends on different things, depending upon the context of the interaction. Sometimes somebody that lives across the street from you is someone like you, no matter what they look like, and sometimes it's somebody on the other side of the world, because they look like you. So it all depends. But identifying like people is one aspect of this, and then identifying the right product to pitch, show, etc., is the other thing. So you need probabilistic models. This brings to bear your product catalog, which is frequently in a graph database, and also all manner of shopping attributes. Also, when a shopper searches for something like red purses, for example, the app knows what details to ask about next, because it knows all the attributes that are associated with red purses: what kind of red, how big for the purse, what type, style, brand, size, what's your budget. These are all nodes in a property graph. Property graphs are behind a lot of
online shopping experiences. Also, as it accumulates this information by traversing through the graph, the application is continuously checking inventory for the best match. You can see there are so many applications of graph databases. This is real-time decision making, and this also applies to something like Uber Eats, where it's essentially online shopping, but for food, and that's become quite big lately.

Now, a major insurer, another use case I had the opportunity to work with. We're trying to play defense again: get insight into the risk environment. What risk? Risk like people appearing in multiple policies and claims. There is so much fraud that can go on here: premium leakage, i.e., you're underestimating the mileage; undeclared drivers; false garaging ("yeah, I really live across town, but I'm here," you know, that sort of thing); all sorts of pre-existing condition types of things as well; padding claims. A policyholder graph with the risk indicators is what's called for here. Bringing all the customer data into a graph database allows them to reveal the true risk exposure and detect uncovered risks or overlapping coverages, in particular in a motor or household context. By starting with their location, the address they live at, and the other people that appear to be living at that address, working at the same workplace, connected on social, people you're hanging out with, etc., you can quickly start to build a picture of the kind of relationships this person has with other people, and that is something very important. There are various ways that we can identify behind the scenes who is in a network, who you hang out with, and so on. Obviously social is a big part of that. Geolocation is a big part of that. The use of credit cards at the same time in the same place, that's another part of that. So insurers can start to see what your network is. They bring in third-party data for a lot of this, and it's going to drive a lot of coverage, especially as we go into the
future.

Television, magazine, and media: analyze content and consumption for personalization. This is the personalization journey. One company, for example, began leveraging a detection algorithm called weakly connected components to find subgraphs within their multi-billion-node data set (this is a huge media empire) that can be attributed to distinct profiles. They then used the more accurate profiles to create distinct audience segments, which is the holy grail of any media property's advertising business: identifying audience segments down at a granular level. Now, Nielsen happened to be a former client. They are the leader in measuring digital audiences, guiding around half of the advertising sales tied to streaming platforms, but many streamers like YouTube use their own yardsticks. And you're not just using your yardsticks to gather the information; you want to figure out what to do with the information. So what's the next best thing for Netflix to post to you, for example, and handle you in that way?

Preventive maintenance. In a car example, take the radiator, which is actively dissipating the heat that builds up in the cooling system. As the coolant runs through the radiator, the walls of those passageways start to develop some residue, and debris running through that cooling system may also cause a blockage. This is not a good thing. When this happens, the radiator's cooling abilities decrease and you're likely to overheat. So when one component fails to work properly, other parts throughout the cooling system also run the risk of failure. Well, what other parts? Based upon your history of failures, your history of maintenance, and what you know otherwise, which can all be plowed into a graph database, those parts that commonly fail can be identified: let's say, after a radiator goes bad, the thermostat, the water pump, and the heater core. But with graphs, you don't need to know this. The graph itself will identify what the similar parts are.
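
To make the weakly connected components idea from the media example concrete, here is a minimal Python sketch. It is not the company's implementation and not any vendor's API; it assumes a small in-memory edge list and uses a plain union-find, and the identifiers (devices, cookies, emails) are hypothetical. Edge direction is ignored, which is what "weakly" connected means.

```python
from collections import defaultdict


def weakly_connected_components(edges):
    """Group nodes into components, treating every edge as undirected."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for src, dst in edges:
        union(src, dst)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Sort for a stable, readable result.
    return sorted((sorted(group) for group in groups.values()), key=lambda g: g[0])


# Hypothetical identifiers observed together in viewing sessions.
edges = [
    ("device:tv-1", "email:a@example.com"),
    ("cookie:123", "email:a@example.com"),
    ("device:phone-9", "cookie:777"),
]
profiles = weakly_connected_components(edges)
for profile in profiles:
    print(profile)
```

Each resulting group is a candidate "distinct profile": every identifier in it is reachable from every other one through some chain of observed links. At multi-billion-node scale this would run inside the graph platform rather than in a Python dictionary.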
Pharmaceutical research: very exciting things going on here. These companies deal easily in the billions of nodes. They need to share their research throughout disparate parts of what commonly is a large company, to increase research and operational efficiency, increase the output, and accelerate drug research. They really like the visualization capabilities of graph; the scientists are using it a lot. Think about one of the things that they're working on very hard these days, and that's DNA. Think about that data set. I think that's a strong data set for the future, a strong third-party data set that is coming. I've said a lot about this elsewhere, but I'll just say here that DNA contains all the genetic information necessary for an organism to develop, function, and reproduce. DNA encodes the information as specific sequences of nucleotide bases. Scientists have found the first genetic instructions hardwired into human DNA that are linked to things like being left-handed, for example, and it'll go on and on from there: cystic fibrosis, sickle cell disease, and so on. So overall this is a great thing, but you have companies that need to share research from disparate parts of the DNA, with each department of the company, if you will, working off of different DNA segments. A person has a total of 46 DNA segments, 23 passed from each of their parents. The DNA letters that mark the spot for cystic fibrosis, for example: that gene is one out of three billion letters in the human genome. It's just impossible to deal with on a human level, so a lot of those are in graph databases.

And finally, anti-money laundering. This is also an exciting use case. Again, we're playing defense; sorry about that if that's a problem, but we have to do it. Money laundering conceals the origins of illegally obtained money. This might be through insider trading; drug trafficking, kickbacks, and extortion are examples of
crimes that require laundering large sums of money by principals through agents, and the agents might be individuals, corporations, financial institutions, or law firms. The graph unlocks the wealth of insights found by pattern matching on connected people, companies, financial institutions, places, and times in a financial network. Here again, we're using the graph to identify the close relationships. Two entities, for example, might make payments to similar counterparties and may be affiliated with the same legal entities. These two entities may be associated directly or indirectly with entities that are on a watch list. Maybe they have unknown ownership, which is a red flag, or are located in high-risk geographies. A counterparty's initial profile might be limited to a few things like party name, address, bank name, etc., but over time the business processes enhance these party profiles with third-party data, information related to transactions and account activity, and details learned by investigators of flagged transactions. So millions of entities must be resolved in real time for billions of transactions daily, and this all must be done very quickly. These are called guilt-by-association algorithms, and they include customers that are associated with regulatory and law enforcement watch lists, negative news coverage, global and national sanctions lists, politically exposed persons, high-risk individuals, legal entities with unknown ownership, counterparties in high-risk geographies, and banks in high-risk geographies.

So let me get you to some closing thoughts and bring in some other things that are top of mind out there, like LLMs. What about graph databases and LLMs? Graph databases combine domain-specific knowledge from a graph and general knowledge from an LLM by using relationships to link the two. For example, a graph might link an entity's properties from the knowledge graph to a definition of that entity from the LLM. In this way, the domain-specific context
from the knowledge graph can be augmented with the general knowledge from the LLM to provide a better understanding of a given situation.

Now let's talk about vector databases. As I look at the emerging vector database marketplace, and I'm looking at it a lot these days, I think there's some risk here to graph databases. If you think about the capabilities, think about a Venn diagram. I'm not going to try to say how big each of the circles is on this diagram, but there's graph and there's vector, and there are definitely some workloads today that you could go either way with. However, my guidance today is that they are best suited for different types of workloads, and I'll get to that on the next slide.

Let's talk about vector databases in and of themselves. You might have heard of Pinecone; you might hear about what DataStax is doing, what Mongo is doing, what Elastic is doing in this area. It's like taking a vertex, or, let me use more general terms, taking an entity and breaking it down into a bunch of numbers. How many? Maybe around 100 to 300 per entity. This process is called graph embedding, and to me it's kind of like preloading the entity with all manner of analytics that might be useful and that might be hard to do otherwise. And it not only preloads, it keeps it up to date. A lot of that information has to do with related entities, so vector databases are really good for similarity search. We find them in machine learning, recommendation systems, and similarity search algorithms. Graph databases do some of this as well, but they're less ideal for managing very high-dimensional data, and they're not as scalable in this way as vector databases. Items that are near each other in this embedding space are considered similar to each other in the real world and in our businesses. Embeddings focus on performance, not explainability, so embeddings are there for high
performance similarity search. Graph embeddings usually have around 100 to 300 of these numerical values; think of it as an array. The area around the vertex that is used to encode an embedding is called the context window. Some embeddings might only look at customer purchases from the last year to calculate an embedding; other algorithms might look at lifetime purchases and searches that go back to when the customer first visited your website. But these embeddings do take up valuable RAM, so we don't want to go too crazy and embed things that we're never going to use in a comparison. We want to focus on where similarity calculations get in the way of real-time response for our users; that's what they're about. These graph embeddings can be used as an additional tool to increase the performance and the quality of the graph algorithms. In other words, you can use both in the same workload: you can use the graph algorithms to come up with relationships, and that can be fed to vector databases for more.

So, specifically breaking them out here: graph databases are better suited to processing data with complex relationships, whereas vector databases are better suited to handling high-dimensional data such as images and video. Graph databases are made for queries involving relationships, while vector databases excel at similarity searches. Graph databases utilize graph traversal techniques to discover associations between nodes; vector databases use algorithms like k-nearest neighbors to locate comparable vectors. Vector databases excel in handling complex relationships and interconnected data. They are particularly useful in scenarios where the relationships between entities are of utmost importance, such as social networks or recommendation systems. Now, I said some words there that you've heard before in this presentation that have to do with the graph database. So again, this is my direction for you for now. I am going to watch this space, you are going to watch this space, and see
where vector databases decide to put their energies. I don't know that they're necessarily, in any kind of short order, looking to do all the things that a graph database does. I tend to think not; there's so much that they need to do around their core offering right now. But here we have yet another data platform that could potentially have value inside our enterprises. I might have added graph databases for you today, and vector databases, which clearly merit their own webinar, their own hour, might be yet another. What do you know?

Okay, in conclusion: graph is a fast-growing data category. It's all about the use case. Good for graph, we saw some of these: real-time recommendations, fraud detection and risk, network and IT operations, entity resolution, and identifying relative importance. And we spent some time differentiating with vector databases: graph databases are made for queries involving relationships, while vector databases excel at similarity searches.

I just have a quick minute for you here before I get to your questions. If you have questions, toss them in there; I'll give you a minute to do that while I show you that we've covered a lot of ground this year already, in everything really, but also in this webinar series. There you see some of the ones crossed out. They are available mostly at YouTube, also at dataversity.net, if you want to look back on anything, including this one in a few days. Coming up in the next few months: next month I'm going to talk about common misconceptions about master data management, and I'm going to touch on organizational change management, open source versus commercial, data quality, and strategies for machine learning success before the year is out. And we're already over halfway. What do you know? This brings me to the end of the formal part of the presentation, and I'll turn it back to Shannon to see if you have any questions.

William, thank you so much for another great presentation, and so nice to see so many people on the webinar today who were on
all day with us yesterday as well. I love that. And there's a suggestion to work on a webinar on just vector databases alone, which I think is fascinating; I think we should look into that. So, William, diving in here, lots of questions coming in, super early even. Have you encountered firms using KGs, knowledge graphs, to help understand systems, systems being the software applications that produce the data in the databases?

Did you say pharma? Was that your word? Sorry.

Did you encounter firms using knowledge graphs to understand systems?

Yes, as part of network analysis. Systems are thrown in there with every component of the network, and when you're monitoring your network, you usually have your systems in there as well, being monitored by a graph database. So systems are a vertex type, if you will, in your graph model, and definitely have a strong place in those types of graphs.

Nice. And William, "I'd like to hear about the supporting information architecture activities and components that are addressed to support the data and metadata of graph databases versus, say, traditional relational or hierarchical databases, either during this call or where I can find those best practices." I think that could be a whole webinar.

That could be a whole webinar. A lot. I mean, I didn't really get into best practices here, but hopefully I got you started, and you'll know what kind of end game you're trying to get to. Actually, there are some design decisions in graph databases; there really are. There's no letup when you move from relational to graph in terms of design decisions. Some people like to jump right into creating their triples and not having that domain model, but I always encourage: let's build the domain model and know what we're doing. Even though we can put anything in our graph, let's not put
anything in that we're not expecting, that doesn't have a place in our model. So maybe that's the old modeler in me coming out, but yeah, you can definitely do that here, and I think it's a good thing.

Indeed, yeah, and we certainly have some resources on our site in addition to that. But diving in further here, William: what tools support the metadata for a graph database?

Data catalogs do an okay job with it, but it's not an immediate port for most of them, so that's kind of a down-the-line enhancement that a lot of the catalogs are doing. So I would say for the metadata, unfortunately, you've got to look within the tool, within the database itself, within its own catalog. There's some information there: information on the nodes, on the placement of the nodes, on what's being discovered automatically by the database itself running autonomously, and things like this. But in my view, and maybe I just don't focus on it enough, it's not really rich in a lot of metadata that can be shared to other systems, in case that's where they were going.

Nice. And we've got about three minutes left, so I'm going to slip in another question here. "If knowledge graphs are built around noun relationships, how do they address the issue of many names, nouns, for the same data thingy?" I love that technical term.

Yeah. Well, this gets back to the modeling that I recommend we do beforehand, where you reconcile all that. Because if you end up with a graph where you're identifying multiple vertices when they really are the same, then you're just asking for bad output and bad relationships to be developed, and the relationships probably will not be as strong as they should be, because you're watering down your vertices in that way. So just as we want to use graph databases to identify similarities in our nodes, we need to do that before we get into the graph
database and make sure that we are implementing no synonyms, no antonyms, no homonyms, none of that stuff.

And one other: some recommended resources to learn more about graph that you recommend?

I think several of the vendors have good information like this. I'm a Neo4j kind of person, so I get a lot of good information there. They have a lot of good presentations, and they've invested a lot in education for the market, so I would say that's a good place. TigerGraph also has some good things. Really, they all do, so if you're interested in any one of them, I would start there, but you might find your way for some basic graph information over to the Neo4j site.

Yeah, I know they've spent a lot of time on education, as most of them have as well. Well, William and everybody, thank you so much for another great presentation, but that is all the time that we have for today's webinar. You guys are just amazing. Again, I love seeing so many people on here today that were on here all day with us yesterday for Data Architecture Online. And just a reminder, I will send a follow-up email to all registrants with links to the slides and links to the recording by end of day Monday for this webinar, so we'll get that out to you. Thanks again, William. Thanks, everyone.
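
As a companion to the graph versus vector comparison in the talk, the following Python sketch contrasts the two query styles side by side: a k-nearest-neighbors lookup over embeddings and a hop-limited traversal over relationships. It is only an illustration under stated assumptions. The names, the three-number embeddings (real ones were described as roughly 100 to 300 values), and the tiny adjacency list are all made up, and no particular vector or graph product works exactly this way.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def k_nearest(query, embeddings, k):
    """Vector-style query: the k entities whose embeddings are most similar."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, name breaks ties
    return [name for _, name in scored[:k]]


def within_hops(graph, start, max_hops):
    """Graph-style query: everything reachable within max_hops relationships."""
    seen = {start}
    frontier = [start]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return seen - {start}


# Hypothetical toy data: 3-number embeddings and a small "knows" graph.
embeddings = {
    "alice": [1.0, 0.0, 0.0],
    "bob": [0.9, 0.1, 0.0],
    "carol": [0.0, 1.0, 0.0],
}
graph = {"alice": ["carol"], "carol": ["dave"], "dave": []}

similar = k_nearest([1.0, 0.0, 0.0], embeddings, k=2)
related = within_hops(graph, "alice", max_hops=2)
print(similar, sorted(related))
```

The similarity query returns entities that look alike numerically but cannot say why, while the traversal returns entities that are explicitly connected and can show the path. That difference is the performance versus explainability trade-off described in the talk.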