My name is Aran. As was mentioned, I'm a senior software engineer at Intel. (I'm seeing myself on the screen, not the slides.) I've been working in the Advanced Analytics department for the past eight years, and today I want to talk to you about a solution we recently developed for sales and marketing at Intel: building and maintaining a knowledge graph.

A little bit about myself and where I come from. At Advanced Analytics at Intel we build and deploy machine learning and artificial intelligence solutions across many of Intel's divisions: from the manufacturing process, where we improve quality and reduce test costs, up to sales and marketing, where we increase revenue, which is what we're going to dive into a little today. We also have products facing the outside world, for instance for the healthcare industry, helping with the analysis of Parkinson's patients, and for the automotive industry, where we work with Mobileye on analyzing automotive information. Advanced Analytics is composed of 200 people: data scientists, machine learning engineers, and product managers.

But today I want to talk to you about sales and marketing, because this is where we deployed our knowledge graph. The sales and marketing department at Intel is huge: they have thousands of people covering hundreds of thousands of accounts, and they can offer them hundreds of products, from CPUs to memory and RAM. This enormous complexity may lead to some missed business opportunities. This is why we created the Sales AI platform, to bridge the gap of those missed opportunities. It helps by collecting customer data in real time, detecting intent to buy and potential opportunities, recommending the appropriate response, and recording the results back into our models in order to make them more robust. In the past five years our Sales AI platform has increased Intel's revenue by more than 450 million dollars.

We've deployed many solutions for account managers, and to give you a little bit of context, one such solution is called Sales Assist, as it assists account managers by giving them leads about their respective customers. Sales Assist provides additional activity information about customers, with relevant context: for instance, the account manager can look at the screen and see that Company A has recently viewed a product it never purchased before; Company 234, on the other hand, is planning to attend a conference. This could be very valuable information.

Through the years we created many such solutions in different parts of the sales and marketing teams, and we started thinking: wait a minute, if we want to see everything we know so far about Company A, can we? Or do we only have these silos of information? Can we get a broader view? And the answer was no: we don't have everything centralized in one place, and we're losing this feeling of unified data. So we decided to create a central data source, a knowledge graph, that would unify all our external data, internal data, and the new insights that we learn along the way.

Why specifically a knowledge graph, and not throw everything into a document DB? Well, everybody's talking about it, so you probably know that knowledge graphs are a great way to represent relationships.
They are very explanatory: you can give the motivations behind your recommendations and explain them to people in normal sentences. You can infer knowledge directly from the graph, as long as you keep it up to date and valid: you can run the same algorithm and quickly get new knowledge about a specific entity. And it's fast, because the graph is managed extremely efficiently and everything is close, one, two, or three hops away. And you can always store back the insights you just collected. (I just realized these are not my latest slides, so I hope I won't lose track.)

So we decided we wanted a knowledge graph. But how does one create a knowledge graph? We started thinking: can we take all our accumulated knowledge and simply turn it into graph form? That would mean going to a lot of data sources and creating graphs from data which is mostly unstructured; it seemed like an impossible chore. So we decided to start from now, meaning we would choose a use case and record any new activity and insight into a knowledge graph structure.

I'll illustrate with the first use case we chose to cover. A company enrolls into an Intel program in order to buy products. While it enrolled into Intel's program, it mentioned that it's part of the IoT industry and that it's a builder in that industry, so we can record that. Later on, Company A continued and purchased product X and product Y. We know that product X and product Y belong to the data center product line, because we have the product graph structure. Later, we browsed Company A's website and discovered smart medical devices, which suggests that Company A actually also belongs to the healthcare industry. Then there was a social media tweet: Company B is about to invest in Company A. So we can connect Company A with Company B, and fortunately, maybe we already have Company B in our graph database, in our knowledge graph that is, so we get a lot more information about things related to Company A. Company A continued and browsed Intel.com, and we recorded that activity too, showing that it was interested in cloud-related products.

We can continue on and on, recording multi-source activities, enriching our graph with external data, relying on the basis of our internal data. And what's amazing is that we got a really dense picture of information, but we started with only this, from our internal data sources. Since things feel like they're flowing in, it was only natural to create a stream processing framework in which any new piece of information that flows in is persisted in graph form. And we did just that: we created an asynchronous processing framework for growing a knowledge graph, as we've just seen. We thought about what capabilities we would like this system to have; we found four main capabilities crucial for the system to possess, and we created a component for each one of them.

The architecture, by the way, is a microservice architecture, and at a high level it's fairly simple. It's written on top of Kubernetes for easy deployment: everything's got a nice Dockerfile, everything is in a pod, easy to deploy. And the components are these. For loading internal information, which we do all the time, we created the loaders. Then we knew we needed to transform information into graph form, so we created the transformers.
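To make "graph form" a bit more concrete: a transformer's output is essentially JSON describing nodes and edges. As a rough sketch (the field names here are assumptions for illustration, not the actual message schema), the "Company A purchased product X" activity from the walkthrough might come out of a transformer looking something like this before being published to Kafka:

```python
# Hypothetical transformer output: plain graph semantics, ready to be published as JSON.
purchase_message = {
    "nodes": [
        {"label": "Company", "key": "company_a", "properties": {"name": "Company A"}},
        {"label": "Product", "key": "product_x", "properties": {"name": "Product X"}},
    ],
    "edges": [
        {"label": "PURCHASED", "from": "company_a", "to": "product_x",
         "properties": {"source": "erp"}},
    ],
}
```

Downstream, the graph builder only has to understand this one shape, no matter which loader or extractor produced the message.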
But that was not enough: we wanted external data, we wanted to go out to the world and enrich our data, so we created the extractors. Last, we wanted to save everything into a graph database, in our case we chose Neo4j, so we created the graph builder. On top of that, in order to keep data relevant (data relevance, that is), we created the refreshers. Everything that goes into the system must pass through the message bus. And we created an ad hoc API for analytical models to have access to the system, in order to add their own edges or update existing edges and nodes.

Now we're going to talk a little bit about each one of those components.

The first one, the entry point to the system, is the loader. Its entire role is loading data from internal data sources such as CRMs, ERPs, and file systems, and it does just that. It supports a variety of formats, and it can work either with push or with pull: it can periodically pull from a SQL server behind the CRM systems, or it can be notified by push from the file system when a new CSV has just arrived. It transforms those records into JSON format and publishes them to our message bus, to Kafka. It works continuously in order to insert data; this is the entry point to the system.

The basic processing units of our framework are the transformers. They do the heavy lifting, so to speak: they transform information into graph semantics, into graph form. They take, for example, product information and transform it into nodes and edges. They're asynchronous, always on, always working, and stateless, which by nature makes them extremely scalable and fault tolerant. We use Kafka Streams to build the transformers. Each one has a dedicated Dockerfile, which can be extended, and they're configurable entities, which means you configure a transformer per specific entity. For instance, in this example, product: you configure the transformer so that for every specific field in the information that flows in, you say what kind of graph semantics it becomes: a node, an edge that connects two nodes, etc. We could have trained, and we actually started training, automatic models to do this transformation, but we decided to start with a simple configuration, doing the labeling of those transformations so we can later replace them with machine learning models.

This was not enough: as I mentioned before, we wanted to get external data. So we extended the transformer and created the extractor. The extractor enriches the data with external information; it goes out to the world and gets more information about that specific product. Like the transformer, it is configured per entity, and it is configured with a data source as well. Each data source needs to have its own implementation of a data source interface, so to speak; this is kind of a plug-in system. If you want to get external data from Wikipedia, you need to implement how you're going to access Wikipedia and what the post-processing is going to look like, because you do not want to keep all the information about that specific product from Wikipedia. You want to do some post-processing, maybe run some simple machine learning models to get some reasoning, before you transform the external information into nodes and edges and publish it back to Kafka, to later be persisted to the graph database.
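To give a feel for that plug-in idea, here is a minimal sketch of what a data source interface and a Wikipedia implementation could look like. The class and field names are hypothetical, not the actual Intel code:

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Plug-in interface: every external source says how to fetch and how to post-process."""

    @abstractmethod
    def fetch(self, entity: dict) -> str:
        """Go out to the world and bring back raw information about the entity."""

    @abstractmethod
    def post_process(self, entity: dict, raw: str) -> list[dict]:
        """Boil the raw payload down to graph semantics: node/edge messages for Kafka."""


class WikipediaSource(DataSource):
    def fetch(self, entity: dict) -> str:
        # A real implementation would query the Wikipedia API for entity["name"];
        # a canned snippet keeps this sketch self-contained.
        return "Company A builds smart medical devices for hospitals."

    def post_process(self, entity: dict, raw: str) -> list[dict]:
        # Toy reasoning step: keep only an inferred industry, not the whole page.
        if "medical" in raw.lower():
            return [{"type": "edge", "label": "SUGGESTS_INDUSTRY",
                     "from": entity["id"], "to": "healthcare",
                     "properties": {"source": "wikipedia"}}]
        return []
```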
The graph builder is the busiest place in the system. All pieces of information get to the graph builder to be saved into the graph database, into Neo4j. It translates graph semantics, JSONs in the format of nodes and edges, into GQL, graph query language. We support Cypher, as it's the most prominent one, and I recommend supporting Cypher: it actually decouples you from the graph database technology, and you can always replace the graph database, though we chose to stick with Neo4j for now. It's important that the graph builder is the only one with write access to the graph database. You want to control the concurrency in order to keep things under control; you do not want any other process to have access to your graph database, persisting nodes and edges left and right.

So why did we choose Neo4j? Well, first of all, it's extremely mature and stable. It's easy to set up and use, but most databases are these days, with Docker and Kubernetes. It uses Cypher, as I mentioned, a great benefit. But for me the most important thing is that it has a great UI, and this is actually a snapshot from our knowledge graph. I'm saying great UI because if your graph database does not have a UI, it's really hard to explain it. If you want data analysts, data scientists, developers, everyone who wants a glimpse of the graph database without writing code and Cyphering the results, you should have a nice UI to show it and explain it. So this is basically why we chose Neo4j.

The last piece was: well, we're going out to the world, we're getting information, but how can we keep it relevant? In order to keep data relevant, we created the refreshers. Their entire role is to trigger extractors. They're the only ones that query the database, since they slowly traverse the graph looking for stale nodes. Once they find a specific stale node, they trigger the relevant extractor to get new information and keep that node up to date. And, similar to the extractors, they are configured per node, per entity, and per data source.

Hmm, okay, this is not where I wanted to get, but we'll do this first. One key principle for how to save data into the database was keeping information close yet not coupled. This is a principle we found very useful when actually creating the structure. When you ask how to save data as a graph structure, the usual response is: well, entities are nodes and actions are edges. Well, we found a few more guidelines, as I will show in the slides.

I'll illustrate with Company A as we know it from before. We scanned its wiki page, and we identified that this wiki page suggests it belongs to the cloud industry. Now, we could connect Company A directly to the cloud industry, but this actually couples Company A to something that we learned from its wiki page, and we don't want that, because maybe later on we will learn something new from that same wiki page. In general, you should always keep the source that your data came from, and the confidence that you have in that new information. So it would probably be better to save a node specifically for the data source. Later on, if we want to run models on all the wiki pages, inferring which industries they suggest, we will not disturb Company A as we run models on the wiki pages, and Company A simply belongs to the cloud industry through that wiki page. On top of that, if we found two wiki pages for Company A, we can simply add both, with a matching or confidence score on each edge, and later our algorithms can decide whether Company A is actually a cloud-industry company, a software company, both, or neither.
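As a sketch of that shape (assumed labels and property names, written as Cypher through the official neo4j Python driver, not the production schema), the company, its wiki page, and the suggested industry could be stored like this:

```python
from neo4j import GraphDatabase

# Hypothetical connection details, for illustration only.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# The wiki page gets its own node, so the company is never coupled directly
# to something we merely inferred from that page.
query = """
MERGE (c:Company {id: $company_id})
MERGE (w:WikiPage {url: $url})
MERGE (i:Industry {name: $industry})
MERGE (c)-[m:MENTIONED_ON]->(w)
SET m.confidence = $match_score      // how sure we are this page is really Company A
MERGE (w)-[s:SUGGESTS]->(i)
SET s.confidence = $industry_score   // how sure the model is about the industry
"""

with driver.session() as session:
    session.run(query, company_id="company_a",
                url="https://en.wikipedia.org/wiki/Company_A",
                industry="cloud", match_score=0.9, industry_score=0.7)
```

A second wiki page for the same company is then just another WikiPage node with its own confidence; the company node itself is never touched.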
So we had that nice infrastructure, and we started thinking about what we wanted to do with it. One of the first use cases we put in: we said it would be nice to see our customers, what the customers buy, and their partner companies.

So how would we configure such a thing? Well, we would first need the sales loader, which loads data from the ERP systems; then a company loader that interfaces with the CRM systems; a product loader that periodically loads product line updates into the knowledge graph; a corresponding transformer for each one of those loaders; and then a partner extractor. In this case the partner extractor simply takes the URL we got from the company loader, crawls the website, goes to the partners page, runs some image recognition to identify logos and relate them to companies, filters out unwanted data to keep only partner names and URLs, transforms them into nodes and edges, and sends them to Kafka, to later be saved by the graph builder.

After configuring the system this way, we got this. This is actually a snapshot I took two weeks ago from our knowledge graph, which illustrates why unifying external and internal data can be nice and maybe create new business opportunities. What we see here is that we have four entities, basically: we have company, we have URL, we have sale, and we have product. We can see two Intel customers, the orange circles, which are connected by the fact that they both bought SSDs, which are part of our memory and storage product line. Each one of them also has partners: this company, with that URL, has two partners, and the edge says "is in business with"; and this one has got four. So maybe this nice view can create new business opportunities with the partners of the companies that already buy from us.
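A view like that snapshot can then be pulled with one query. Roughly, with the same assumed labels (Sale as its own node, a Url node carrying the crawled partner edges):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Customers that bought anything in a given product line, together with the
# partners we discovered by crawling their websites: internal and external
# data answered by a single traversal.
query = """
MATCH (c:Company)-[:MADE]->(:Sale)-[:OF]->(:Product)-[:PART_OF]->(:ProductLine {name: $line})
OPTIONAL MATCH (c)-[:HAS_URL]->(:Url)-[:IS_IN_BUSINESS_WITH]->(p:Company)
RETURN c.name AS customer, collect(DISTINCT p.name) AS partners
"""

with driver.session() as session:
    for record in session.run(query, line="Memory and Storage"):
        print(record["customer"], "->", record["partners"])
```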
So that was all well and good. Let's take a look at how we created this structure, again with the principles of how to save the nodes and edges, and in what kind of formation. Another guideline we had: every piece of information, even if it's an action, that relates more than two parties becomes its own node. You can see that in the sale, and basically also in the URL, but the case for the sale is stronger: you could create a "Company X purchased product" edge, but then you would never be able to add who sold it, or how many (well, maybe how many could go on the edge), and if you want to relate the transaction to another node, you simply can't. So it's better to create nodes for the sale transactions as well.

So that was nice, seeing this unified information, external data and internal data. But then we wanted to know if we can generate new knowledge on top of that data. When you have a strong knowledge base, a verified knowledge graph, you can enrich that graph simply by running analytics models, without going to external sources. This is a paper in the process of being published in the coming weeks, where we did customer segmentation into verticals using the knowledge graph. These specific machine learning and deep learning models rely on Wikipedia data, Intel activity, and web pages in order to infer the verticals, and this was not possible before we had all that information in one place. So if you find this intriguing, you can probably find it on Google Scholar in a week or two.

So this was our system, and I hope you found some of the things that we learned beneficial. I believe I'm done, and I'm open to questions. And I apologize, because I had a few more slides, but this deck was a little bit out of date.

Thanks for the nice presentation; I really like the architecture that you used. My question has to do with the data model that you touched upon at some point. It's a two-part question. Part one: how did you get yourself up to speed? Because indeed, as you pointed out, there are quite some trade-offs to be made when designing graph data models. So how did you educate yourself on that? Did you just learn by doing? And part two, which is a more specific question: you mentioned that for a specific kind of relationship you basically create a new node every time you have an instance. So I'm wondering how many instances you currently have in your model and whether you see that approach scaling well.

Okay, the first part: it was hard. We started reading; we tried reading a lot of material and papers about the subject, but there aren't many clear guidelines. We asked our data scientist team to actually reverse engineer what they needed from the algorithms they would want to run on the data, and basically they started mapping. Since we did that iteratively, and we didn't do everything at once, each time they came up with a new mapping, we put the mapping in, and sometimes it changed; it wasn't right on the first try. So there wasn't any magic here. Second thing... remind me of the question, I really forgot it.

You said that you made a specific design choice to model a certain relationship by instantiating a new node every time, and I was wondering how many instances you currently have in your knowledge graph, whether this specific choice is scaling well so far, and how you see that playing out.

So far it scales well. We have, I believe, around tens of millions of entities in our knowledge graph, maybe even more if you add all the edges. Holding things in nodes makes it easier, and this is our personal opinion from playing around with things, because an edge limits you more than a node does. So this is why we decided that basically everything that we could turn into a node, we made a node.
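To tie this back to the sale example from earlier: modeled as its own node, the transaction can carry its own properties and connect to as many parties as needed, which a single purchase edge between company and product could not. A rough Cypher sketch, again with assumed labels:

```python
# Hypothetical shape: buyer, seller and product all hang off the same Sale node.
sale_query = """
MERGE (buyer:Company {id: $buyer_id})
MERGE (seller:AccountManager {id: $seller_id})
MERGE (prod:Product {id: $product_id})
CREATE (s:Sale {quantity: $quantity, date: $date})
MERGE (buyer)-[:MADE]->(s)
MERGE (s)-[:OF]->(prod)
MERGE (seller)-[:SOLD]->(s)
"""
```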
Making everything a node means you can pull out a view of only the nodes of the type you want and play with them. And if you map the nodes to their external data sources, I can say: well, now I want to work only on tweets, and poof, I have only the tweets, because I have nodes for all the tweets. So it makes things really more neat and organized. That was the idea.

Hello, thank you for an amazing presentation. My question is along somewhat different lines: why did you use a knowledge graph in the first place? Maybe you could have used a multi-dimensional graph or something.

A what? A multi-dimensional graph?

So, plotting each of these, for example companies, Twitter-based information, on a different level of the graph.

When you say multi-dimensional graph, do you mean multi-dimensional like relational?

Yes: data structures, layers, each of these graphs connected amongst each other.

Okay. We wanted, and that was the first motivation, I don't know if I mentioned it, that if an account manager wants to see an in-depth view of a specific company, it would be easy to take that subgraph fast, in real time, and get all the information around it. It's similar to what Google is doing with the knowledge box: you want a specific term, you want to get all the information related to that term, and you can get it in milliseconds, but only if you use a graph structure. If you start joining data from different sources, it will take time, and it's not scalable. And what you mean by multi-layer would still be a graph, right? So I'm not sure what you mean.

Okay, yeah, I was talking more in terms of: all the companies are on one layer of the graph, and all the other factors, some extra data that you had that comes from Twitter and so on, are on another layer.

So why would you want to create a layer, like a tree?

Yeah, kind of.

Why would you want to create a tree to represent your knowledge?

Maybe, I thought, it would again be easier to pull if there's a hierarchy.

But the important part was the relationships, not the hierarchy, because there is no hierarchy. It's not that Twitter is in charge of companies, or Twitter is above companies, or companies are above Twitter; they interact with each other. So I don't see the value in a tree, but maybe I'm missing something totally, so I'll be glad to talk about it afterwards.

Thank you. Thanks for your presentation. My question is about, with all this data, how do you control the veracity of this data after, I don't know, months, years? Because you are using this data, as you mentioned, for the sales assistant. So how do you control the veracity, the quality, of this data? How do you make sure your refreshers, I don't remember exactly the terminology, are gathering the right data?

Okay, the refreshers: they simply trigger extractors, which go out and fetch the data, and each extractor has a sort of post-processing logic, and this confidence scoring is done at that stage. Meaning we do not simply go out to Wikipedia, get the entire page, and save it; we must first of all make sure how confident we are that this company and this wiki page are indeed the same thing. So we run a matching algorithm for each one of these matches in order to give it a score.
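The talk doesn't go into what those matching algorithms are; as a toy stand-in, even a simple name-similarity baseline from the standard library gives the kind of score that ends up on the edge:

```python
from difflib import SequenceMatcher


def match_score(company_name: str, wiki_title: str) -> float:
    """Crude confidence that a company and a wiki page refer to the same entity."""
    return SequenceMatcher(None, company_name.lower(), wiki_title.lower()).ratio()


print(match_score("Company A Ltd.", "Company A"))  # roughly 0.78
```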
And later on, because we keep the wiki page in a separate node, we can always say: well, now I want to run a different algorithm on all our wiki pages and try to give a different matching score. We can do that later on precisely because we separated it into nodes. But we always run algorithms after we extract the knowledge in order to validate it.

You don't have human intervention, someone reviewing the data or the graph?

No, no, no. It's not maintainable, it's too much. We persist, the last time I checked, and I don't want to commit to a number, between five and ten thousand entities per second. So it's not huge, but it's a lot.

Thanks, that's all the time we have. So thank you very much.