Thank you. I've been informed multiple times to stay inside the egg, so that's what you need to do: if I get out of the egg and get excited, bring me back into the egg. And thank you very much for the introduction.

I'm going to be focusing on a part of the entire data pipeline that often doesn't get focused on too much: the data preparation phase. The idea being: what's all the work we need to do on data before we give it to you, to fuel use cases like machine learning or advanced analytics? It might be business intelligence, where, to be honest, a lot of the fun happens. And to highlight this, I'm going to focus on one small part of our data preparation pipeline that we've been working on, and specifically on how we used a graph store, in this case Neo4j, to act as a persistent store for the state of a decision engine that would help us automate a large portion of this data preparation phase. After all, they do say that in any data project we work on, 60, 70, 80 percent of the time is actually spent on this data preparation phase.

So this is me when I'm actually formal, with a suit on and everything. My background is software engineering. I've spent the last 14 years as a software engineer, and the last six of those mainly focusing on the data space. I probably have quite an odd skill set compared to a lot of people here: I'm a classically trained Java and .NET software engineer, but for the last six years I've been focusing on things like graph theory, search and text analysis, and more on the natural language processing and ML side. So let's dive in.

What we do at CluedIn is help large companies build a data foundation. What is that? Think about every data project that you work on. There are some common pillars you go through to solve that use case. Every project has a data integration piece to it; everything has a data preparation, data wrangling and massaging piece to it; and of course we're in Europe, so things like compliance and governance also fall into those pipelines, but are often one of the last things we think about. So I guess you could say that what we do at CluedIn is condense and abstract these common pillars into one single platform, to make it simple to deliver data that actually adds some type of business value.

So then, why a decision engine? Why did we think we needed to build a decision engine for our customers? In the big data world they often talk about the three V's: volume, variety and velocity. And after working with large enterprises, I can honestly stand here and say this is just the tip of the iceberg. What I think the big data era did is expose the amount of technical data debt that we have as companies. Not only that, it also exposed that the ways we solved data problems before big data came along just do not scale.

So the core concept was: how could we let the data work for us? And one of the design principles we came up with was: how could data take on quite a biological, organic form, where it could do the work to self-link, self-discover connections, self-enrich, self-clean?

So let's go into how we did it. We had a simple idea. We needed to build a decision engine that could take data in and automatically prepare it as much as possible. For this, we realized very early on that we needed to persist data.
Why? Because data doesn't stop flowing. There's no snapshot in time where we can think we've got everything and can now start making our decisions. So this required persisting whatever this decision engine was actually building, knowing that maybe in a month, maybe in two minutes, maybe in a year, we would get more data and insights to help us make a better decision on preparing data.

Of course, coming at this from a performance perspective, we were mandating that it needed to be parallel in nature; there needed to be an asynchronous nature to this decision engine, so it could spawn off different decisions asynchronously. Fortunately, we had one ace up our sleeve, which was that this did not need to be real time. I think most people would agree that when it comes to data preparation, data quality, and using data to fuel use cases, I would much rather have high-quality prepared data than something that comes to me quickly and that I need to fix and clean up at a later point.

And then there was this kind of odd concept of: could we gain context where context didn't exist? Where data was ambiguous in nature, could we actually enrich it to a point where we could get something out of nothing?

Why even do this? Well, apart from the fact, as I've alluded to, that the classic ways of data preparation do not scale under the sheer weight of the three V's of the big data movement, it was to be able to separate noise from value. You can imagine that in a big data world, whether it's tracking transactions, whether it's customer data, whether it's analytics, there is still a lot of noise within that data. So could we use this decision engine to work out what was valuable, what was relevant, and what was not?

And for those of you who have worked with graph theory before, with graph components, and maybe some of you have actually worked with Neo4j itself: the reason we picked this persistence store was literally because it matched the type of data structure we wanted. But what we realized is that we could also leverage a lot of the power that comes with graph theory to prepare data. Something as simple as figuring out how two nodes in a network are related to each other: we could lean on shortest-path algorithms to figure these types of relationships out. The end goal being: could we actually build a knowledge graph, a rich, strongly typed knowledge graph, based off raw source data? That's what this decision engine was there to do. And as I alluded to before, let the data do the work. If we enter a big data era where we don't let the data do the work, you will constantly be fighting the volume, variety and velocity of data.

Before I head into the complexities of the decision engine, it is worth mentioning, and maybe this is my bias from a software engineering background, that the simple techniques in data preparation still work. They're still very valuable. We work with quite a lot of financial institutions, and what happens when you're utilizing automated engines and automated decisions is that a lot of the time these financial institutions and insurance companies require explainability, for audit reasons. And this kept coming up, reinforcing that the graph was a good persistence store for us. I've got some pictures later that show you how explainable it is, how easy it is to look at a network and actually understand why certain decisions were made on data. So it's that explainability that was also important for this.
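As a quick aside on the shortest-path point above: in Cypher this is essentially a one-liner, callable from application code via the official Neo4j Python driver. This is a minimal sketch only; the connection details, the Entity label and the name property are placeholders I've made up, not CluedIn's actual schema.

```python
from neo4j import GraphDatabase

# Hypothetical connection details; adjust for your own instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def how_are_they_related(name_a: str, name_b: str):
    """Find one shortest chain of relationships between two entities."""
    query = """
    MATCH (a:Entity {name: $a}), (b:Entity {name: $b}),
          p = shortestPath((a)-[*..6]-(b))
    RETURN [n IN nodes(p) | n.name] AS chain,
           [r IN relationships(p) | type(r)] AS rels
    """
    with driver.session() as session:
        record = session.run(query, a=name_a, b=name_b).single()
        return (record["chain"], record["rels"]) if record else None

# e.g. how_are_they_related("Larry Ellison", "Oracle") might return
# (['Larry Ellison', 'Oracle'], ['WORKS_AT'])
```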
So, the end design after around six months can be summarized here. What we designed in the decision engine was a recursive decision tree that would organically grow and expand, and if we were building branches of this tree that weren't linking to anything, we would cut them; we would collapse those branches. There wasn't enough interesting information to discover in those paths. This was our first, I would say naive, approach. I've added a little bit on there at the end, which is the learning factor that came around one and a half years later, where we wanted to leverage machine learning to figure out how we could learn from the decision trees and decision engines that had been built in the past.

It is worth mentioning that before data goes through this decision engine, there's a lot of pre-processing that happens, a lot of text analysis. For example, in some of the techniques and results I'm going to show you, we're running things like cleaning up the data, and we're running things like named entity recognition to detect objects within text. And of course there's always a statistical side to this: there's no guarantee that those types of natural language processing systems will do the right thing. It really comes down to a confidence level.

I think our CTO, Martin, articulated it quite well. And if he looks like that, you know he's good at engineering, right? Essentially what he highlighted was that when we're working with data, most of us are not working with the highest-fidelity version of our data. If you look at any data store available today, relational, wide-column, document stores, search stores, object stores, blob stores, the graph is the highest-fidelity data structure we have for storing our data. And for me, this is the secret, the missing piece for fueling machine learning and business intelligence use cases: deliver to the business the highest-fidelity data structure we have.

The way I like to go into depth on this is: if you look at any data store you use today, a graph can always downcast itself to any other structure, relational, document, object, blob. Going the other way requires inference. It requires you to guess to get to a higher fidelity of your data structure.

So let's head into what this looks like in action. Imagine you have something as simple as the word "Oracle". What is Oracle? This is where the pre-processing step comes in: the fact that we can use named entity recognition to figure out that, in the context of this text, this is highly likely to be a company. This depends on the models you have available.

So when you place this through the decision engine, and I'll explain the user interface for you (I actually do see a couple of engineers here that I'm used to working with, but for those who have not seen the platform before): essentially we've got a really ambiguous reference here. We've got a company called Oracle. Which company? Which country? And I want to fast forward directly to the actual results. Once we placed this through the decision engine, it was able to figure out, with high confidence, its website, its social accounts, annual revenues, addresses.
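To give a feel for that named entity recognition step: here is a minimal sketch using spaCy, an open-source NLP library. The talk doesn't say which NER implementation CluedIn uses, so this is purely illustrative; the point is that a statistical model tags "Oracle" as an organization from context, which is exactly the kind of signal the pre-processing feeds to the engine.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Is Larry Ellison the CEO of Oracle?")
for ent in doc.ents:
    # e.g. "Larry Ellison" PERSON, "Oracle" ORG
    print(ent.text, ent.label_)
```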
How does it do this? It really comes down to blowing up different decision trees to explore data. The references in the network that link more than others are the paths we want to follow.

If we look at this from a different view, this adds a little bit of context. Where did this reference come from? Well, it came from a book, a PDF document, in a SharePoint repository. Okay, that's context. That helps. And what this decision engine was able to prove and show, if we look at the same data in a different view once it's been run through the engine: we've got annual revenues, stock codes, social accounts. Not to mention we found new links, references to other records within the network. Why this is interesting is because it's turning soft references into hard references. It's turning links into deep links. And there are many use cases this can be applied to, especially when analyzing unstructured data. Just for your information: underneath this view, it's actually the decision trees that get built up to form this type of answer.

One of the downsides, I guess you could say, of this approach is that it's very heavy on storage and very heavy on processing. This is not a real-time system. But it does come back to: would you rather have real-time data or high-quality data? That's the decision you need to make for your different use cases. If I have more time, I'll show some more use cases later, but I want to head into the specifics, the details, of how the engine works.

So, as I alluded to before, we had quite a naive approach to start with. I would say it was a brute-force approach, where every time we discovered data, we would build and persist these large networks, all the time. And if we ever found new data that potentially we had seen before, it exploded out a completely new decision tree, so there's not a lot of reuse there. And when it comes to, for example, exploring different parts of the decision tree: I alluded before to times when we would cut off certain paths, when we wouldn't follow them in the decision tree because they weren't linking enough to other records in the network. How could we learn from that historical data?

So there were two areas we were looking into in the ML space: one was a supervised approach and one was unsupervised. If we first look at the supervised approach: what we realized we needed to build on top of this decision engine was an annotation system. An annotation system that would utilize reinforcement learning, with data stewards reinforcing some of the assumptions we had made in the decision engine. Something as simple as: is Larry Ellison the CEO of Oracle? We could formulate these questions because we had our data persisted in a network, where typically you have two nodes connected via some type of relationship. Sometimes it's not a strong relationship, but it's often something like a verb: Larry Ellison works at Oracle. This allowed us to formulate these natural questions to prompt data stewards with. And one of the interesting things is that we would give those data stewards a simple yes, no, skip type of interface. If anyone in the audience has used something called Prodigy before, an annotation tool, that's the kind of tool we gave to the data stewards. So it's a binary classifier: yes, no, skip.
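A minimal sketch of how such a yes/no/skip prompt could be generated from a graph triple. This is my own illustration based on the "two nodes connected by a verb-like relationship" description above, not CluedIn's actual code; the triple structure and wording are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    subject: str     # e.g. "Larry Ellison"
    predicate: str   # e.g. "CEO_OF" (a verb-like relationship type)
    obj: str         # e.g. "Oracle"

def to_question(t: Triple) -> str:
    """Turn a graph edge into a natural yes/no question for a data steward."""
    verb = t.predicate.replace("_", " ").lower()  # "CEO_OF" -> "ceo of"
    return f"Is {t.subject} the {verb} {t.obj}?"

t = Triple("Larry Ellison", "CEO_OF", "Oracle")
print(to_question(t))   # Is Larry Ellison the ceo of Oracle?
answer = "yes"          # steward answers yes / no / skip
```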
And one of the extra wins we got from storing our data in a network structure, in a graph, was that if I answer yes to the question "Is Larry Ellison the CEO of Oracle?", there is so much more being reinforced than just telling a system it did something right or wrong. You're actually reinforcing that Larry Ellison is a person (some would probably not agree with that). You're also reinforcing that CEO is a job title. You're also reinforcing, once again, that Oracle is a company. And then you might say: yeah, but we already knew that, we already detected that. Well, this is more reinforcement that we're doing things right in the engine.

Then there was the unsupervised approach. This one took us quite a while to figure out. To leverage an unsupervised approach, it comes back to letting the data do the work. This was the area we needed to look into to figure out how we could scale this against the big amounts of data we were often working with.

Before I head into the techniques we used, I just want to stop and think about some of the good parts of the approach up to this point. The first, and I know I've said this before, but I just want to reiterate it, is the explainability of persisting data into a graph: being able to easily visualize how two records, maybe across an entire business, are related to each other. And one of the challenges in data preparation that is really helped by this, and that plagues every single business, whether you're small, medium or large, is duplicate data. I'm going to go into how we solved that problem of duplicate data to a higher precision using the graph.

Of course, in engineering you get wins and you get losses; every decision you make in a certain direction makes other things worse. And in this case, it always came back to this being a brute-force approach. The amount of time it takes to process data, to explode out these different decision trees using internal and maybe external data sources as well: this is definitely not a real-time system. You can imagine that the amount of chattiness between the application and the graph in persisting all this could be quite high. Of course, you can do some pretty smart caching to make sure you're not constantly persisting to the graph store, but really, this was just the reality we had to face.
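On that "smart caching" point: one common way to cut down chattiness against Neo4j is to buffer writes in the application and flush them in batches with a single parameterized UNWIND statement. A minimal sketch of that idea; the buffer size, labels and properties are arbitrary illustrations, not what CluedIn actually does.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

class BufferedWriter:
    """Buffer decision-engine writes and persist them in batches."""
    def __init__(self, batch_size: int = 500):
        self.batch_size = batch_size
        self.buffer = []

    def add(self, source_name: str, target_name: str, confidence: float):
        self.buffer.append({"src": source_name, "dst": target_name,
                            "conf": confidence})
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        query = """
        UNWIND $rows AS row
        MERGE (a:Entity {name: row.src})
        MERGE (b:Entity {name: row.dst})
        MERGE (a)-[r:RELATED_TO]->(b)
        SET r.confidence = row.conf
        """
        with driver.session() as session:
            session.run(query, rows=self.buffer)
        self.buffer = []
```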
But in the end, this is exactly what we wanted. I would rather deliver more highly connected, higher-quality, more enriched, more linked records across data sets than something I could deliver at a faster rate.

And then there was this weird design idea we had, to be able to add context to ambiguous references. Well, the bottom line is that a lot of the time you do have more than nothing. You do have something. If you're connecting to CRM systems, you have a lot more customer data than just references to names. This design principle was there to say: well, if we can make something out of something very ambiguous, it would be quite fun to throw data that's richer in nature at the decision engine as well.

And I'm biased, I guess, but it comes back to this realization that, out of all the data stores available to us today, graphs are good at solving certain challenges. Every other database still makes sense for storing data and for processing it in other ways. But from a fidelity perspective, from having your data represented in the highest fidelity, you can't go past the graph. It is the highest-fidelity way of mapping and connecting data.

So: this is a problem that every business has, duplicate records. How do you figure out that these records are related to each other? I can ruin it for everyone: you're all doing it the same way. Every business today is doing this the same way. We put these records side by side and we run string distance functions, maybe probabilistic matching engines: what is the Levenshtein distance, what is the Hamming distance, what is the Jaro-Winkler distance between these two records? And we do that across all the important properties. If those string distances hit over a certain threshold, then accept that, automatically merge, and then have data stewards manually fix the false positives. That is how every business is doing it. Now, we also use this approach, because it is a very important part of the entire chain. And I'm very sorry if this is the way you do it, I don't mean any offense, but it is quite a naive approach to the problem. One thing we are missing here is context. What about the data around these records, whether it's directly connected or indirectly connected?

So I want to talk about four techniques that we built into our decision engine to raise the confidence of deduplicating data across a business. The first was pattern matching, and probably not the type of pattern matching you're used to: I mean pattern matching in a graph. There were some very practical optimizations that we learned the hard way, i.e., if I make a certain decision on data at a point in time, what happens if that data changes? Do I have to re-evaluate every decision tree that was built off the state of the data at that point in time? And the bottom line is, the way we designed the system in the first approach: yes, this is what happened. The entire network, the entire graph, would just re-evaluate, and essentially just kill the heap, and the data store would fall over. So we had to do something about this from a practical perspective. And then there's the supervised approach and the unsupervised approach: how could we apply machine learning techniques to solve this data-duplication problem?
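For reference, the side-by-side string-distance approach described above looks roughly like this. A simplified sketch with a pure-Python Levenshtein distance; in practice you would reach for a library with all three distances, and the 0.85 threshold is an arbitrary illustration, not a recommendation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0..1 similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def naive_merge_decision(rec1: dict, rec2: dict, threshold: float = 0.85) -> bool:
    """Merge if every shared property clears the similarity threshold."""
    shared = rec1.keys() & rec2.keys()
    return all(similarity(rec1[k], rec2[k]) >= threshold for k in shared)

print(naive_merge_decision({"name": "Jon Smith"}, {"name": "John Smith"}))  # True
```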
Now, to understand the pattern-matching piece, I actually have to fast forward to the supervised learning, because that comes before.

So here's the problem; it's a pretty classic problem. You've got data coming in from different sources. Our task is to turn multiple records into potentially fewer records: merging, entity linking, deduplication, they're all synonymous, all the same thing. That's the kind of problem we need to solve. And these are the types of records we're used to looking at. This is a person record: first names, last names, aliases. What we're not looking for here is identifiers; we're looking at fuzzy references. These are the types of properties we want to be able to overlap when there's no unique ID to overlap these records on.

Our first naive approach was a disgusting set of if-statements, business rules: if the name matches the name from the other record to a certain level of precision, then mark that as a yes, they are the same, we can accept that. And this is just one of those cases where machine learning is a perfect fit. Because what we didn't realize at first is that we have hundreds of thousands of records that we had merged in the past, whether they were done manually or because we had beautiful IDs that linked them. What we were interested in doing is saying: get rid of the IDs. I want to figure out, without perfect IDs to merge records on, what the properties will look like. I know that they matched; I know that they merged. We can throw this at classification engines.

So, like most people, we started with the kind of generic classification models: a random forest. It's very explainable, it's easy to understand, it's kind of machine learning 101. I'll ruin it for you: in the end we couldn't use this by itself. We actually had to use an ensemble of multiple techniques to throw at this problem. The other two were a support vector machine and a single-layer perceptron. This ensemble of techniques allowed us to get to a precision we were happy with for automatically de-duplicating records.

And so here comes the pattern-matching part. As I alluded to before, decisions are going to be made in the decision engine all the time, and it can only make decisions based off the current state of the graph, and data never stops flowing. So what happens when new data comes in and potentially de-duplicates records, or just discovers new relationships? Does that mean we have to re-evaluate every decision that was made at that snapshot in time? What we realized we had to do was cache some decisions. And when our classification engine realized there was a duplicate, it triggered the decision engine to persist some data. Now, the data it persisted was this. In this example here I've got three people records, Emil, Emil and Emil, and what we actually store is a vector of the types of relationships, whether incoming or outgoing, attached to an individual record. It's a proprietary format; I'll explain the top row. Essentially it's just saying: one outgoing reference to a person, one outgoing reference to a company. The next line would be one incoming reference to a person. So we would store these little vectors in the graph itself; in some of the graph stores, Neo4j being one of them, you can actually store properties on the relationships themselves. And what this meant is that when we needed to explore part of the decision tree, or to de-duplicate records, we could look up these tables and determine: do we even need to explore a certain part of the tree? For example, if I de-duplicated people records, it might be that there are certain subtrees coming off them that I don't need to evaluate.
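The transcript calls this a proprietary format, so the following is only my reading of the description: a per-record vector counting incoming and outgoing relationship types, persisted alongside the record so the engine can later check whether anything relevant changed before re-exploring a subtree. A sketch:

```python
from collections import Counter

def relationship_fingerprint(edges):
    """Count relationship types by direction, given an edge list like
    [("out", "PERSON"), ("out", "COMPANY"), ("in", "PERSON")]."""
    return Counter(f"{direction}:{rel_type}" for direction, rel_type in edges)

def needs_reexploration(cached, current) -> bool:
    """Only re-explore a subtree if the record's relationship profile changed."""
    return cached != current

cached = relationship_fingerprint([("out", "PERSON"), ("out", "COMPANY")])
current = relationship_fingerprint([("out", "PERSON"), ("out", "COMPANY")])
print(needs_reexploration(cached, current))  # False: skip this subtree
```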
And this is what helped us stop the decision explosion, this need to re-evaluate the entire graph, which is not a good thing to do. So that's what helped us: these local caches that would say, don't explore these subtrees unless certain properties change or certain types of relationships appear.

Here are just some other examples. I've kind of ruined the answer already by showing the slide, but this took us so long to figure out: how we could utilize a neural network to get that learning capacity. Because when you look at the supervised approach we took, we had an annotation tool where data stewards were manually reinforcing whether CluedIn was doing something right or wrong. You know: did we correctly de-duplicate this record, yes or no? The goal of that reinforcement is not to have a human tell the system why it de-duplicated; that's what these classification functions are for, that's why random forests exist: to let the data do the work. What you'll often find is that most businesses are solving this challenge, as I alluded to before, with a huge, complex web, a complex mesh, of business rules. Think of the classification approach as reverse-engineering the rules from the actual data itself.

So when it comes to a learning factor, I would say that supervised approach still had quite a manual touch to it. One of the questions we often get about that part of the platform is: how long do we have to spend annotating the system? And often the answer was: I genuinely have no idea, but it will never stop asking you questions. There is always something to annotate; there will always be something to label. And so this was the next iteration: how could we remove that manual process? And let's be honest, in the types of enterprises we work with, banks, insurance companies, I think a lot of these businesses are worried about removing the human factor. So that's still important: the ability for data stewards to play a role in reinforcing that an engine is making the right decisions. The other question we would often get is: well, when will this thing automate itself? That's also a very hard question to answer. It comes back to this idea that the more time you spend labeling, the smarter the engine gets over time.

So then we get to our unsupervised approach. The first challenge was figuring out what type of neural network we could actually apply to this, if any. What we ended up going with was a recurrent neural network. That's what we decided on. There was always a time factor to the data, and that's one of the focus areas of RNNs: time is a factor, not space. But where we struggled was figuring out what parameters we should pass into our activation functions: if we have multiple layers in the neural network, what were the actual inputs? And of course, if you've used these before, you'll know that one of the factors behind a recurrent neural network is backpropagation; that's the learning factor, how it reinforces that it's making the right decisions.

It took us quite a long time to figure out the properties we would need to feed into this system, but in the end it was staring us in the face. It was the calculations we were making in the first step: the string distance functions. There are many different string distance functions; we ourselves use three. And for this step, Dice: it's actually called the Dice-Sørensen coefficient. Sørensen was a Danish mathematician and statistician. So it was a combination of calculating the string distances between properties, converting those distances into float values, and using those as the inputs passed through the layers and their activation functions, with backpropagation as the learning mechanism. And this was the learning factor. This was the piece that was able to teach our engine how much string distances play a role in merging records.
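A minimal sketch of this idea as I understand it from the description: compute a handful of string-similarity scores per record pair, treat a sequence of such comparisons as time steps, and feed them through a small recurrent network trained with backpropagation. This is my illustration using PyTorch; the architecture, sizes and exact features are assumptions, not CluedIn's actual model.

```python
import torch
import torch.nn as nn

def dice_sorensen(a: str, b: str) -> float:
    """Dice-Sørensen coefficient over character bigrams."""
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))

# Imagine each record pair yields a feature vector of string similarities,
# e.g. [dice, levenshtein_sim, jaro_winkler_sim]; a sequence of comparisons
# over time becomes the RNN input.
class DuplicateScorer(nn.Module):
    def __init__(self, n_features: int = 3, hidden: int = 16):
        super().__init__()
        self.rnn = nn.RNN(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (batch, time, n_features)
        _, h = self.rnn(x)                       # h: (1, batch, hidden)
        return torch.sigmoid(self.head(h[-1]))   # duplicate probability

model = DuplicateScorer()
pair_features = torch.tensor([[[0.9, 0.8, 0.85]]])  # one pair, one time step
print(model(pair_features))  # untrained; training would use backpropagation
```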
And it took quite a lot of data to get it to a point where it was reliable. Here's one of the other tricky things about this space: everyone's data looks different. You cannot apply the same models for something like deduplication to a bank and an insurance company, or even to two banks. Their data looks drastically different, coming from different systems, in different formats, at different rates. And what this means is that, using this approach, for every new customer we went into, there was the annotation phase: they had to spend time annotating, or at least pre-training the engine with historical duplicate data. Fortunately, a lot of the common platforms you see for MDM use cases, for probabilistic matching engines, for deduplication engines, have at least stored the history of which records were merged. And in some cases these systems even allowed data stewards to record why: what properties made a data steward manually merge, clean and prepare this data.

So, if you're interested in learning a little more about data preparation: this is just one part, the part that's trying to automate portions of the data preparation phase. Those of you who are data engineers, who work with wrangling and massaging data and preparing it for use cases, will know there's certain data that cannot be automatically cleaned or automatically enriched. But this goes a very long way for the common things that we can clean with a certain confidence level.

And that's one of the things that came out of the big data era: what we highlighted during that movement was that the traditional mechanisms for solving data preparation weren't scaling. And instead of companies treating data as either right or wrong, what we find is that companies are moving much more to a risk level, a confidence level, to be able to keep up with the sheer volume and variety of the data. So really, working with data preparation now comes down to confidence. How confident can we be that the data is in a state we can trust, that we can audit, that we can show the lineage of where it came from, so we can fuel machine learning use cases?

So with that in mind, many thanks for attending; I hope you learned something. And feel free: Neo4j, for example, is an open source piece of software, free to try out. I would also say it's one of the easier databases to get up and running on your machine. It's installable in pretty much every way you can imagine, such as spinning up a Docker image of Neo4j and playing around with it. It's one of the easier databases to get started with.
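For reference, a typical way to spin up Neo4j in Docker, using the official image from Docker Hub (7474 is the browser UI port, 7687 the Bolt driver port):

```bash
# Browser UI on http://localhost:7474, Bolt driver connections on 7687
docker run --rm \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/your-password \
  neo4j:latest
```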
So with that, thank you very much, and I'd be happy to take some questions if anyone has any. Thank you.