Hello and welcome. My name is Shannon Kemp and I'm the Executive Editor of DATAVERSITY. We'd like to thank you for joining this DATAVERSITY webinar, Metadata and the Power of Pattern Finding, sponsored by Objectivity. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. For questions, we'll be collecting them via the Q&A in the bottom right-hand corner of your screen. If you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag DATAVERSITY. And if you'd like to chat with us and with each other, we certainly encourage you to do so; just click the chat icon in the upper right for that feature. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now, let me introduce our speaker for today, Leon Guzenda. Leon was one of the founding members of Objectivity in 1988 and one of the original architects of the Objectivity/DB database. He currently works with Objectivity's major customers to help them efficiently develop and deploy complex applications and systems that use the industry's highest performing, most reliable database technology. He also liaises with technology partners and industry groups to help ensure that Objectivity remains at the forefront of database and distributed computing technology. And with that, let me turn the webinar over to Leon to get us started.

Thank you very much, Shannon. Hello, everybody. Okay, let's start here. I'm just going to briefly describe who we are and then talk about open source analytics before saying where we fit in, and then move straight on to pattern finding and applying various kinds of analytics to solve some quite interesting problems. So, Objectivity has been around quite a while. We've been NoSQL the whole time. We got into big data in the mid-90s, and when we say big data, we were doing petabytes by the turn of the century. You can see our verticals here. We tend to specialize in complex, distributed, scalable database applications and graph analytics, which has become an increasingly important part of our customer base. There are a few sample customers here, spread across the spectrum. No financial customers are mentioned there, just because they're all very private.

So, let's get right into talking about data analytics. If we look at open source analytics, I'm sure you'll all be familiar with this stack here. We have Spark, Capacitor, Storm, and mixtures of those things, with workflow control via YARN, and then the well-known platform players at the bottom here, all of whom we've partnered with, by the way. And then you run analytics components like MLlib or GraphX in conjunction with visualization tools. Excuse me. Now, the good thing about that is that there's a very large community out there and lots of algorithms. Some of them overlap a lot, but we do know that the model works at scale, because that's how big data came about. People had been running things in high performance clusters or on supercomputers, and now, of course, the paradigm has shifted somewhat to shared-nothing, but those other technologies still have a large part to play. The great thing is that the startup costs are low if you have people who understand the technology.
And it tends to be very cost-effective, although there are situations, of course, where a dedicated machine can be more efficient, as has been shown by many benchmarks. From our point of view, what we noticed was that most of the analytic algorithms are based on statistical techniques: correlation, clustering like k-means, or filtering by setting thresholds. And the graph algorithms out there mainly tackle theoretical problems that came about in, let's say, the social network era — things like centrality and PageRank — but all those algorithms are very well understood. Then in the big data environment, Hadoop was set up to deal with files, primarily, not with metadata. And so although almost all of the platform specialists are now adding tools for dealing with metadata, they're almost all focused on the technical parameters. So they know the name of the file, when it was created, when it was last updated, how many records are in it, that kind of thing. But they're not looking inside the files and getting the semantic content out of them. And that's really where we come in. I'm going to assume that in all these cases the metadata in our graph has been extracted from somewhere, and most likely one would have used a tool from one of our partners, someone like Talend, to get the data, blend it, and then insert the metadata into the graph. The good thing about all this being object oriented, of course, is that you can have high level concepts: the highest level being the vertices, the nodes; the edges, the connections between them; and the properties, the fields if you like, that go in the vertices and edges. But then you can inherit from those things and do all the things you would do in a normal object oriented language, and that turns out to be very good for dealing with lots of variants.

Perhaps one of the best known graph tools, particularly in the Spark community, is GraphX. It has a very nice, rich set of capabilities that's continually developing, and I'll come back to that in a minute. You can do just about anything you need to do with vertex and edge triplet operations, and then there are the graph modification operations. Resilient distributed datasets, RDDs, are the Spark construct where you basically have a tabular presentation of the data, no matter what the underlying representation actually is. You then have to perform join operations to do any kind of navigational pathfinding. There are iterative graph-parallel operations and triplet operations — we'll see some of those in operation later — and then things like PageRank and connected components. Connected components is very useful for finding things like islands. Now, recently RDDs were enhanced with things called DataFrames, which put a schema on them, and GraphFrames do much the same, as the name suggests, for graphs. Excuse me. One of the most particular things about GraphFrames is that there's a facility called motifs. A motif is a little structure, a sub-graph pattern, and that will come in useful particularly for things like islands. However, when you try to do very efficient pathfinding and very complex navigation, where you want to include only certain node types or exclude certain node types, or the edges, the connections between them, then it becomes less efficient: you're doing lots and lots of implicit join operations. The good thing about Spark, of course, is that these things can occur in parallel, but pathfinding operations need, at some point, to find the shortest path or find all paths.
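As a rough sketch of the motif facility just mentioned (everything here is illustrative — the data, the column names, and the "follows" relationship are made up — and it assumes a Spark session with the GraphFrames package available):

import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder.appName("MotifSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Tiny illustrative property graph: people as vertices, "follows" edges.
val vertices = Seq(("p1", "Fred"), ("p2", "Mary"), ("p3", "Jane")).toDF("id", "name")
val edges    = Seq(("p2", "p1", "follows"), ("p3", "p2", "follows")).toDF("src", "dst", "kind")
val g = GraphFrame(vertices, edges)

// A motif is a small sub-graph pattern; this one asks for two-hop chains a -> b -> c.
val chains = g.find("(a)-[e1]->(b); (b)-[e2]->(c)")
  .filter("e1.kind = 'follows' AND e2.kind = 'follows'")
chains.show()

Under the hood a motif query like this still expands into joins over the vertex and edge DataFrames, which is why longer or more selective paths tend to get expensive, as described above.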
And there's no easy way to represent that kind of path search in a straightforward tabular structure. So that's where we come in. We deal with complex objects where relationships are first-class citizens, and we specialize in ultra-fast navigation and pathfinding. We're not limited by the amount of RAM — many of the dedicated boxes out there rely on RAM; we don't. And because we're a distributed database with distributed processing, we scale and perform really very well. That's about all I'm going to say about our product for the moment, other than that everything is distributed. ThingSpan fits in up in the top right hand corner there, into this open environment, and provides the metadata store and the very fast pathfinding and navigational access. You can store the data either in HDFS or in POSIX; it just depends on what your environment is. It's distributed from top to bottom, so this is kind of unique. Spark, of course, is distributed, and the worker nodes are our clients. We have a distributed database that fits neatly onto the HDFS distributed file system, so it's very good at load balancing, and you can scale out and scale up. And then we provide this REST server and these APIs so that you can perform these various kinds of computation and analytics. I'll skip past this; it just shows the components.

Let's get right into pattern finding, which is the core of what this session's about. I mentioned before that we noticed most of the analytics out there use things derived from the business intelligence world — leaving aside scientific data, which tends to be somewhat different — but they're basically statistical, and you're trying to find relationships between parameters. The kind of thing you might be looking for is, let's imagine that I'm a credit card company and I decide to change the interest rate on certain kinds of transactions for three months in a certain demographic or area. What I want to do then is come back later and find out whether the number of transactions or the value of the transactions increased and, more importantly, whether we've made more or less money out of that little adventure. And that's great — a lot of analytics is about statistical correlation. But graph pattern finding tends to be rather different. You can find outliers — these become very important in different kinds of pattern finding — and you can do that with SQL or with MLlib components, things like k-means. But then you start wanting to do navigational queries to explore out through the graph, perhaps selectively, or pathfinding queries to find the shortest routes or the most effective routes between things. The order in which you apply these different techniques depends entirely on the problem. You might start by finding outliers and then do a pathfinding query, or a navigational query, or you might start with a straightforward GraphX algorithm and then move on to more complex pathfinding, perhaps finding the leaves out on the graph. As an example of a pathfinding query, here's a very simple schema. The vertices are City objects, and the edges, the connections, are these Link objects — there's a many-to-many relationship there. The mode would be something like road, rail, air, or water, plus the time it typically takes to traverse the link, and the cost. Oops, excuse me. I hope I've got that back, okay. So we're going to do four examples.
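Before the examples, here is a rough sketch of what a query over that City/Link schema might look like in GraphX (the cities, links, and numbers are invented; it assumes a spark-shell-style SparkContext named sc; and note that GraphX's built-in ShortestPaths works on hop counts rather than the time or cost properties):

import org.apache.spark.graphx._
import org.apache.spark.graphx.lib.ShortestPaths

// Illustrative City/Link schema: vertex = city name, edge = (mode, hours, cost).
case class Link(mode: String, hours: Double, cost: Double)

val cities = sc.parallelize(Seq((1L, "Denver"), (2L, "Chicago"), (3L, "New York")))
val links  = sc.parallelize(Seq(
  Edge(1L, 2L, Link("rail", 18.0, 120.0)),
  Edge(2L, 3L, Link("air",   2.5, 220.0)),
  Edge(1L, 3L, Link("road", 40.0,  90.0))))
val graph  = Graph(cities, links)

// Selective navigation: keep only road and rail links, then find how many hops
// each city is from city 3 (an unweighted shortest-path query).
val ground = graph.subgraph(epred = t => t.attr.mode != "air")
val hops   = ShortestPaths.run(ground, Seq(3L))
hops.vertices.collect.foreach(println)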
We're starting with one in the financial arena, one in the government arena, one in advertising technology, and then an industrial Internet of Things example. So let's look at money laundering. This is a very simple example, I might add — in reality, things are passed through chains of companies and people, and the accounts might not be traceable back to the people at the ends of the chain. But for simplicity here, we've loaded everything into the graph, and we have the people, the accounts, and the transactions there. What we're going to do first is just use GraphX, use a centrality-style algorithm — the code, if you like — run it in parallel across all the people in the graph, and identify people with more than five accounts. That's obviously a threshold that would be parameterized. And we've found one there, person two. So the next step is to explore the graph from there and see where the transaction trail ends, and we're particularly interested in ones that end in offshore accounts. Sure enough, person two has this string of transactions, all occurring probably at different times, moving between accounts and ending up probably buying some property or investing in a business somewhere, and then money is moved from there, perhaps periodically, into some other account. So the person and the string of accounts and the companies involved definitely bear investigation. Very simple, but all of this can be automated into a few queries, it can all be run in parallel, and it can be run as transactions occur. So when a transaction occurs with one of the people that we're interested in, we begin to get more and more of the picture, because they might accumulate money in lots of places over quite a long period of time before they make the ultimate move that identifies them as definitely being involved in the money laundering.

Now, in the government realm, we've chosen one on the human intelligence side. This is what security people — homeland security people and the DOD, for instance — are interested in. What we've done here is load the graph with people, telephone call detail records, CDRs, so calls, some places that we're either mildly interested in or very interested in protecting, and sightings of people around those locations. We've cut it down here to the people and their telephone calls, and we can use GraphX to find islands of callers and callees. An island is a tightly connected group — a clique, if you like; there are various words for this — and if you look for connected components in GraphX or other algorithm libraries, you'll find things that do this. It's quite tricky: you often need to do it iteratively, and of course things can change. You can apply the iteration over 20 or 30 cycles and then find the island is broken because some other connection has come to life. So this is quite a compute intensive application, and this first step is really important. Once you've found an island, you can start looking at it more closely. So what we're doing now is looking at the people in here and looking to see whether any of them have been sighted near any places that need to be protected. Sure enough, we've got something there. If you look, you can see that there are two people, P14 and P15, who've been seen near place X, which is something of great interest to us.
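As a rough illustration of the island-finding step in that example (the people, calls, and ids are invented, and a real system would of course work over call detail records at a very different scale), GraphX's connected components can label each caller with the island it belongs to:

import org.apache.spark.graphx._

// Illustrative call graph: vertices are people, edges are "called" relationships.
val people = sc.parallelize((1L to 10L).map(id => (id, s"P$id")))
val calls  = sc.parallelize(Seq(
  Edge(1L, 2L, "called"), Edge(2L, 3L, "called"),   // one island of callers/callees
  Edge(7L, 8L, "called"), Edge(8L, 9L, "called")))  // another island
val callGraph = Graph(people, calls)

// connectedComponents labels every vertex with the lowest vertex id in its island.
val islands = callGraph.connectedComponents().vertices

// Group people by island so each tightly connected group can be examined more closely,
// for example by checking its members against sightings near protected places.
islands.map { case (person, island) => (island, person) }
  .groupByKey()
  .collect
  .foreach { case (island, members) => println(s"island $island -> ${members.mkString(", ")}") }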
At the point where we see those sightings, we can decide that they're of interest, and because they're of interest and because they're in this clique, all five people in that clique need to be carefully looked at as well. Now, this could be quite innocent, obviously, but in reality we want to turn the focus on these people and look more closely at what they're doing. That's interesting — something went very blue here; I'm not sure why that went blue. All right, let's hope the next slide goes better.

So what's happening here is that we want to place ads. This, of course, is done all the time online. We have products, which are the objects along the top, and then we have sales to people, and then we're merging that information in the graph with information on who follows whom in social blogging and so on. It's still remained blue — I'm not sure why — but the highlighting there shows us what we're going to focus on. We're going to focus on product PR2, and we're going to look and see which of the bloggers who bought PR2 also have followers. We find that Fred, the second person from the left there, bought PR2, that Mary follows Fred's blogs, and that Jane and Bill also follow Mary's. So if you're the advertiser, you wait until you see Mary, Jane, or Bill pop up on your site, or on one of the sites where you're paying for adverts, and then you put a very subtle message in front of them to go and buy product PR2, or you offer them a discount, or whatever it's going to be. This is very straightforward, but it needs to run at very high speed. Generally, in this kind of application, the initial investigatory work — working out the social graph and so on — would best be done with Spark. And then the positioning, so that you can present the correct adverts to people at the right time, would probably be cached pretty much in memory, waiting for the opportunity to display it instantly, because you have very little time in the displaying of a page in which to put this information in front of people and present the case properly. So it's a mixture of pre-planning using Spark and, in this case, quite straightforward navigational queries, combined with the server technology that will present the ad at the right moment, which again is going to be very heavily memory dependent.

All right, so, ha ha, this is nice — we've come back to the white world from the blue world. Let's look at another one. There are many areas in the industrial Internet of Things. I'd encourage you afterwards, if you'd like to look into this area a bit more deeply, to have a look at our website, where we cover things like smart homes — adjusting thermostats based on internet information about the weather, what's going on in homes in the area, and so on. But this one is looking at network equipment, and it could be telco, it could be data transfer, or a mixture of the two. We load all the equipment detail in, plus the links and the loadings — the percentage load on those links — into the graph. Now, this will get updated gradually over time, but in reality, apart from adding new subscribers out at the edge, not much changes in major networks on a minute-by-minute basis other than routing; you don't necessarily add major pieces of equipment every five minutes. So the first step here is to use Spark SQL and just run a very simple select query to find all the links that are over 90% loaded.
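A minimal sketch of that first Spark SQL step (the link names, equipment ids, and load figures are made up; it assumes an existing SparkSession named spark):

import spark.implicits._

// Illustrative table of links and their percentage load, registered as a SQL view.
val links = Seq(
  ("link5", "E22", "E31", 72.0),
  ("link6", "E22", "E32", 93.5)   // over the 90% threshold
).toDF("link_id", "from_equipment", "to_equipment", "load_pct")
links.createOrReplaceTempView("links")

// The first step in the network example: a plain select to find overloaded links.
val overloaded = spark.sql(
  "SELECT link_id, from_equipment, to_equipment, load_pct FROM links WHERE load_pct > 90")
overloaded.show()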
And we found one right there, between E22 and E31, 32 and 33 — it's link number six. So now we do a more interesting navigational query, and we go in both directions — back to the left and to the right, looking at it from this perspective — to find the leaves of those sub-graphs. The reason we want to do that is to find out where the traffic's originating. Now, in a real telecom network, what happens is that individual pieces of equipment have thresholds set locally, and if things go wrong they raise alarms, and the alarms are transmitted out to management servers that will reroute — you know, turn on more equipment or reroute things. However, those things by and large work very locally. So when a box is overloaded, it knows there's another box next door to it, so it will switch that one on. That may not be the best thing to do; it may be better to completely reroute things through another city or a completely different set of equipment, and that's actually what happens here. Once we've found that the data's coming from one place on the right-hand side there and ending up in a couple of places on the left-hand side, we can decide what we actually want to do about it. The diagnosis is pretty straightforward. In this case, we're pretending that someone's loading ultra high definition TV video from coast to coast, and so at certain times — probably periodically, when a popular program is on, for instance — the equipment's going to get overloaded. We could go into predictive analytics as well, but for the moment all we need to do is switch on this other link, and then some of the traffic will automatically reroute, because the boxes individually will look for the most available route to send data out. So the solution here was actually very straightforward, but the graph query in its own right is quite complex. In a real network, we're probably talking, from end to end, maybe 30 to 100 hops. Running that kind of query on a traditional database, a relational database, or actually most NoSQL technologies, is going to involve a lot of lookup operations and join operations. So if you have a graph structure underpinning all of this, or if you can load it all into memory, which is quite often possible with Spark, then the thing's going to run a great deal faster.

So that's what we've looked at, quickly. I was prepared to take lots of different examples here, but I think we wanted to give you a quick view of the kinds of pattern finding that are involved. You can combine open source fast data and big data tools, which are really good at what they're designed for, and then, if you extract the metadata from files or from the other parts of the data that you own and put it into a graph, you open up new possibilities. The key to it all is being able to do ultra-fast selective navigation and also pathfinding queries. The pathfinding queries are particularly difficult in a very large graph, and the kind of island-finding algorithm that we looked at earlier in that human intelligence problem is difficult to crack without the parallelism and power of something like Spark. And with that, I think there are a whole lot of questions here, and that's really where we want to spend our time. So back to you, I think, Shannon, as moderator on this.

Thank you, Leon.
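Before the Q&A, here is a very rough sketch of the leaf-finding navigational step from that network example (equipment names and loads are invented; a real query would restrict itself to the sub-graphs reachable from the overloaded link rather than the whole graph, and would typically run over far more hops):

import org.apache.spark.graphx._

// Illustrative equipment graph; the edge attribute is the link's percentage load.
val equipment = sc.parallelize((1L to 6L).map(id => (id, s"E$id")))
val netLinks = sc.parallelize(Seq(
  Edge(1L, 2L, 40.0), Edge(2L, 3L, 93.0),              // link 2-3 is the overloaded one
  Edge(3L, 4L, 55.0), Edge(4L, 5L, 35.0), Edge(4L, 6L, 30.0)))
val net = Graph(equipment, netLinks)

// The leaves of the graph are boxes with only one link: the likely origins and sinks
// of the traffic that is flowing over the overloaded link.
val leaves = net.degrees.filter { case (_, degree) => degree == 1 }
leaves.join(net.vertices)
  .map { case (_, (_, name)) => name }
  .collect
  .foreach(println)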
And of course, one of the most common questions that we get throughout the webinar is people asking about the slides and the recording. Just a reminder: I will send the follow-up email out to all registrants by end of day Thursday with links to the slides, the recording, and anything else requested throughout the webinar. First question coming in for you, Leon: this data — for example, the network congestion data that we were looking at — needs to be loaded first, and the data keeps changing all the time. So how does the loading of data that's changing every second happen, and using what tool?

That's a good point. A lot of the data is like that; there's a lot of streaming data. It wouldn't just be telecom data but financial transactions, for instance — the money laundering case we looked at, or trying to detect fraud in trading stocks or derivatives or bonds, that kind of thing. You can use Spark Streaming for some of this data. In the telco network, that would probably be okay; it would probably be fast enough to deal with the situation we saw here. It wouldn't be fast enough to deal with things that need to change at the couple-of-millisecond level, and those things can happen in telco networks, particularly when alarm swarms occur or when there's some large event in an area and the local equipment gets congested. In that case, I'd recommend using something like Kafka. We have a proof of concept, for instance, that deals with the financial case, and it's using Kafka to take the incoming financial transactions — data about the transactions — and then it's using the queuing mechanisms in Samza to divide up the work in order to load it into the graph in near real time; obviously there are IOs occurring. And in our particular situation, we can either load things in consistently with ACID transactions, or we can pipeline things so that they enter the graph but aren't necessarily visible to everybody until everything's been connected up. So I'd say a combination of Spark, Spark Streaming, Kafka, Flume — and there are other open source technologies out there as well — whichever deal with the fast-moving stream, are important. In a real system, besides just doing the analytics in the background, so to speak, you'd probably put algorithms and thresholds and triggers into the front end of the stream, do some complex event processing, and then take information from those triggers to enhance the graph as well. Right back in the beginning, I showed a picture where you saw the open source components, the Kafka streaming, and ThingSpan. Well, rules get derived and the parameters change, and a smart system doing predictive analytics will feed new parameters back into the complex event processing stage so that it detects more things, or throttles things down if too many things turn out not to be of any great interest. So, a combination of technologies — we have applied Kafka and Samza with some great success.

I love it. The next question coming in: would a model of having a relational data store or NoSQL database for the metadata work? The actual data can be in graph databases, of course. Any comments on that?

Yes, quite. Graph databases have had a resurgence in the last few years as an open source community, and our own products, object databases, are inherently networks — that's what they were all about.
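Going back to the streaming question for a moment, a minimal sketch of reading a transaction stream from Kafka with Spark Structured Streaming might look like this (the broker address and topic name are invented, it assumes Spark's Kafka integration package is on the classpath, and it is only one of several ways to do the ingestion described above — the proof of concept Leon mentions pairs Kafka with Samza rather than Spark):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("TxnIngestSketch").master("local[*]").getOrCreate()

// Read a stream of transaction records from an (illustrative) Kafka topic.
val transactions = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "transactions")
  .load()
  .selectExpr("CAST(value AS STRING) AS txn_json")

// Each micro-batch could then be parsed and loaded into the metadata graph;
// the console sink here is just a stand-in for a real graph-loading sink.
transactions.writeStream
  .format("console")
  .start()
  .awaitTermination()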
In fact, the reason that Objectivity came about was that, in our prior lives, we were having a lot of trouble dealing with highly interconnected data and data that had lots of variants, and so the object database became a very powerful tool for working with that kind of data. Now, over time, obviously, the relational databases have had a very, very significant part in the data world, and we never say that we compete head-on with the relational databases, because that's not our aim. And I think with the NoSQL players, the fact that there are so many of them is testimony to the fact that not everything fits neatly into tables, and document stores and key-value stores and so on all have a part to play, but they play in different parts of the spectrum. The graph databases, by and large, so far — most of them don't scale that well, or if they are distributed, they still rely very heavily on memory to perform anywhere near well. So you really need an engine that's been built from the ground up to deal with complex relationships, many-to-many relationships, and then you need the kinds of queries that are hard to do with SQL. To give you a simple example: if you have lots of tables and lots of columns and you try to find the connection between a row in one table and a row in another table, and there are many tables involved, that takes a lot of join operations and gets very, very slow. Whereas with a graph database, that's going to be very fast. But then there are different variants on how you tackle the problem. You have the message-based distributed approach — Pregel, basically, which GraphX is based on as well — and then you have the single-logical-view distributed database, which is what we supply. So I think you do need to look carefully at latency, how quickly data is coming in, whether it's mainly read-only from that point on or whether it gets changed a lot, the lengths of the paths if you're pathfinding, and the numbers of different kinds of vertex and edge that might get involved in a path. And then you've got these graph-wide problems, if you like — finding islands, or finding the span of the graph, which is important in some applications — where you really need to go very highly parallel and apply a lot of computing, and there's going to be a lot of IO involved as well in that kind of problem. Unfortunately, I think there's no one answer.

And here we've got quite a few coming up. Can you describe the metadata you're speaking of and how it gets captured?

Right, yes. Well, the metadata can actually be any kind of data that you can extract from the data in your resource. Let's take an example. Suppose that you have video streams being captured off cameras somewhere — let's say in a big facility like an airport. The video streams themselves can go very neatly into a specialized storage structure or straight into HDFS files. You can chop them up into five-minute segments or something, and you obviously want to record what those segments are so you can go back and put them together in the right order and so on. But then let's imagine that you also have face recognition algorithms, or gait recognition, where you're looking at how people walk, which is characteristic of individuals. Those algorithms will run when you're looking at the data at the front end.
Some of them will take quite a bit of time to run, and then they'll extract information which — like fingerprint information — will be condensed into some smaller structure, which in itself might be a little graph structure. And it's that data which you would then store into the graph. Now, you'll have some fixed data. There's almost always some fairly static data, like the locations in the airport — the gateways, the tunnels under the airport; all those things are pretty static — and you'll want to relate the things and the people to where they are, where they pop up. You know, right now they're in the store, and now they're in the airline club, and now they're heading down towards the gate — oops, they just went through a door somewhere that goes out onto the concourse; that's not good. So that's the kind of metadata. It will depend entirely on the problem. And I think the more you talk to data scientists, the more you find that the first thing they discover is that what they need wasn't in the data they were already collecting — they discover they need to find other kinds of data as well. Or the data was being collected, but it was being thrown away quickly because they didn't have a means to store it, or the means to analyze the volume that occurs. We have some really huge systems running, and I don't know how many new nodes and edges are added per day, but we're told that the analysts who use those systems regularly process tens of trillions of nodes and edges per day in going about their normal business. Now some of that is done with machine learning, and some of it is done in pretty much ad hoc queries, or in canned queries that they modify, where they add things to the queries in order to discover new stuff. And of course the picture's changing all the time. So one of the great things about the graph, particularly if you have a flexible schema as we have, is that you can add new node types and new connection types as you find the need for them, or you can go back and consolidate things. You might find out that you're not really interested in whether they got there by rail or road or water — you can just say they got there, and create a new kind of relationship that just says there was a trip between place A and place B, and records the person or persons who made those trips. So the whole thing is changing. And in some situations — in the intelligence field, for instance — new kinds of information can get discovered every day, and that needs to be propagated out to all the people who need to be cognizant of it.

And you don't have to write the whole program yourself. I was just thinking, blending has become a big topic. Tools such as Talend can extract data from other places; Waterline Data, Platfora — there are lots of different tools there. Some of them require programmer effort; some are becoming much more automated, so you can just present users — data scientists, or even end users — with a palette of different kinds of data and sources, and they can tie them together and get analytics out quite quickly. If you've never tried it, I suggest, on a small scale, looking at something like Orange, which in 10 minutes you can figure out. You can hook it up to simple data sources, standard kinds of data sources — CSV files and spreadsheets and so on — and get a good feel for that kind of platform. That's why we are an engine builder: we don't build visualization tools.
We provide things that look at our data — look at our graph, of course — but that's not what we set out to do. So we don't do the machine learning, we don't do the blending, we don't do the visualization, but we have partners who do all of those things.

Sure, that makes sense. I don't know if you have something handy, but maybe we can get something to send out in the follow-up email. Do you have a slide of a sample technology ecosystem?

Sorry, a sample technology ecosystem? Because that was what you said. I think, yeah, I think we have several. There may even be an appropriate one up on our blog. So yes, we'll make sure we include that when we send out the answers.

I think they'll love it, and I'll get that out to everyone in the follow-up email with links to the slides and the recording. And is pathfinding done at name level, data type, metadata item type, or even further?

Oh, that's a good question. Yes, because we get into schemas and talk about metadata, and then there's metadata about metadata. In a good pathfinding system, you should be able to use properties of the individual objects and connections, or go by the types, and if you have inheritance, of course, that becomes even more powerful. So you might do a first query, for instance, to find out whether there's any connection between person A and person X — just any kind of connection. That can work at the high level, depending on the instrumentation of the database. If it's got a good, strong graph database underneath, then it's going to be able to run across any kind of connection, regardless of its type, and find whether or not there is something. And if the platform is good, it may be able to answer the first part of that question without having to traverse the graph at all — we actually have technology which will rule out some things that are not going to result in a path very, very quickly. So the ability to include particular node types and exclude others, I think, is very important. Then the other criteria become mainly properties, or it may be that you don't want to go past a node if that node is connected to another node of a specific type. Let's suppose we have nodes that represent kinds of weather, and I'm traversing the graph going from the West Coast to the East Coast and I get towards St. Louis, and I find that St. Louis right now is connected to a thunderstorm node. Then I don't want to go any further — I don't even want to get to St. Louis; I need to explore other paths instead. So yes, it is important, and a good graph query will involve types, and inclusion and exclusion, as well as the actual individual properties. You can also, of course, as I said, once you've got everything in the graph store, go in there and find things with straightforward SQL to get started and look for outliers, or apply a machine learning library, or use other categories — some other kind of index, for instance, or a collection. So a collection that's been built like a sub-graph — you build a sub-graph — and then that gets enhanced as things get updated. The one that comes to mind was built with a legal database company some time ago, to answer questions for lawyers about particular types of case and the case law that went with them. And then if anything changed in another judgment, all those lawyers were sent updates — their pagers were hit, in those days — to tell them that something had changed in the book.

I just love all these questions coming in. Thank you so much.
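As a rough sketch of that kind of type-aware inclusion and exclusion in GraphX (the cities, the weather node, and the "currently_has" edge are all invented, and ShortestPaths counts hops rather than using travel times):

import org.apache.spark.graphx._
import org.apache.spark.graphx.lib.ShortestPaths

// Illustrative mixed graph: city vertices plus a weather-condition vertex.
val verts = sc.parallelize(Seq(
  (1L, ("city", "San Francisco")), (2L, ("city", "St. Louis")),
  (3L, ("city", "New York")),      (9L, ("weather", "thunderstorm"))))
val routes = sc.parallelize(Seq(
  Edge(1L, 2L, "road"), Edge(2L, 3L, "road"),
  Edge(1L, 3L, "air"),  Edge(2L, 9L, "currently_has")))
val g = Graph(verts, routes)

// Find the cities currently attached to a thunderstorm node ...
val stormy = g.triplets
  .filter(t => t.dstAttr._1 == "weather" && t.dstAttr._2 == "thunderstorm")
  .map(_.srcId)
  .collect
  .toSet

// ... exclude them, then run the coast-to-coast (hop-count) path search to vertex 3.
val safe = g.subgraph(vpred = (id, _) => !stormy.contains(id))
ShortestPaths.run(safe, Seq(3L)).vertices.collect.foreach(println)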
Next question: is it possible to define a schema template for a particular data set in a graph database and enforce compliance with that schema?

It depends entirely on how you decide to store the data. If you're putting it into some kind of unstructured data store, you'd have to recognize a type and apply the rules yourself. If you're putting it into any kind of database that has a schema, then it will enforce rules on types, and particularly on the connections. In our case, for instance, all of our products will enforce cardinality rules, and the connections have to be of a known type unless you're using just a generic link. So I'd say a good graph database will definitely raise flags if you try to put the wrong kind of connection between things. If you try to store something you don't have a schema definition for, you're going to get an error as well. But the good news is that you can then handle that as an exception and decide whether to add a new inherited class or create an entirely new class. Sometimes those things require human intervention, or they can get out of hand: if you think of a set of data that's got 20 components, how many different object classes can you make out of those 20 components? If you let the machine do it automatically, you may end up with a graph that doesn't tell you as much as you want.

And right along those lines, Leon: for data with multiple hierarchical levels, would a graph data store be a good fit, compared with a document store?

Yes. In general, document databases have their place. They're geared to store chunks of text, or structured text these days, and they generally have relatively few connections between the documents. There are a few, obviously — with scientific documents you might have a hundred references — but in most cases you've got a bunch of data that lives together for most of its life and has some connections to other things, and the document database is perfectly adequate for most of those kinds of problems. With an object or graph database, you can break the thing down into smaller components, and you could even break it down to the word level — literally to the word, so you've got connections from the word to all the sentences that use that word. What you've got to watch out for is the balance between the kinds of queries you're going to do — simple graph queries — and the amount of IO you're going to do. Let's imagine that the main data type I have is, let's say, a 40 gigabyte video. I can store the 40 gigabyte thing and I can store another one next to it, but if I want to store the connection, the connection is going to have to go somewhere else. And so it makes sense to leave the data where it is and then just take the metadata — the name of the video, and any other things you're interested in: who are the characters, who's in the cast, who's the director, all the kinds of metadata you find in something like IMDb, for instance. That's a very good application of a graph online. And then you can ask very interesting questions without having to go and do all the IOs involved in pulling up the individual objects, the movie objects, because clearly that's not necessary to do that kind of refined research.

And this next question is very specific to the Objectivity database: how much integration work is necessary for the Objectivity APIs to use Spark — RDDs, DataFrames, and Datasets?
Very little. Out of the box with Objectivity/DB itself, you would have had to write a Java program to do that — until recently. With ThingSpan, we've done that work for you. There's a component in ThingSpan called ThingSpan for Spark, and it works two ways. It can go into the underlying ThingSpan metadata store, extract the schema, and produce RDDs. You can guide it, because you may not be interested in all the object classes in there — maybe you've got 400 object classes and you only want three or four of them available up to a machine learning library, for instance. So you define what you want, and then we automatically generate the RDDs, and if the underlying parts of the graph change, we automatically regenerate those things for you. So when you apply other Spark components to the RDDs — sorry, to the DataFrames; we present them as DataFrames, with schema — Spark will make sure that you have worker nodes close to the DataFrames in memory that are needed by that worker node. And then if anything changes underneath at this level, because of updates coming in, say, from streaming data, then the DataFrame itself will get refreshed as well. So it's pretty much seamless. Now, going the other way, if you drop ThingSpan into a Spark environment and then use the regular mechanisms to define DataFrames, and direct them so that they go into ThingSpan, then it will start collecting that data straight away and building the graph. So you don't have to use our tools to create the schema. But the problem at the moment is that DataFrames are tabular, and that's where GraphFrames are much more interesting. The spec of GraphFrames has just about firmed up; I think we're going to start seeing real things that we can work with in June or July. Because they have a much more natural fit to the underlying graph structures, I think that's going to make life a lot easier. But right now, yes, you can define it, and we will figure out that there are graph connections in the underlying database and make sure that we exploit them when we use the DataFrame, because we provide adapters as well as the data structure. You may think you're doing a join up in Spark SQL, for instance, but in reality we'll be using the relationship between the customer and their account, you know, and a particular transaction.

Well, as you just mentioned this — sure, yeah — as you just mentioned this, I'll skip to the next question. Do you have customers that work with large OWL files?

Large what files, sorry?

OWL, O-W-L.

Oh, OWL — I'm sorry, yes, I didn't catch that. Yes, we've had customers work with large OWL files, mainly, if I remember correctly, over in the advertising world, the ad tech world. OWL was very interesting. It started out with a lot of strength behind it. Then we found that our particular customer base over in the intelligence community found a lot of shortcomings with OWL, and they were generally using Horn logic, for instance. So it didn't catch on quite the way it might have done, as far as we were concerned. We do have an example, I think, up on the website — if you go into the developer network support side, you'll find some examples with RDF.
We don't supply OWL tools or any of that infrastructure, but there's no reason why you can't store OWL or store RDF very cleanly in ThingSpan or in the graph database. They'll go into the metadata store with no problems, and as I say, there are examples of doing that. So we don't provide tools for that directly; they would have to work in our environment — either through ODBC or, probably with ThingSpan, through DataFrames and then ultimately GraphFrames, hopefully by the end of the year.

Sure. And I think this next question is one that's necessary when we're talking about metadata: how do you exploit data quality results in pattern finding on metadata and its data lineage?

That's a great question, because I wish I had a very simple answer to it. Data quality is a huge thing — governance and provenance, however you term it, risk management, getting the quality of information right. It's really outside the scope of the database. You have to use algorithms: you can use machine learning library algorithms, or people have their own proprietary algorithms, because of course you can mix and match your own algorithms, or anyone's, with things like MLlib and GraphX and SQL and our own API. Now, having discovered something — having gone to all the trouble of finding something and deciding that particular kinds of data or sets of data are of high enough quality to warrant further attention — that's where I think you want to enhance the graph. You want to keep that little sub-graph of the answer set, or the conditions, in the graph so that it can be used by other people. And I think this is becoming increasingly important in a lot of domains these days, particularly where you have self-service: data scientists setting things up and making things available to users, and then users going in and trying to find stuff for themselves. Because if you're left to your own devices and without guidance, you're likely to find a lot of false information. I've seen a very lovely site up on the internet somewhere that correlates anything with anything. You can tell it to find things that correlate with the price of cheese, and it will come up with all sorts of very interesting looking things that correlate statistically but, in all probability, have no meaning whatsoever. If you'd like us to address this in a bit more detail, by the way, then please just ping us at info@objectivity.com, and our people will route it to me or the right person here, and we'll get back with some specifics and tell you what people have actually done using our products — or what we've seen in academia or wherever. So we're very happy to help.

Yeah, sure, absolutely. And if Ann Arlissa wouldn't mind following up and sending me that link, I'll make sure that gets in the follow-up email as well — there are certainly some requests for that. Can you run... well, actually, let me stay on data quality a little bit longer: what about data quality trending results?

I'm not sure I fully understand the question, but... well, this is, yeah, a matter of selective and adaptive filtering, I think, more than anything. I think one good example of this occurs in data and telco networks. You have a particular device that goes down, and it's got 10,000 people connected to it, and that causes a swarm of alarms. That can ripple out — you get a real ripple effect through the network if we don't clamp it down quickly.
And so it's very important to detect thresholds, and if it happens, and it happens frequently, then you want to go back and look at it over a longer period of time and try to figure out what triggers these things and how to prevent them occurring in the future. We saw a good example of that a few years ago, back on the East Coast, when the electricity grid went down. It would have been preventable if there had been some coordinating body able to see what was going on system-wide, but of course there wasn't; individual utilities were dealing with their own point problems. Once you've obtained a good filter, though, it can get applied at the complex event processing end, or you can go back, by data mining using machine learning, and either decide to exclude some kinds of data or just aggregate them. There's one field of interest here that's worth looking into, that's still evolving, and that's the field of granular computing, which is really a mathematics discipline. But I think it's very, very interesting, because it's certainly going to be beneficial in handling some of the data quality problems that we see.

All right, we just have two minutes left, so I'll throw out one more question to you, Leon, and I'll send you the rest of the questions that we didn't have time to get to today for any follow-up you want to do. Just going back to OWL: can you run inferencing programs that operate against an OWL file to determine erroneous data and project missing data?

That's an interesting one. Well, if you know something about the connectedness, it is possible. From my own experience, whenever we have taken data from some of the traditional databases, or from NoSQL sources, you find a lot of ambiguity. So you'll find things that are connected that shouldn't be. This is particularly the case where humans are involved, by the way. People have to fill in a form online and they don't know the answer — it could be a contract code, for instance, you know, if you're a manufacturing company or a sales company — so they make one up, and you get all these false bits of data in the system. So I think this one deserves a longer answer; if you don't mind, I'll take that one and we'll answer it in the follow-up.

Sure, I love it. Leon, thank you so much for this great presentation and for the Q&A, and thanks to all of our attendees for being so engaged in everything we do. We really appreciate all the great questions coming in; again, I'll get anything unanswered over to Leon. And thanks to Objectivity for sponsoring today's webinar. Just a reminder, I will send a follow-up email by end of day Thursday with links to the slides, the recording, and all the additional fabulous information requested throughout the webinar. I hope everyone has a great day. Again, Leon, thank you so much.

Thank you, Shannon. Bye, everyone.