I am very, very excited to introduce our next speaker, a very talented Clojurista and good personal friend, Paula Gearon. Paula is an avid Clojure developer and a valued leader in the Clojure community. She has fearlessly led numerous commercial and open source projects to glory through various perilous technical pursuits. She has also given an impressive number of talks at conferences around the world. Most of her experience focuses on data storage and processing. So without further ado, let's hear from Paula.

Hi, everybody. It's morning for me, afternoon in London. My name's Paula. I work at Cisco Systems. Today I want to talk about Datalog databases. This term first came up when Datomic came on the scene about ten years ago. Datalog had been around for a while before that, though, and I want to compare and contrast what traditional Datalog is with where Datomic has brought us.

So Datomic was first released in 2012. We heard a lot about it around the release from Rich Hickey, Stu Halloway, and a number of the Cognitect people. It has a number of operations associated with it. When we're writing to it, we do this through transactions, which can be based on statements or entities. When reading, there are a few different APIs. There's the Datalog API, which I'm focusing on today; there's also the entity API, and later on they introduced the pull API, which integrates with Datalog. Since then, a number of other databases have come along: DataScript; XTDB, which was previously Crux; Datalevin; Eva; and Asami, which is a project that I work on. All of these have been referred to as using Datalog as their query language. So what does that entail?

For those of you who aren't familiar with these sorts of databases, let's start with a standard relational database, which is based on tables. Here's a very simple table with some people entered into it. This structure has proven to be very useful. People have been using it for something like 40 to 50 years, and many websites and commercial systems are based on data structured this way.

However, what happens when we discover that we need new information, another field? That requires a migration, where we introduce the new fields, but existing rows won't necessarily have data for them. It's possible to introduce new tables instead, but sometimes that doesn't make sense, and it also has performance implications. As new columns come in, we'll find that some of the data stays blank, so we end up with a lot of unused space in the table, and there are major issues with migrating a database from one version to another. We'll also notice that each record in the database takes up a full row, and some of these rows have empty spaces in them now. It's not particularly efficient, although once you go to a particular row, you've got all of that record's data immediately available to you.

A graph database takes the same data and represents it as a series of nodes connected by arcs. You'll see that each of the columns in the table has now become a label for one of these arcs. I've left a couple of the labels off in the bottom corner because it was getting a little difficult to see, but all of the data is still there; we're just viewing it somewhat differently. However, a database doesn't necessarily store it this way.
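(The slides aren't reproduced in this transcript, so here is a hypothetical reconstruction of the kind of table being described. The names are invented; entity 3, with the Twitter handle "Bingley", is the one used in the queries later on, and the blank twitter cells illustrate a later-added column.)

    id | first_name  | last_name | twitter
    ---+-------------+-----------+---------
     1 | Elizabeth   | Bennet    |
     2 | Fitzwilliam | Darcy     |
     3 | Charles     | Bingley   | Bingley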
To store the graph, we take each entity, its arc label, and the node being pointed to, and we move these into a table of their own, which we refer to as triples. Now, most databases will also have extra columns off to the side indicating the time that a triple came in, which transaction it arrived with, that kind of thing. But in general, the data we're interested in is these statements, where you have an entity ID, a label, and then the value it connects to. In Datomic, these are referred to as the entity, the attribute, and the value.

So what's the difference between these two structures? Well, graphs are much more flexible. As we have new data fields that we want to bring in, we can just add a new arc to connect that data onto a particular entity, and we need only do that for the entities we're interested in. We can index everything, because we only have three columns to work with. We can index on each of those three columns, and we can sub-index within them as well: indexing by the first column and then the second, or by the third column followed by the first. There are only a few possibilities here, and it lets us find all the data really rapidly. We can also provide metadata on each individual assertion; as I mentioned earlier, most databases do this. That can include the transaction ID, the time, the provenance of where the data came from, and all sorts of other things. In Asami, I have an individual ID for every statement, and then I can make arbitrary statements about statements.

On the downside, it does use more storage space. This isn't as tight and compact, in general, as tables are, particularly when there aren't any blank fields in a relational table. This approach also has the potential to split entities up. Typically they're not split; we generally aim for data locality. But particularly with indexes that aren't ordered by entity ID, we can see the data for an entity spread across different servers or different parts of the local hard drive. We're also doing statement-at-a-time selection, not entity-at-a-time selection. Consequently, just to see an individual entity, we have to perform a join operation to connect those statements up.

On the other hand, we have our relational tables. They're very rigid in their structure; they're not flexible at all. If we want indexes, we have to select the columns, or the combinations of columns, that we want to index, and that can be extraordinarily expensive. We can have metadata on rows, but that's on the row only. That means if you've updated an individual field within a row, you don't have an easy way to attach metadata to that update, like the time and date it occurred. In general, if it's well packed in its internal form, a table can use a lot less storage space than a graph. That varies with specific implementations, but in general, graphs are going to spread your data out a bit more. Tables also have greater entity locality, because each entity is represented in a single row. But when we're selecting rows, we've got to select the entire thing, and that has performance costs when we're processing a lot of data. Column databases are an attempt to get around some of that, but they introduce their own performance issues as well.
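(Continuing the hypothetical reconstruction from before, entity 3 from that table becomes a set of triples along these lines:)

    entity | attribute   | value
    -------+-------------+-----------
         3 | :first-name | "Charles"
         3 | :last-name  | "Bingley"
         3 | :twitter    | "Bingley"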
So looking back at the relational form, how do we do a simple query, like finding the first and last name of the person with the Twitter handle of Bingley? Well, that WHERE clause generally implies a filter: we go through each row of the table and pick up the rows which match the filter we're looking at. We can index the column so that we can find it immediately; when an index exists, the query planner will identify that and change the filter into an index lookup automatically for you. However, building indexes takes a lot of time and a lot of space, and in an instance like this, where not everybody has a value in that particular column, an index may not make sense or work very well for you at all.

Let's try the same thing in Datomic now. We're still looking for the first and last name, but now we use a series of graph patterns: we find the person who has that Twitter label, and then we ask for their first and last name. Our data is already indexed, and there's an appropriate index to look in. Here's a slice out of the statements; we're looking at the entity IDs, the attributes, and the names. Because our data is indexed, we'll be able to find the Twitter handle of Bingley quite easily, probably through a different index, and then, using that ID of 3, we can look up the person's first name and their last name. That's done with a join, where we can simply look them up by the entity ID. So this is what Datalog looks like. But is it really Datalog?

Let's look at a different graph language: in this case, SPARQL. SPARQL is a graph query language developed for the RDF system. The structure of this query is basically identical. We're still selecting the first and last name. Our WHERE clause still has the same variables and the same attribute identifiers. We're still referring to Bingley, and the first-name and last-name variables are still there. Nothing has changed at all; it's a slight syntactic variation and nothing else. Now, for those who are familiar with SPARQL, the colon prefixes for twitter, first name, and last name would generally require a default prefix to be defined, but that's just a syntactic convenience. The structure of the query is absolutely identical to what Datomic's queries look like.

So what does Datalog really look like? Well, it's defined as Horn clauses. These are implication statements. The first one here indicates that a statement is always true; this is used for representing data that we want to assert as true, and it maps onto where we do assertions of statements into our system. There are also rules, where a series of predicates, taken together, result in a statement being true. This isn't the syntax used when Datalog appears in a database, though. Instead, we describe it this way, where data is a predicate with values associated with it. Predicates need not have only two arguments, they can take more, but in general they'll have only two, or one. And then we have rules, where you'll note that the thing being implied appears first, then there's a colon-dash, and then a comma-separated list of predicates.

So how does this work in practice? Well, we can define data using particular predicates. The predicate here is parent, and the values that appear in it are things called atoms. We see that the atom xerxes has parent darius, and the atom darius has parent hystaspes. Rules appear similarly.
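(As a sketch of what's being described, here are the facts, the two rules, and a query in classic Datalog notation. The atom names are my best reading of the audio; the rules and the query are walked through next.)

    parent(xerxes, darius).
    parent(darius, hystaspes).

    ancestor(X, Y) :- parent(X, Y).
    ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).

    ?- ancestor(xerxes, X).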
In this case, we have the predicate ancestor being defined in two different ways. If there's a parent relationship between X and Y, there's an ancestor relationship between X and Y. But also, if there's a parent from X to Y and an ancestor from Y to Z, then there's an ancestor from X to Z. Now, this is a recursive rule, because that second ancestor clause refers to the rule itself. Recursion is a very important attribute of Datalog in general. There may be many different rules, many different predicates being created based on other statements, and there can be loops and all sorts of things in there.

So that's how we assert data, and that's how we define rules on the data. What about querying it? Well, for that we have the question-mark-dash, and we ask for a predicate with variables to be filled in. In the case of parent(X, Y), it finds all statements that match. For ancestor, we can ask for everything from xerxes to X, and that finds the full list of ancestors.

So what are the operational features of Datalog? Well, there's order-free evaluation: it doesn't matter which order the rules came in, or whether statements were interspersed between them; you should get the same result. Depending on the way evaluation is determined, there are a couple of cases where evaluation under one algorithm can give you slightly different data than evaluation under another, but in general it's going to be the same; in pure Datalog, it will always be the same. Then we have recursion, which we just saw. There's a guarantee of termination. And Datalog itself should not have any negation in it. Now, these requirements and their implications were heavily researched through the '80s and compared extensively to relational algebra, or SQL. Since then, relational databases have adopted recursion, though I've rarely ever seen it used in practice. When this was being researched, back in '89, it was compared to relational databases; as I said, most relational systems will now include recursion in some way, though standard SQL can make it very difficult to access. Now, I mentioned there's no negation here. Datomic does have it. It's actually an extremely common extension, and it was referred to numerous times from the '80s onwards. It's this negation which can lead to different sorts of data coming out depending on the mechanism for processing.

But what is that mechanism for processing? Well, the most common one is top-down. In this case, you start at your query goal and you merge that against the rules. If you've got no more possibilities for merging, then you exit. Then you find those statements where the items do fulfill the expressions, you collect them, and you keep going back to the beginning again: you merge against the rules once more, and if there are no more possibilities, you exit. This loops over and over until you've found all of the data. Like Prolog, this is usually done with a depth-first search, though you can also do breadth-first, and there are numerous other techniques used for doing this.

Another approach is the bottom-up evaluation strategy. Here you start with all of your provided data and you evaluate your rules to create new statements, which then get asserted back into the database. You'll note that the output of a rule is always a predicate.
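(As an illustration of that idea, here's a minimal sketch of naive bottom-up evaluation in Clojure. This is my own reconstruction, not code from the talk: facts are assumed to be a set of [predicate arg1 arg2] tuples, and the two ancestor rules are hard-coded.)

    ;; The provided data: parent facts as tuples.
    (def facts
      #{[:parent :xerxes :darius]
        [:parent :darius :hystaspes]})

    ;; One round of applying the rules:
    ;;   ancestor(X,Y) :- parent(X,Y).
    ;;   ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z).
    (defn apply-rules [db]
      (into db
            (concat
             (for [[p x y] db :when (= p :parent)]
               [:ancestor x y])
             (for [[p x y] db :when (= p :parent)
                   [a y2 z] db :when (and (= a :ancestor) (= y2 y))]
               [:ancestor x z]))))

    ;; Repeat until a fixed point: a round where nothing new appears.
    (defn saturate [db]
      (let [db2 (apply-rules db)]
        (if (= db2 db) db2 (recur db2))))

    (saturate facts)
    ;; => also contains [:ancestor :xerxes :darius],
    ;;    [:ancestor :darius :hystaspes] and [:ancestor :xerxes :hystaspes]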
That output looks just like the input statements, so we can assert the new statements directly into storage, and we keep repeating this until we reach a fixed point. Now, this takes advantage of database operations, which are very fast. We only need to evaluate it a single time for multiple queries, and if there's an update to the database, we can note the delta and work on just that delta rather than re-evaluating all the rules. It can use a lot of storage, depending on how much you're trying to infer, and it may evaluate data that isn't needed; even so, it can be faster anyway. Recently, I had a case with some pathological data that was taking 90 seconds to evaluate top-down. I changed it around to bottom-up, and that reduced the time to just over 100 milliseconds.

Okay, so the sorts of terms we see in Datalog are the rules and the data we were referring to earlier. But there are actually two types of data: the data that came in, which we defined ourselves, and the ancestor data that comes out through the evaluation of rules. The data that was put into the database is referred to as the extensional data, while the data being inferred is the intensional data. In bottom-up evaluation systems, the intensional data gets mixed in with the extensional and effectively becomes extensional. It's the top-down systems which have a very clear delineation between the two.

So Datalog is often compared to predicate logic, because it looks like it, even though it isn't predicate logic. But if we do look at it that way, we can consider each edge in the graph, from a node through a predicate to another node, as a two-argument predicate. In something like Datomic, we're more familiar with labels like an attribute on an entity with a value, so we can map Datomic's approach onto what we're looking at here in Datalog.

So let's put in some extensional data. We'll declare our atoms of xerxes, darius, and hystaspes to start with. First of all, I want to declare them as entities and give them each an ident; that's going to make it a little easier to talk about them. Then, when I want to put in the parent connections, I can add xerxes-to-darius and refer to the entities directly through those ident labels.

But what about the rules? We've got those two ancestor rules, and we can define them in Datomic using rules. You'll see the two clauses of the rules below: the first line is the name, and the remaining lines are the definition of the rule. These look like queries, although you'll see that they can also refer to the other rules by name.

We can start querying the database now. If you want that simple query where we're asking for the parent connection from X to Y, that just becomes a pattern lookup in a find query in Datomic. Unfortunately, we'll get internal identifiers back, so we'll want to add ident statements to the query so that we can get the labels out. But what about using those rules? That becomes much more complex. Here we take our rules as a vector and add them as a parameter at the bottom of the query. You'll note that the :in clause is now expanded to include the % symbol, which picks up the rules. And the where clause now isn't referring to graph patterns like we did before; it's referring to the rule that we're looking for. Graph patterns can also appear in there as well.
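(Here's a sketch of what those steps might look like in Datomic. The :parent attribute and its schema are assumptions of mine, since the transcript doesn't include the slides.)

    (require '[datomic.api :as d])

    ;; Declare the atoms as entities with idents.
    ;; (Schema for the hypothetical :parent attribute is elided.)
    [{:db/ident :xerxes}
     {:db/ident :darius}
     {:db/ident :hystaspes}]

    ;; Assert the parent connections, referring to entities by ident.
    [[:db/add :xerxes :parent :darius]
     [:db/add :darius :parent :hystaspes]]

    ;; The two ancestor rules as a Datomic rules vector:
    ;; the name comes first, then the clauses, which may name other rules.
    (def rules
      '[[(ancestor ?x ?y) [?x :parent ?y]]
        [(ancestor ?x ?z) [?x :parent ?y] (ancestor ?y ?z)]])

    ;; The simple pattern lookup, with ident statements to get labels out.
    ;; (Assuming db is a database value, e.g. from (d/db conn).)
    (d/q '[:find ?xname ?yname
           :where [?x :parent ?y]
                  [?x :db/ident ?xname]
                  [?y :db/ident ?yname]]
         db)

    ;; Using the rules: % in the :in clause picks up the rules parameter.
    (d/q '[:find ?name
           :in $ %
           :where (ancestor :xerxes ?a)
                  [?a :db/ident ?name]]
         db rules)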
But now you'll see that the request for a rule looks a little different to the graph patterns, whereas in Datalog we used the same predicate form for everything. Again, we're going to get internal identifiers out for X, so we can just update the query to ask for the ident.

So, based on this structure, is Datomic Datalog? It doesn't match the syntax, but it does have an extensional database, and an intensional database via the rules; it does have recursion; and it does guarantee termination. It also has extensions which Datalog generally allows, like negation. So semantically we definitely meet the criteria of Datalog, but syntactically we don't.

So is Datomic Datalog? I believe it's quite justified in calling itself Datalog, because it provides all of the functionality defined in Datalog and much more. We have aggregates, we've got negation, there's a whole lot of functionality that extends Datalog, but we meet all of the semantic requirements that Datalog has. So when Rich came out and called his database a Datalog system, he was absolutely correct: it meets all of the criteria. However, the query language for Datomic is a graph query language. It is not the Datalog language; that syntax of predicates and Horn clauses simply does not exist in Datomic at all, and it is part of what defines what Datalog is. So when we're talking about a query language, Datomic's query language isn't something that I would refer to as Datalog, and I'll often make this statement to anyone who wants to listen to me on it. If you want to argue with me about that, you can find me on the Clojurians Slack in the #asami channel. I go by the handle quoll, and that's on Twitter as well.