Let's get started. We're happy today to have our friends from Dolt come and give a talk about the new database system they've been building for a while. They show up on Hacker News quite a bit for being the GitHub for data. So we have Oscar Batori and Zach Musgrave, who work on Dolt, to come and give a talk about how this works. The way we'll do this is: if you have any questions, unmute yourself and just interrupt and ask as we go along. But be sure to say who you are and where you're coming from, so everyone knows who everyone is. Okay? So with that, Oscar, glad you're here. Go for it.

Well, thank you, I appreciate it. First of all, thank you for having us. We're excited about the work we've been doing and excited to share it with an esteemed group of database enthusiasts and professional researchers. There we go. Okay. So I'm going to give a two-part talk today: our motivations for building Dolt and the high-level design goals given those motivations, and then we'll do a demo of the product. Then we'll hand it over to Zach, and he's going to talk about Dolt's architecture and the core data structures, in particular a novel data structure called Prolly trees, which we think are pretty interesting and which drove a lot of the design decisions.

Quick background on who we are. I am, in fact, not a software developer in my capacity at Liquidata; I'm a business development person. But I previously worked in engineering roles at BlackRock and BlueMountain Capital, in quantitative and systematic strategies. I'll present the motivation and the design goals and give a demo, and then I'll hand it over to Zach, who is a senior software engineer, previously at Google Cloud and Amazon. He'll present the architecture.

So with that, we'll jump right in. I think it's helpful, when you're discussing motivation, to get some of the backstory in. Liquidata is the company that builds Dolt. Dolt is an open source database under the Apache 2.0 license, so anyone can use it. And DoltHub is a collaboration layer built on top of Dolt, for hosting Dolt databases and sharing data in the Dolt format. This terminology of a format versus a database is a little confusing, but it'll become clearer as we get into the talk and then the demo.

The company was founded in mid-2018, and the corporate mission statement was to bring liquidity to the data market: to actually make it easy to transfer data. We started out speccing marketplace-style features for data transfer, and we realized that data transfer itself was the actual friction point. The setup, the technical infrastructure, was the limiting factor, and building slick interfaces for exchanging legal information, signing documents, payments, and all the other stuff that happens around that technical transaction could only take you so far. If you wanted to really improve things, you needed to actually change the way data moves between organizations.

Out of that realization came Dolt. Dolt is a database that incorporates concepts from Git and from relational databases. The motivations for the Git piece are rooted in this idea of reducing friction with regards to moving data around.
So, if you recall, prior to Git and GitHub it was rather more difficult to obtain source code than running a single command, and GitHub changed that meaningfully: it essentially eliminated the friction involved in obtaining code. We wanted to draw inspiration from that technical story. We also recognized that SQL is a sort of common language; everyone understands it, and by everyone I don't just mean people, I also mean computers. There's a vast ecosystem for managing relational databases and expressing data in terms of SQL. So we wanted to bring these two concepts together to build a new database with a new set of goals.

I think the main distinction is that relational databases, in their original sense as OLTP databases, optimize for a blend of reads and writes as application backing stores. And they've come to be used for a lot of things that are not application backing stores, that are more like sources and sinks in ETL workflows. Dolt elevates a separate set of considerations around that.

Let me give you an example, drawn from my professional life. I used to work in systematic trading, which means we did model-based investing. You create a model and you map it into some tradable securities, and in order to do that you often have to map between different ID spaces. Third-party vendors will give you information about the world, and then you map that information into the tradable securities that you've agreed with your clients you're allowed to trade. This presents a number of complications. As you can see in the schematic, you're pulling in data from third-party vendors, you're ETLing it into your own databases, you're doing a bunch of transformations and overrides, and then you're using that data to drive your trading process. It's small data; in my experience it was like 80,000 rows. But it's extremely valuable. Mistakes in mappings can lead to incorrect trades, and that can translate into financial liability. So it requires a ton of scrutiny before anything happens in production.

Now, there are multiple sources of error here: an error can come from your upstream vendor, there can be an error in your SQL code, there can be an error in your transformations and your overrides. If you build this on top of a relational database with its last-write-wins architecture, you have to build all the application code around that, and if you want to expose telemetry into what went wrong and where, you have to think that through a priori. What ends up happening is you try to build something clever, it doesn't quite work, and then you have to have a human operator intervene anyway. At least, that's what happened in my experience. I slaved away at this problem for two years; it was a consistent source of alarms in trading. The feature we lacked was a database that generated a data commit graph. So let's move forward and motivate this in the context of using Dolt. The way you would use Dolt here is that every write would be a commit, right? And your data flow would generate an implicit commit graph.
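As a minimal sketch of what that "every write is a commit" flow might look like, here is a Go ETL step driving the dolt CLI, assuming a hypothetical vendor_mappings table; the add and commit verbs are the Git-style CLI commands demoed later in this talk, while dolt table import -u is my assumption for a bulk load path:

```go
package main

import (
	"fmt"
	"os/exec"
)

// run shells out to the dolt CLI in the current repo directory.
func run(args ...string) error {
	out, err := exec.Command("dolt", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("dolt %v: %v\n%s", args, err, out)
	}
	return nil
}

func main() {
	// Load today's vendor drop into a hypothetical table. Any write
	// path (e.g. SQL) would work the same way.
	if err := run("table", "import", "-u", "vendor_mappings", "drop.csv"); err != nil {
		panic(err)
	}
	// Stage and commit, so this ETL step becomes a node in the commit graph.
	if err := run("add", "vendor_mappings"); err != nil {
		panic(err)
	}
	if err := run("commit", "-m", "vendor drop: daily mapping update"); err != nil {
		panic(err)
	}
}
```

Run once per vendor drop, the load history becomes a commit graph you can diff and roll back, with no application code written for the versioning itself.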
And then if any stage in your process went wrong, you would be able to immediately back into where that happened. So if the vendor introduced an erroneous result, you'd be able to see it in the diffs when you pulled in the updated value from the vendor, and you could roll back, rerun your process, generate a new commit and a new set of mappings, and then proceed with whatever downstream processes need to go forth. The really important takeaway is that you don't need to write application code to achieve this. The database has native features that elevate the consideration of generating a set of commits associated with a set of writes.

So that's an example of a problem we believe is underserved by the existing set of database products, and why we thought it was worth building a new one. It's definitely not the only problem, but it's an example of something where a database with Dolt's core features can add value.

Now, backing up from the example to the high-level design goals: I think there are three worth highlighting, because they drive most of the technical aspects of the talk. The thing we wanted most was a fully portable format. git clone and git pull are really inspiring; they've become the standard for distributing software. We wanted to push that a step further. If you git clone some code, you still have to figure out how to run it, and that's fine; software is generally like that. We wanted a drop-in production database. By that we mean we wanted users to be able to clone data from DoltHub or some other remote, run one command, and have a SQL server running. I don't know if you came in early, but Andy and Zach were discussing the nuances of the SQL server we run. The idea is that anyone who finds a dataset in Dolt format can clone it, immediately run it, and it will just wire into their infrastructure. The schematic illustrates that: you're using dolt pull to pull in updates, you have a Dolt database running on your own premises, and then you're able to either sync it into your existing database infrastructure or just wire into it via the standard database connectivity libraries we know and love.

I think this is much more radical than it first seems. If you're talking about all the infrastructure that goes into maintaining the datasets that companies rely on to drive business processes, you're talking about ripping out the ETL stuff, the schemas, everything, and just cloning a copy of something into your infrastructure and subscribing to updates. That's a radical change in distribution, coming back to our mission statement for the company.

The other thing we wanted to bake in is tools for disciplined collaboration. If you look at something like Uber and Lyft, they both use Envoy. Envoy was created by Lyft, and I think that would have been unthinkable ten years ago. The reason is that we now have tools that allow them to outsource a piece of their infrastructure to something that the community of Envoy contributors collectively owns.
It's a cost-benefit analysis where these two companies, even though they're mortal enemies, recognize that neither of them is competing on how good their proxy is, and so they're able to use these tools to drop in a piece of infrastructure that solves a problem they both have. We're still in the dark ages of data collaboration, with companies often paying for data that could easily be maintained collectively with the right tools. We had the CEO of a company email us telling us his entire industry is paying data brokers for lists of entities; they all have a shared interest in maintaining such a list, but they lack the means to do so. Dolt aims to empower companies, entities that are interested in datasets, to take their destinies into their own hands and stop paying third parties. And the distribution model we highlighted, this drop-in infrastructure, really speaks to this: the ability to just drop that dataset in as a resource in your local ecosystem.

The third thing I want to highlight is data management primitives. This goes back to the example of mapping tables. Mapping tables are certainly something you would want to manage. When I worked at BlackRock, someone once updated a mapping table, and it caused BlackRock to have to write an 18-million-dollar check, because we traded the wrong thing.

What's an example of a mapping table, like entity resolution stuff?

That's a really good question; sorry, that was maybe a little abstract. A mapping table would be like: one data vendor identifies securities in one ID space, another vendor in another ID space, and then you trade in a third ID space. An ID space is just a set of IDs that identify some objects in the world, in this case securities, and a mapping table traverses the different ID spaces.

It's entity resolution, yeah.

Okay, yeah. We never referred to it that way, but that certainly sounds correct. If you're actually in the trenches you say "mapping tables"; "entity resolution" is lofty ivory-tower language. I very much spent my career in the trenches, but thank you for admitting me to the ivory tower this afternoon.

So you can think of these entity resolution tables as things that tend to be small and tend to be maintained by humans. Client mappings are another one, where you map database IDs to clients so that you render the correct thing in the correct place for the correct client; you don't want to render the wrong one. So there's generally data that configures code. Code often exposes parameters, and these parameters end up determining a lot of the system's behavior, and they should be treated as such. Dolt is a tool that exposes multiple interfaces for analysts and domain experts to manage this stuff without having to get into the weeds. We've got a Python API, we're going to do a Google Sheets integration, et cetera. We're going to offer a wide variety of places for non-technical people to manage data, while letting them make use of robust version control tools so they can do this in a safe manner.
So maybe this comes later in the slides, but in your example here, you have a bunch of people making changes to the same dataset, and they push to Dolt. Are you expecting them to be writing to the same branch, to use Git semantics? How do you handle conflicts? What if somebody drops a column? It's one thing for code, where a human can actually sit and decipher the conflicts and make sense of them. But here, if I have a conflict on a billion records, you're kind of stuck, right?

Can I take this, Oscar?

Yeah, go ahead, please.

So this is a great question, and I think it really speaks to the value proposition that Dolt has as a database product. You're exactly right: how is a human being supposed to make sense of millions of changed rows? Our answer is: the same way you make sense of any data in a database, you run queries. In addition to the sync and pull and merge and branch capabilities that Oscar's been talking about, Dolt actually makes it possible to query the diff of any two commits in the commit graph. So if I'm doing a merge and I have conflicts and I want to understand something about the rows that are in conflict, I can write a SQL SELECT statement that selects only from the rows of the two table revisions that are in conflict. Any query that I can run on the database as a whole, I can also run just on the subset of data that's changed, or just on the subset that's in conflict, et cetera. So whatever analytical tools you normally apply with SQL, you can apply those same tools to your diff and merge workflow.

Actually, Zach highlights a really exciting feature, which maybe I should have discussed when I was talking about the mapping table example: this idea of a data commit graph. When you're using a database that has an underlying commit graph and you're associating each write with a commit, the differences between those commits are themselves data. If you want to build a workflow around that, Dolt exposes that data to you. You can build on top of it. You can build dashboards that highlight differences, just by writing SQL. There's no need to endlessly write code to highlight exceptions when your data doesn't meet certain criteria; you can literally just select straight from the commit graph, and we're committed, as a product ideal, to exposing as much of the commit graph in SQL as possible. So that sequence of updates is itself data, and it is fully exposed to the user.

I'm not questioning whether everything should be SQL; I agree, it should be SQL. I question whether the end user is capable of making these kinds of decisions. A lot of people don't even write raw SQL; they go through Tableau, they go through MicroStrategy or whatever. So I feel like you would always need a DBA to be the Dolt manager. You're running this now, people are using it, so you tell me whether the average data analyst can deal with these diffs.

I think there's an important distinction to make, right? In the case of the example I'm giving here,
a mapping table is inherently a small thing, right? It's a thing humans maintain to configure other systems. That is the kind of thing where a human can look at the diff, and when I do the demo I'll show you something where you can very realistically look at the diff. And then there are other cases where you clearly would have to look at the diff in some programmatic way; you'd have to write a query that summarizes it. For example, if I changed all my portfolio weights in a portfolio table and the weights added up to more than one, then I know I'm in violation of some constraint, and I can have a SQL query that expresses that constraint at two different points in the commit graph.

So I guess my comment, and it's fine if this is the answer, is that in Git, if it's a text file conflict, or even track changes in Word, your average not-especially-technical person can probably handle that. You need to be a bit more sophisticated to handle a diff or a conflict in Git. And it sounds to me like with Dolt, and I'm not saying this is a negative thing, it's just the way it is, you have to be even more sophisticated to handle the conflict.

Yeah. I think an important thing to highlight is that most database interactions are automated, right?

Right, yeah.

So a lot of these interactions will end up becoming scripted. It won't be the case that someone is reviewing a billion-line diff; it will be that someone has thought about a data process they want to engineer, and they've said: here are the criteria under which I consider this a valid step, so let me script that. And since Dolt exposes the commit graph via SQL, ergo, I can express my criteria in a query and build behaviors around that.

Yes. That's a really exciting thing.

Okay, yeah. That makes sense; I'm convinced. Again, most people don't get raw access to the database, but as you said, if it's part of the pipeline, you automate it. Okay, proceed. This is good.

Actually, this is a funny anecdote: in finance, most people have access to the production database, because the firms lack the technical discipline to actually do the work to expose proper hooks. So people end up having to update values manually all the time.
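Here is a hedged sketch of the scripted constraint check Oscar just described, checking the "weights sum to one" rule at two points in the commit graph. It assumes a hypothetical portfolio table and Dolt's AS OF syntax for reading a table as of a commit, which is my assumption rather than something shown in the talk; it connects over the MySQL wire protocol Zach describes later:

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // dolt sql-server speaks the MySQL protocol
)

// weightSum evaluates the constraint query against the table as of the
// given commit ref; the AS OF clause is assumed Dolt syntax.
func weightSum(db *sql.DB, ref string) (float64, error) {
	var sum float64
	q := fmt.Sprintf("SELECT SUM(weight) FROM portfolio AS OF '%s'", ref)
	err := db.QueryRow(q).Scan(&sum)
	return sum, err
}

func main() {
	// Hypothetical DSN for a local dolt sql-server.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:3306)/trading")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	for _, ref := range []string{"HEAD^", "HEAD"} {
		sum, err := weightSum(db, ref)
		if err != nil {
			panic(err)
		}
		if sum > 1.0 {
			fmt.Printf("constraint violated at %s: weights sum to %v\n", ref, sum)
		}
	}
}
```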
So I'm going to do some screen sharing now. Okay. You can see my terminal, right? Yes. So I'm going to dolt clone something, which I actually did earlier. For context, and I should probably show you this online first: I cloned some data. We talked about idolizing the git clone command as a distribution story. When I first started working at Liquidata, I thought the FiveThirtyEight team had some cool data, so I went and uploaded a bunch of their data to DoltHub, and then I set up some jobs that pull the data periodically. One interesting aspect of FiveThirtyEight is that they're posting data every hour, every day, every week, whatever the update cadence of that dataset is, and they'll post a dataset with a before and after. So they're clearly interested in the changes in the values, but they don't post a time series; they just post a snapshot. So if you want to capture this data over time, you either have to write a program that transforms it into a time series, or you can just use Dolt and capture the changes.

With that, I took their polling data, which I find kind of interesting, politics, whatever, and I uploaded it to DoltHub. So now we've run the clone command, and we can go into the polls directory and run dolt sql. There we go. I've cloned this repo, I've run dolt sql, and now I've got a SQL shell and some tables. If I'm interested in presidential approval ratings, I know Marist College is the gold standard for polling, according to FiveThirtyEight. So I can dump in and be like: get me all the presidential approval polls where the pollster equals Marist College. Now I have output that's not very useful; let's do DESCRIBE on presidential approval polls. So now we've got the schema. This is just a SQL database like any other, but I think the exciting thing is that we ran one command and we acquired it, and now we can just start running SQL on it. I've read a bunch of tutorials about data analysis that advise you to set up a local SQL database, and fully half of the tutorial is importing CSVs; this just lands on your desktop and it's running.

Now, if we wanted to do an update, we can write an UPDATE in SQL: update presidential approval polls, set... Suppose we've decided that Marist College is now bad. It's no longer the gold standard; it's in fact an F, because we didn't like the polls. That sounds familiar. We run that, and we've now updated some rows. We can step out of the SQL shell and run dolt diff, and we're going to hit the problem Andy raised; by widening this out you'll be able to see a bit more. You can see the diff is cell-wise. It turns out that tables are a little easier to diff than files, because they're more structured, so we can do very accurate diffs. Here we've clearly just updated a bunch of rows: all the polls from Marist now have their FiveThirtyEight grade set to F.

Oh, sorry. With that one, I feel like you'd want to show just the primary key columns and not the whole thing. It's nice that it's color-coded, but that's a shitload of data.

That's a very good comment, which we'll consider. Yeah. So if you cd... I actually killed my shell, sorry. This is polls. I think it was just polls. Oh, there we go. So if you look at the status now, we've updated the presidential approval ratings, and we can add it, we can commit it, "Updated Marist College", and we can push it back to DoltHub. And then we'll be able to use DoltHub to view the diffs in a friendly UI.

Is your diff block-based, or are you actually sending column-level changes?

The diff that Oscar is sending up with a push command is all block-based. But once it hits some interface, either the SQL shell or DoltHub, that's when we impose the column and row semantics on it.
But with everything you've ingested, your monthly AWS fees are going to be ridiculous.

No, it's not as bad as you might think. I'll talk a little bit about how we do structural sharing to avoid ridiculous AWS fees in a minute.

But at least on ingestion, you don't need the whole row. You just need an internal identifier, a row ID, plus the column that you modified, and you ship that over the network. That's going to be a fraction of what you just uploaded.

That's a potential optimization, but due to how the current implementation works, it's not what happens today. The smallest unit of diff that we can send, or actually talk about, in Dolt right now is a row. So even if you have only one column changed, you're still going to revision the entire row, and I'll get to why that is in the latter half of the talk.

Okay. So you can now see, though this is a pretty wide table, which makes it quite hard to view in the terminal, that on DoltHub we really only highlighted the changed rows, and this is not that many rows of data. You can very realistically imagine a human maintaining this, wanting to change 72 rows in a spreadsheet. And if the spreadsheet contains information that configures a system that could blow them up financially, I think the value of basic version control becomes pretty evident. A place where this shines is machine learning models, where people are using this kind of data in production to configure a system's behavior, and I think having robust versioning on it is important.

I like the idea of adding --skinny and only showing the primary keys.

We'll put that on the future roadmap. That's an easy one.

Yeah, there we go. Andy, there actually is a pretty ergonomic way to do this that we have under development but haven't released yet, where basically you write a SQL query, for example select star, or select only the columns in this table that you care about, and you give it two commit IDs, two commit hashes, and it'll run the query and only show you the diffs between the results at those two commits. There's also a bunch of fun system tables where you can basically run the same thing as dolt diff from a SQL shell: there's a system table called dolt_diff followed by the name of the table, for every logical table, that you can select from directly to really slice and dice the diffs however you want. But that's kind of an advanced feature, so we didn't want to go into it too much.

That's UI. The --skinny flag, that's easy, you guys can add that. I'm more concerned about you giving more money to Amazon. So whatever way you can send less data, that's the way to do it.

Exactly, yeah.
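As a sketch of that dolt_diff system table in use, here is a small Go program that reads the row-level diff of the demo's table between two commits. The from_/to_ column naming, the diff_type column, and filtering by ref names rather than concrete commit hashes are my assumptions about the table's layout:

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:3306)/polls")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Substitute real commit hashes if ref names aren't accepted here.
	rows, err := db.Query(`
		SELECT to_pollster, from_fte_grade, to_fte_grade
		FROM dolt_diff_presidential_approval_polls
		WHERE from_commit = 'HEAD^' AND to_commit = 'HEAD'
		  AND diff_type = 'modified'`)
	if err != nil {
		panic(err)
	}
	defer rows.Close()

	for rows.Next() {
		var pollster, from, to string
		if err := rows.Scan(&pollster, &from, &to); err != nil {
			panic(err)
		}
		fmt.Printf("%s: grade %s -> %s\n", pollster, from, to)
	}
}
```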
So, I've eaten up a lot of time, and I think for this audience the next part might actually be the more interesting content. Now that you know why we built Dolt and the problems we're trying to solve, I will yield my time to Zach.

Yeah, I mean, this is totally awesome, because this is not how people normally think about relational data. So I think this is helpful.

Well, thank you very much, Oscar. We have about half an hour. Yeah, I think we're doing okay here. Let's see. Okay, looks like my slide control works. Everyone can see my slides, I assume. If you came late and missed the announcement: please don't hold questions till the end. Just unmute your mic and speak up, and I'll try to address questions as they come up. If it's something very involved I'll punt it to the end of the talk, but please ask questions as they occur to you.

So, my name is Zach Musgrave. I'm a software engineer at Liquidata, the company that builds Dolt. I'm here to talk about the technical decisions we made when building this product and the trade-offs they lead to, and to discuss some interesting data structures that might appeal to people who are studying databases, like yourselves.

When we think about the architecture of the Dolt database product, I think of it as composed of three layers. Actually, let's go bottom to top. At the very base layer you've got a block store: a bunch of code that knows how to write and read blocks from disk in a particular way that is interesting for our use case. On top of that we define a set of libraries and functions that build table, object, and indexing semantics on top of that basic block store. And at the very top we have a SQL engine, which is capable of inspecting those semantic objects and dipping down into the byte storage to retrieve rows, come up with query plans, execute them, et cetera.

When we talk about the interfaces to Dolt, we also think of them in terms of these layers. So now top down: if you're interacting with Dolt from a SQL perspective, you're going to be talking to it in one of two ways. You can do what Oscar just did in the demo and type dolt sql, which brings you up into an interactive SQL shell that ships with the product, where you can run any standard SQL command and it ought to work. If you don't want to do that, if you want to talk to your data through some other toolchain, any of the hundreds of tools that talk SQL that have been invented since the 70s, then you type dolt sql-server. That starts up a MySQL-compatible SQL server that you can connect to with the MySQL client, with MySQL Workbench, with anything that speaks JDBC. Any program that speaks the MySQL binary protocol can connect to this thing, and it'll pretend that it's an application datastore like any other you've ever connected to. So it's literally compatible with any product that talks to your database at all.

If you're talking to one of the two bottom layers, if you're making commits against the Dolt repository itself, or changing the schema, or trying to determine the diff between two commits, then you're going to be using one of the bottom two interfaces, and this is where the Dolt CLI and the DoltHub API come into play. DoltHub, which Oscar mentioned earlier, is the site we built where you can share Dolt repositories, and it hooks into APIs we built that expose similar functionality to the Dolt CLI.
So this is where, if you're dealing with repositories as versioned objects that you might want to collaborate with others on, you're going to be using these commands. Any time I clone a repository, create a new branch, merge someone else's changes into mine, resolve conflicts, et cetera, I'm going to use the Dolt CLI. We built this to be identical to Git, so if you're familiar with Git, then you already know all the commands you need to run Dolt; it's pretty much a one-to-one mapping. I should also note that one of the strategic goals for the product is to make available in the SQL shell any command that you can run from the Dolt CLI. So creating a branch, cloning, forking, pushing, pulling: all those things will have SQL-native equivalents, so that you aren't tied to the command line. That's a goal of ours going forward.

As for the technology we built these on, going from the bottom up again: the byte storage and versioning use some pretty interesting data structures that we're going to talk about in depth. It uses a Merkle DAG, similar to how Git stores its data, built with a novel data structure called a Prolly tree. These are interesting because they get us two things. One is content addressing: if we have a particular piece of table information, we can compare whether it's identical to another piece of table information just by examining its hash, which is cryptographically verifiable. And second, it allows us to make small changes without blowing up the storage cost as we store all the revisions of things; again, we'll get into exactly why that's the case very shortly. On top of that, when we go to define the semantic objects that the database interacts with, the tables, their schemas, indexes, foreign keys, et cetera, these are Go libraries and routines that map onto this byte storage and interact with the block-layer storage interfaces. Everything is basically maps of tuples and structs, and we'll get into that. Then at the very top layer we have the SQL query engine itself, which for most intents and purposes is totally separable from the product. Let's jump in there and get that part out of the way.

I'm sure Andy is going to be disappointed to hear this, but the SQL engine itself is not the primary focus of this talk; it is kind of separable. The way we provide a SQL interface to Dolt is through a general-purpose SQL query engine called go-mysql-server. This was started by a now-defunct organization called source{d}, located in Spain; they've gone out of business. They built it in order to write queries against Git repos. So they built it as a general-purpose SQL query engine where, to integrate with it, you implement a bunch of Go interfaces: a SQL database tells the engine which tables it has, a SQL table gives the engine a way to access the rows in your table, and there's a bunch of other interfaces you can implement to unlock additional functionality. In principle this works with any data source. We've implemented these interfaces in terms of Dolt, but there's nothing Dolt-specific about the engine; at this point it's a general query execution engine.
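To make that concrete, here is a simplified paraphrase of that integration surface. These are illustrative interfaces in the spirit of go-mysql-server, not the library's exact signatures:

```go
package sketch

// Row is one row of results; the engine works in terms of these.
type Row []interface{}

type Column struct {
	Name     string
	Type     string // simplified; the real engine has rich SQL types
	Nullable bool
}

// Database tells the engine which tables exist.
type Database interface {
	Name() string
	Table(name string) (Table, bool)
}

// Table gives the engine a schema and a way to iterate rows. Backends
// implement further optional interfaces to unlock indexes, inserts,
// updates, and so on.
type Table interface {
	Name() string
	Schema() []Column
	RowIter() (RowIter, error)
}

// RowIter streams rows to the engine, returning io.EOF when exhausted.
type RowIter interface {
	Next() (Row, error)
	Close() error
}
```

Anything that can implement Database and Table, whether Dolt's Prolly-tree storage or a pile of CSVs, gets the parser, analyzer, and server for free.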
It provides a parser that tries to parse the MySQL dialect completely accurately, a query analyzer to come up with a query plan for executing the query, an optimizer to make it faster, and it also provides a server. This is what powers both the Dolt SQL shell as well as dolt sql-server, which lets anything that talks the MySQL binary protocol connect to this thing and talk to it out of the box.

We test this against sqllogictest. sqllogictest is a test suite that was developed by SQLite when they were first getting started. These are the people who hit upon the idea of just writing millions and millions of templated queries that they then run against a known-good database product, like MySQL or Postgres, and just comparing the results; they're basically offloading their testing burden to those engines. We did the same thing. When we got this engine, it had about 20% correctness: it only passed about 20% of the sqllogictest suite. We've since raised that number to about 92%, and we continue to get small gains. Of course, there's a long way to go; as everyone knows, the last 90% of the work is in the last 10% of the correctness. So we're getting there.

Two quick things. One: they're not "the SQLite people", it's one dude, it's Richard. He's the only guy running SQLite.

He's awesome. Okay, thanks.

The optimizer: is that a cost-based optimizer, or is it purely based on heuristics?

It's not even based on heuristics; it's way more primitive than that. We hope to one day get to a heuristic optimizer, and from there to a cost-based optimizer. To have a cost-based optimizer, you have to be able to understand the statistics on the table: the key cardinality and distribution. We don't have any of that yet. Literally, the kind of thing we're doing here is: instead of performing a cross join when joining two tables, we'll convert one of them to an index lookup. That's the kind of optimization we're talking about. So it's very primitive; a long way to go.

I mean, that's what everyone builds the first time, so it's not a surprise. But that's important, right? Because unless you're going to show me additional use cases, if you want people to play in the Dolt ecosystem and use the SQL interface, or go through JDBC to connect to you, this is not something you'd want to be doing transactions on. This is for data analytics, like all the data analysis Oscar was showing. But if you don't have a cost-based optimizer, why would I ever want to run complex analytics using Dolt? Wouldn't I just want to dump the data out of your thing, put it into Postgres or SQL Server or something else, make my changes there, and then shuttle them back into Dolt?

Yeah, this is a great point. You're 100% correct: we have a long way to go on performance to satisfy the use case you're talking about. For the time being, Dolt works best for what we call human-scale data: thousands, but not millions, of entries; the kind of thing where, if I do have a complicated diff, a single human can sit down and review the entire thing. To scale up to millions or billions of rows, we have a long way to come on performance. You're 100% right. And in the short term, we do anticipate that people who have analytics-heavy workloads will offload that analysis to a faster, more performant system, something like Postgres, or maybe even Bigtable or BigQuery, and we're working on tools that make that kind of interchange foolproof and easy to do.
Right now, wherever you see a CSV shared on the internet, like on GitHub, Dolt is better. Wherever you see an API, Dolt is better. Over time, if we can get a foothold in those use cases, we will get better and better at traditional database use cases. But APIs and CSVs on the internet are a terrible solution for sharing data between human beings.

Absolutely, yeah. This is the classic maneuver you're using here, which is: yes, we're terrible, but we're not nearly as terrible as the thing we're trying to replace.

Right. We're still a huge step forward in the right direction, but you're 100% right that we have a long way to go on performance, and we plan to get there. We're about four times slower than MySQL. There are probably use cases you could find where we're doing worse than 4x, but that's a pretty good average.

So that's go-mysql-server. If you want to check it out, please do. Like I said, we adopted this from an organization that went bankrupt, and we're very interested in having people collaborate with us and improve it. We're using it for Dolt, but it is a general-purpose engine, and anybody who has data that can be exposed through SQL can benefit from helping us make it better. So if that sounds interesting to you, please check it out; it's on GitHub. One shout-out: there's another open source SQL engine that does something similar, called OctoSQL. It bills itself as a way to connect any two data sources together; you can give it CSVs and Postgres databases and whatever else, and it'll let you run queries across them. go-mysql-server is notable in that, one, it provides a SQL server interface that you can connect to, which is actually really valuable, and two, it's a read-write interface. OctoSQL is read-only, so it's useful if you're just doing reads, but if you want a full-featured engine that lets you do updates, inserts, et cetera, you really have to do something like what we're doing.

So, we started at the top of the architecture diagram; now we're going to dive way back down to the bottom and talk about the byte store itself. The byte store we have for Dolt is based on Noms, which was another open source graph database product. Dolt is Git for data, but the objects out of which the Git-style commit graph is built are tables, whereas the original Noms database was itself a graph database, which means it has no notion of tables. It has no notion of schema, you can't run SQL on it; you have to run GraphQL. So it's not a data format that can be easily queried by someone who's used to working with CSVs or Excel spreadsheets; you really have to write a program to efficiently query a Noms graph. What we've done is build these table, column, and index semantics on top of this graph database, in order to make it workable and compatible with the rest of the SQL world.

What's special about Noms is that it's a content-addressed datastore, so that every value can be retrieved by its cryptographic hash. In terms of storage, it's split into two parts: you've got a bunch of raw byte files that are just split into chunks and sit in a directory somewhere, and then you have an index structure, which Noms calls a table file, that provides a lookup from the hashes of these objects to the files where you can find them and the offsets within those files. Pretty straightforward.
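A minimal sketch of that two-part layout: content-addressed chunks plus a table-file-style index from hash to offset. sha256 here is just for illustration; Noms's real address format differs:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

type Addr [32]byte

// ChunkStore appends raw chunk bytes end to end and keeps a side index,
// like a Noms table file, from content hash to (offset, length).
type ChunkStore struct {
	data  []byte
	index map[Addr][2]int
}

func NewChunkStore() *ChunkStore {
	return &ChunkStore{index: make(map[Addr][2]int)}
}

// Put stores a chunk and returns its content address. Identical chunks
// dedupe for free: same bytes, same hash, same index entry.
func (cs *ChunkStore) Put(chunk []byte) Addr {
	var a Addr = sha256.Sum256(chunk)
	if _, ok := cs.index[a]; !ok {
		cs.index[a] = [2]int{len(cs.data), len(chunk)}
		cs.data = append(cs.data, chunk...)
	}
	return a
}

func (cs *ChunkStore) Get(a Addr) ([]byte, bool) {
	span, ok := cs.index[a]
	if !ok {
		return nil, false
	}
	return cs.data[span[0] : span[0]+span[1]], true
}

func main() {
	cs := NewChunkStore()
	a := cs.Put([]byte("a chunk of table data"))
	b, _ := cs.Get(a)
	fmt.Printf("%x -> %q\n", a[:4], b)
}
```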
The way Noms stores data is as a Merkle DAG: a B-tree-like structure where the nodes, the leaves, the intermediates, and the roots, are all content-addressable. Content-addressability is important when you're building a product that's like Git, because it means that when you're comparing two trees for their contents, if you find two nodes in the two trees that have the same hash, you know those nodes are identical, both their contents and, recursively, all their children's contents.

I'm going to dive into the structure called a Prolly tree, which is how we get structural sharing across revisions; we think it's pretty interesting. The Prolly tree is kind of the magic sauce that makes Noms interesting, makes Dolt's Git-for-data solution possible, and keeps our storage costs from ballooning as much as you might think they would. A Prolly tree is like a B-tree, but it's better for content addressing. There's nothing stopping you from taking an existing B-tree and just making it content-addressable; that's pretty easy to do. The problem is that when you do that, you're going to blow up the storage, because B-trees are relatively inflexible in how they're built. A Prolly tree stores all its key-value data in the leaf nodes, just like a B-tree; it's got internal nodes with key delimiters and child pointers, just like a B-tree. The difference is that the leaf nodes and the internal nodes are all variable-length and content-addressed, and this is an important aspect of how we get structural sharing and avoid blowing up our AWS bill. Unlike in B-trees, the child pointers in internal nodes aren't pointers; they're not memory addresses. They're content addresses, referred to by their hash, and the children can live anywhere on disk; you just follow where they live based on the index. The last property, the one that's really important for our purposes, is that unlike in a content-addressed B-tree, small insertions or mutations typically don't require rewriting large parts of the tree. Typically you only have to edit one block and its immediate ancestors; you don't have to edit any siblings. We'll examine why that's the case.

Okay, diagram time; this is the interesting slide in the talk. On the top we have a B-tree. This should look familiar to everyone in the audience: it represents a root node and three intermediate nodes (the actual leaf values are omitted). On the bottom we see a Prolly tree. These two trees store the same information; if you look, you'll see that at the lower level they both store the values 1, 2, 5, 6, 9, 12, 18, 21, but they're storing them in slightly different ways. Two things should jump out at you. One is the hash values: instead of following memory pointers or direct file-offset pointers to find the children of a node, we take a hash and follow it, through an index, to where the child lives on disk. So in every node, the root and the intermediate nodes, all the children are referred to by their hash values; follow the hash and you find the child.
The second thing that should stand out to you is that the nodes are all variable-length. If you look up at the B-tree, it's a four-way B-tree, so every node has four children. If you look down at the bottom, every node has a variable number of children: the root has three, the one on the bottom left has one child, and so on. They have a maximum, but they can actually hold fewer than the max, fewer than the chunk size can fit, and this is important for the structural sharing.

Wait, so for the hash lookup, maybe that's the next slide, there's a hash table that maps the hash to its location on disk?

That's right. The second important piece, outside of the tree structure and the raw byte storage itself, is that we have an index: everything that is content-addressable in the database has a direct lookup to where you can find it on disk.

And what is that data structure?

Noms calls this a table file, but it's basically just a hash map: you take the cryptographic hash of the node, and you follow it to where it lives on disk.

What kind of hash table?

That's a good question; let's come back to that at the end. There are a couple of other engineers from Liquidata on here who are probably better equipped to answer that than I am.

I'm just curious; it's a long-standing issue, which one you'd actually use.

Okay, yeah, we'll come back to that at the end if we have time.

So let's look at how we build a Prolly tree, and talk about why it actually gives us this nice structural sharing and avoids killing our AWS bill. Unlike in a standard B-tree, the chunks in a Prolly tree are probabilistically sized; they're not all the same size. What you put in each one, and how full it gets, is a function of a probabilistic rolling hash. If you look down at the bottom diagram, we have two parallel rows: on the top is a hash, and on the bottom is the content. The way we build up our Prolly tree is by inserting the elements on the bottom and computing a rolling hash of the contents of the chunk so far. And we have a rule: whenever the rolling hash value falls below a threshold, we stop that chunk and start a new one. For this example we've chosen a threshold of 10; in reality you choose this by doing some math to figure out what average chunk size you want, based on the size of your hash. You don't know ahead of time exactly how big each chunk is going to be, but you do know what the average chunk size will turn out to be in the long run. So as we feed these values in, 1, 2, 5, 6, 7, 9, you can see that whenever the rolling hash falls below the threshold, we draw a vertical line and start a new chunk boundary. Here this gives us blocks of one value, four values, and three values, which match the intermediate nodes we see in the Prolly tree. Notice that these blocks aren't as full as they could have been: we've probabilistically left some space in there, for a very important reason.
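Here is a runnable toy version of that chunking rule. The hash function and constants are stand-ins (Noms uses a different rolling hash), but the property is the same: a boundary depends only on the last window bytes, so a local edit can only move boundaries near the edit:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

const (
	window = 16         // bytes of context the boundary decision sees
	base   = 0x01000193 // multiplier for the polynomial rolling hash (toy choice)
	mask   = 1<<5 - 1   // boundary when the low 5 bits are zero: ~32-byte chunks
)

// chunk splits data wherever the rolling hash of the last `window` bytes
// falls in the boundary set. Because the decision only looks at a small
// window, chunk boundaries re-synchronize shortly after any local edit.
func chunk(data []byte) [][]byte {
	pow := uint32(1)
	for i := 0; i < window; i++ {
		pow *= base
	}
	var h uint32
	var chunks [][]byte
	start := 0
	for i := 0; i < len(data); i++ {
		h = h*base + uint32(data[i])
		if i >= window {
			h -= uint32(data[i-window]) * pow // slide the oldest byte out
		}
		if h&mask == 0 { // "hash below threshold": close this chunk
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	before := bytes.Repeat([]byte("the quick brown fox jumps over the lazy dog. "), 8)
	after := append([]byte("EDIT! "), before...) // insert at the very front

	// Most chunk hashes match across the two versions: only the chunks
	// near the edit change, which is the structural sharing that keeps
	// the storage (and the AWS bill) small.
	for _, d := range [][]byte{before, after} {
		for _, c := range chunk(d) {
			sum := sha256.Sum256(c)
			fmt.Printf("%x (%d bytes)\n", sum[:4], len(c))
		}
		fmt.Println("--")
	}
}
```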
Let's talk about some of the advantages of the Prolly tree over a content-addressed B-tree. The first one is the big one: structural sharing. The small changes that we typically make, inserting a few new rows, editing the values in a handful of rows, are usually limited to a single chunk. So you have to rewrite that chunk, and you have to rewrite all of its ancestors in the tree back up to the root, but you're typically not overflowing those nodes into their siblings and having to rewrite all of those siblings and their ancestors all the way back up to the root. Then, of course, you have trust and verification: because it's a cryptographic hash, if the hashes of two nodes match, you know those parts of the tree are in fact identical. And this makes it efficient to recursively diff two trees, which is important when we're syncing trees from a remote repository to a local one. If the content addresses of two nodes are different, we know we have to recurse and figure out what's different underneath; if they're the same, we can just stop, because we know that part of the tree hasn't changed. So it's efficient to synchronize and efficient to diff these data structures.

Here's an illustration of how the structural sharing comes out in kind of the best case. In the top left you can see where you've added a single new chunk to the database; the new chunk is represented in blue, the old chunks in gray. When I overwrite that chunk in the middle, I'm changing its ancestors all the way back up to the root, but typically none of the siblings need to be rewritten, because we left some space, probabilistically, for new changes to come in. This is true not only when adding a single new chunk in the middle, but also a run of chunks in the middle, or at the beginning, or at the end. Again, this is probabilistic, so we can only talk about it stochastically, but on average, when you make these kinds of modifications, you don't suffer any spill-over into adjacent sibling chunks, which means you don't have to rewrite the entire table.

Looks like Andy has a question?

I was picking up the dog, sorry.

Okay, no problem. Any questions? This concludes the section about the base-layer byte storage that we use, and we're going to move on next to talk about how Dolt is built on top of it, in terms of how we define the columns and rows and so on. Any questions before we move forward?

I think we're short on time, so I would jump to the next part.

Great. Just one quick question on the last one, sorry. Ah, you're supposed to say who you are; my name is Anthony Tomasic. I'm a database guy too, but I have gray hair, unlike Andy. If you insert at the beginning of a file here, such that all the bytes are shifted down, does the structural sharing still apply?

It applies on average, right? On average we don't fill up these chunks as full as they could possibly go; we choose a threshold for the probabilistic rolling hash that leaves enough space, on average, that we don't need to do that. You can of course come up with antagonistic insert sequences that really abuse this thing and cause the storage to blow up, but it's optimized so that, on average, you don't spill over into sibling nodes most of the time.

Right, thank you.

Yep.
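A small sketch of that pruned recursive diff; the node layout is invented for illustration, with internal hashes computed over child hashes, Merkle style:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// Node is an invented illustration of a content-addressed tree node;
// Dolt's real nodes are the Prolly tree chunks described above.
type Node struct {
	Hash     [32]byte
	Children []*Node // nil for leaves
	Data     []byte  // leaf payload
}

func leaf(data string) *Node {
	return &Node{Hash: sha256.Sum256([]byte(data)), Data: []byte(data)}
}

func internal(children ...*Node) *Node {
	var buf bytes.Buffer
	for _, c := range children {
		buf.Write(c.Hash[:]) // an internal node's hash covers its children's hashes
	}
	return &Node{Hash: sha256.Sum256(buf.Bytes()), Children: children}
}

// diff collects leaves that differ, never descending into subtrees whose
// root hashes match. (Simplification: assumes equal arity; real Prolly
// trees align children by key range instead.)
func diff(a, b *Node, changed *[][]byte) {
	if a.Hash == b.Hash {
		return // identical hashes mean identical subtrees: prune the walk
	}
	if a.Children == nil && b.Children == nil {
		*changed = append(*changed, b.Data)
		return
	}
	for i := range a.Children {
		diff(a.Children[i], b.Children[i], changed)
	}
}

func main() {
	before := internal(
		internal(leaf("row1"), leaf("row2")),
		internal(leaf("row3"), leaf("row4")),
	)
	after := internal(
		internal(leaf("row1"), leaf("row2")), // shared subtree, skipped with one hash check
		internal(leaf("row3"), leaf("row4-edited")),
	)
	var changed [][]byte
	diff(before, after, &changed)
	fmt.Printf("%s\n", changed) // [row4-edited]
}
```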
So let's talk about how we use this to implement Dolt. What we've talked about so far is really raw byte storage: how these things are stored on disk and accessed. But like I said at the beginning, Noms is not a relational database; it's a graph database. It doesn't understand rows, doesn't know what a table is, doesn't know what a schema is. We built all that on top of Noms, and this next set of slides explains how.

A Dolt repository, if you dig around in the .dolt directory on disk you can find these things sitting in a little JSON file, stores a bunch of pointers into this object-storage cloud that is Noms: specifically the head, the working set, and the staged root, for all the branch refs you have defined. These are terms that should be familiar to you if you've worked with Git; they serve exactly the same function here. You have a working set, which is the stuff you've checked out but not necessarily committed. It may or may not be the same as HEAD, it may or may not be the same as what's staged, but regardless, it's a pointer to a data structure that lives somewhere inside this object-storage cloud.

If we dig into these clouds and follow some of these hashes to the values they represent, we find these things called root values. A root value is how Dolt defines the top-level database objects; again, these are the things we need in order to build table, column, and schema semantics on top of the graph database that is Noms. So we define, for example, a set of tables, some foreign keys, and some other bookkeeping that we need to keep track of things in the database and make queries efficient, and so on. If we follow one of these table references, we find a map from table names to table values. Pay attention to the long nonsensical hash strings at the top of each of these: I'm trying to represent the fact that, in each case, these are content-addressed hash references that we then follow using the same mechanism we talked about in the prior part of the talk. You can think of it exactly like following a pointer, except it goes through this index intermediary to figure out where on disk these things live. So the root value has these tables; the tables point, via these indirect content-addressed hash pointers, to table struct values; and the table structs themselves point further into the object cloud with more hash child references. A table struct defines a schema, which is basically what columns are on this table and what properties those columns have, such as their type and nullability, all the things you'd expect from a relational database. It defines a pointer to a map of table data, which we'll get to shortly, and then things like the conflicts on that table, which is a kind of temporary holding cell for any conflicts that might exist as a result of a merge, and the indexes defined on that table: all the kinds of things you'd expect from a relational database.

And here we're at the bottom layer: this is the table data that Dolt actually stores at the end of the day. Pay attention first to the bottom part of the slide, the map structure. Table rows in Dolt are represented as a map of tuple to tuple. The tuples are always even-length, because they are pairs of tag number and value. If you look at the top half of the slide, you can see that each of the three columns in this example table has a tag number associated with it: id is tag 3, first name is tag 10, last name is tag 4. The reason we associate column values with these tag numbers, instead of just the name of the column, is so that things like column renames work transparently and don't cause us to change any storage; a rename becomes a schema-only operation rather than a data-changing operation. If you look at the top table and the bottom map, they look very similar: in each case we're simply enumerating the tag of a column followed by the value that column has for that row. Every row in the table is an entry in the map, every key column in the table is an entry in the key tuple, and every non-key column is an entry in the value tuple. You'll notice the row with id 3 only has a first name; there's no last name, and in that case we just don't store a value for it in the tuple. Pretty straightforward.
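In code, the idea on this slide looks something like the following (types simplified; Dolt's real tuples live in Noms storage). The point is that a rename touches the schema map and not a single row:

```go
package main

import "fmt"

// Tag is a stable column identifier; rows reference tags, never names.
type Tag uint64

type Column struct {
	Name string
	Type string
}

// Schema maps stable tags to column metadata.
type Schema map[Tag]Column

// A Tuple is an ordered list of (tag, value) pairs; a row is a key tuple
// (the primary key columns) mapping to a value tuple (everything else).
type Entry struct {
	Tag   Tag
	Value interface{}
}
type Tuple []Entry

func main() {
	schema := Schema{
		3:  {Name: "id", Type: "int"},
		10: {Name: "first_name", Type: "varchar"},
		4:  {Name: "last_name", Type: "varchar"},
	}

	key := Tuple{{Tag: 3, Value: 1}}
	val := Tuple{{Tag: 10, Value: "Tim"}} // tag 4 absent: missing values simply aren't stored

	// A column rename rewrites the schema entry only; key and val above
	// are untouched, so no row data gets revisioned.
	schema[10] = Column{Name: "given_name", Type: "varchar"}

	for _, e := range append(append(Tuple{}, key...), val...) {
		fmt.Printf("%s = %v\n", schema[e.Tag].Name, e.Value)
	}
}
```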
I think this is the last slide, so thanks for listening so far. Anybody who's held off with a question, we can run through those until we're out of time.

Right, just another question. If I understand the structure right, all the metadata is automatically versioned as well?

That's correct.

So you're completely blending the data and the metadata into this Git repository, so to speak?

That's right. If you look at this slide here with the table structs, these are the objects we're actually storing in the graph database that we built on top of.

Right. So the one thing I don't get is how I use this. Here's a typical thing I do with data integration: I'll have a table, and I'll run a cleaning operation on one of the columns, and then I'll take that output and dump it into some temporary table, because I know I'm about ready to do something else with it. So do I check out the table, then execute the SQL and insert it as a new transaction as part of that batch checkout, and then commit that table? Is that how it works?

You missed the demo in the beginning?

Oh, I apologize.

No, don't be sorry; this is excellent for people who are following along at home or viewing later. Think of a Dolt database as a Git repo. Just like you wouldn't check out a particular file with Git, you check out the entire repo, the same logic applies here. I check out my repo, I make whatever edits I want, I examine whether they're correct by running dolt diff, which gives me a summary view of what's changed, and if I'm happy with it, I add the tables I want to commit and then type dolt commit, and that commits a new revision of the database. If you want to do this on a branch, you type dolt checkout beforehand to start working on a branch, and then you can send that off for a pull request review, or merge it back in. Any of the Git workflows you're used to translate pretty naturally; just think of working with versioned tables instead of files.

I understand entirely, thank you.

Yep.

Okay, so this is normally when we would applaud if we were on campus, but we can't do that, so I'll applaud for everyone. Do we have time for maybe one more question before I ask my question? If anybody else wants to go at it, go for it.

Okay, so the Prolly tree is based on the primary key, correct?

No, the Prolly tree is content-addressed, so it's a hash of the bytes that are actually stored on disk.

Okay, so the key: the primary key, as well as the rest of the content of the row. So if I want to declare a secondary index, how would that work here? Because if it's versioned, if I make a change to a row, you add a new little entry here, but now if I'm updating a secondary index, that corresponds to another thing that gets changed. Is that just treated as another update, so that one update to a row will cause multiple changes to the secondary indexes?
Okay, this is normally when we would applaud if we were on campus; since we can't do that, I'll applaud for everyone. We have time for maybe one more question. Before I ask mine, if anybody else wants to go at it, go for it.

Okay, so the Prolly tree is keyed on the primary key, correct?

No, the Prolly tree is content-addressed: it's a hash of the bytes that are actually stored on disk.

Okay, so the key is the primary key as well as the rest of the content of the row. So if I want to declare a secondary index, how would that work here? Because if it's versioned, then when I make a change to a row you add a new little entry here, but updating the secondary index corresponds to another thing that gets changed. Is that just treated as another entry, so that one update to a row causes multiple changes to the secondary indexes?

Yeah, that's exactly right. If you look at this diagram, you'll note that the indexes of a table point off into another map of tuple to tuple. We've essentially done exactly the same thing, except that we duplicate some of the storage of the table and key it on a different value: the set of columns you selected for that secondary index. And then, as in most databases, when you do a write, the cost is going to be proportional to how many indexes you've declared.

Okay, but then, and do go back to the next slide. Yep. This thing here: there's nothing about the tuple key-value pairs of your map that says you have to store them as a row store. If your location map were actually an offset within a column, you could compute where those columns appear and store this physically as a column store, which would give you all the benefits you'd want for analytics or compression.

Yes, this is something we talked about a lot in the early design phase, and it's a trade-off. From an engineering standpoint it's a little harder, and it also has a different set of trade-offs. We mentioned at the beginning that Oscar ran a diff with a giant spew of output, and the question was why it was so wide. That's a concern, not only for the output and being able to analyze it, but also because it increases storage costs: if you look at the rows in this map, the addressable unit for versioning is the row, so if I change one of these values, I duplicate the storage for all the other values in that row too. And like you said, it's simple enough to imagine ways around this. We could have column families, for example, which would let us separately version some of the columns on a table, and these are things we've talked about. But given the engineering trade-offs, as well as what they would do to some of the algorithms we're trying to run, we decided to go with row-major exclusively for now, and we'll probably return to it as customers ask for more.

Okay, and then my last two questions, sorry, because again this is way different from everything else. Garbage collection, is it all manual, or...?

Garbage collection is non-existent right now. The amount of garbage generated by this process can be relatively large for a local version, but the nature of walking the tree starting from the root and pushing it to a remote naturally cleans up anything that doesn't have an active reference pointing to it. So when you're running table operations and updating stuff on disk, and you finally make up your mind and say, this is what I'm going to commit, and push it to master, you might leave a lot of intermediate representations on disk that aren't referenced by any commit. Those are garbage. We don't currently have a way to clean them up, but the process of pushing to the remote does it naturally, so somebody else who clones that data set will only pull down the bare minimum data. And this happens as a matter of course, just because of the algorithms involved.
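Here is a toy Go sketch of the reachability rule Zach describes, with invented chunk names and a deliberately simplified chunk type; Dolt's real chunk store is more involved:

```go
package main

import "fmt"

// Hash addresses a chunk; Chunk carries its child references.
type Hash string

type Chunk struct {
	Data     []byte
	Children []Hash
}

// copyReachable walks the chunk graph from the given roots (branch heads)
// and keeps only the chunks an active reference can see. This mirrors how
// a push transfers reachable data and leaves local garbage behind.
func copyReachable(store map[Hash]Chunk, roots []Hash) map[Hash]Chunk {
	out := make(map[Hash]Chunk)
	stack := append([]Hash(nil), roots...)
	for len(stack) > 0 {
		h := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if _, seen := out[h]; seen {
			continue
		}
		c, ok := store[h]
		if !ok {
			continue // dangling reference; ignored in this sketch
		}
		out[h] = c
		stack = append(stack, c.Children...)
	}
	return out
}

func main() {
	store := map[Hash]Chunk{
		"commit1":  {Children: []Hash{"rootval1"}},
		"rootval1": {Data: []byte("tables, schema, rows...")},
		"orphan":   {Data: []byte("abandoned working-set edit")}, // garbage
	}
	live := copyReachable(store, []Hash{"commit1"})
	fmt.Println(len(live), "of", len(store), "chunks are reachable")
}
```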
Okay. Sorry, can I interject? Puyah from Sydney, hi Puyah, doesn't have a mic but has a question: it's about concurrent access to Dolt and how it works.

I'm not surprised; I'm going to hand this one to Zach.

Yeah, so the short answer is that it doesn't, or rather, it's possible to run Dolt concurrently, but it doesn't work the way a normal application server would. You can't run dozens of concurrent connections all issuing updates and writes and expect things to function; in fact, when you connect to Dolt through the SQL server, we currently limit you to a single concurrent connection, unless you give a special flag at the command line that says, I really know what I'm doing, I know what I want. The way you work with multiple concurrent users currently is with an analog to Git's detached head: you're not updating the working set, everybody's working on their own head, and if you want to merge those changes back into the working set, you have to do so explicitly, with a set of special SQL commands or from the command line. Over time we do want to add support for concurrent interaction, and we're currently debating exactly what that looks like, how it'll work, and how to keep people from stepping on each other's toes.

Okay, that was a good question. All right, we're way over time, so I apologize if people have to go. Now, on DoltHub, are you storing it as a bunch of EBS files? Are you smart about shoveling things off to S3 to save money? Is there any strategy there, or have you just not gotten there yet?

That's a great question. Everything is stored in S3, and in fact, if you go to DoltHub and browse a repository, it's pulling data directly from S3; it's not doing any intermediate caching except in memory on whatever service host you happen to hit. And this works surprisingly well. It turns out that if you do some intelligent pre-fetching of things you know you're likely to access, you can avoid all the round trips that would otherwise make performance really bad.

It's not all S3; it does use...

Brian, can you speak up? It's hard to hear you.

Sorry. It does use DynamoDB as well, in addition to S3, as a caching layer. It uses it to store something called the manifest, which is the list of all the table files, but additionally, I believe that in some cases it will store some chunk files in there as an optimization.

Just to be clear, Brian is a co-founder and senior engineer at Liquidata, so he has intimate experience here. He's not a rando.

Yeah, I assumed so, yes.
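As a rough illustration of why that pre-fetching works, here is a toy Go model, not DoltHub's actual code: because a parent chunk names all of its children up front, each level of the tree can be fetched in one batched request instead of one round trip per chunk.

```go
package main

import "fmt"

type Hash string

// Node is a stand-in for a decoded chunk with child references.
type Node struct{ Children []Hash }

// remote simulates object storage; every fetchBatch call is one round trip.
var remote = map[Hash]Node{
	"root": {Children: []Hash{"a", "b"}},
	"a":    {Children: []Hash{"c"}},
	"b":    {},
	"c":    {},
}

var roundTrips int

func fetchBatch(hashes []Hash) map[Hash]Node {
	roundTrips++
	out := make(map[Hash]Node, len(hashes))
	for _, h := range hashes {
		out[h] = remote[h]
	}
	return out
}

// walk fetches the tree level by level, pre-fetching each whole level
// in a single batched request.
func walk(root Hash) {
	level := []Hash{root}
	for len(level) > 0 {
		var next []Hash
		for _, n := range fetchBatch(level) {
			next = append(next, n.Children...)
		}
		level = next
	}
}

func main() {
	walk("root")
	// Three round trips for four chunks, versus four with a naive
	// one-request-per-chunk traversal.
	fmt.Println("round trips:", roundTrips)
}
```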
My last question: Noms got you going, you know; you could build off Noms and get the MVP up and running pretty quickly. I wonder, if you had built from scratch on an MVCC system, whether you could have just used the deltas you generate when you do transactions under MVCC, so that those become the atomic unit of a diff. Because even now, if you want to add transactions, I mean, you could do two-phase locking, but this is essentially a single-version system. Is that true?

Yeah, within a connection, yes. I mean, there are lots of ways we could address that. I think the simplest answer to your question of why we didn't just use row-based diff semantics is that it doesn't give us a commit graph. We started with the assumption that the Git model was actually vital for what we're trying to do, and we're coming at this with a mission not primarily of building a performant database server, but of making data collaboration and exchange possible. With that primary goal in mind, we decided that the Git commit graph, branch and merge, clone and fork, were all indispensable, and you can't really get those from just a diff of rows; you do need a natural commit graph.

To Zach's point, if you recall the mapping-table example: we're taking third-party data, writing it into our database, then transforming that data to produce derived data, and so on and so forth. The implicit value proposition is that there is this underlying data graph, and you can roll back and diff and merge; without that, you lose what is kind of the core of what we're trying to achieve.

Sorry, this is Anthony. It's much more for scientific workflows than it is for, you know, versioning of transactions.

Yeah, that's right. And the reality is that without this, you can't merge two arbitrary copies of the database. If you do it the other way, you can get a good diff, but merge becomes unboundedly complex. The concrete use case for merges, I think, is when you're subscribing to an upstream data set, using it in a business process, and you have a set of local overrides; that is a real devil of complexity in practice. Here you can maintain your own set and then merge in updates, so you continue to subscribe to updates from an upstream data source while keeping your own local customizations, without writing any custom applications.

And because you're dealing at the higher, logical level, you could still lay it out physically in different ways.

All right, we're like fifteen minutes over time and I have to help my wife. There's a whole other topic about how you actually let people monetize data on DoltHub; I'm sure you guys have thought about that, and that's a whole other awesome thing. I want to give props to Puyah for joining from Sydney, where it was five a.m. That's commitment.

What's that?

That's commitment, that's commitment to databases, and that's awesome. All right, we thank Oscar and Zach and our friends at DoltHub for being with us today.