The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

We're back with another talk in the vaccination database seminar series. Today our guest is Dr. James Cowling. He is the co-founder and CTO of Convex, a new serverless backend startup that was founded in the last year. Prior to that, he was a senior principal engineer at Dropbox, working on their data infrastructure. And prior to that, he was a PhD student of Barbara Liskov at MIT. And prior to that, he was ranked as the number one Australian computer scientist, I think in 2008 or 2009. Very prestigious award. So with that, James, thank you for being here. It's an honor. Go for it. Oh, I'll say it quickly: if anybody has any questions for James, please unmute yourself, say who you are, and ask your question. And feel free to do this at any time. We want this to be a conversation and not James talking to himself for an hour. Okay? James, go ahead.

Thank you, Andy. Yeah, I really like the in-person talks where I get that interactivity. I may not see you on Zoom if I'm just in the zone, so feel free to just jump in and shout out. So thanks for coming. I'm going to give a talk today about the database and infrastructure work we're doing at this company called Convex, and I'm going to talk specifically about how we're designing it to work for serverless developers, for folks who don't want to run their own infrastructure or even have an infrastructure team. At the time of this talk, we haven't even released our beta yet, so this is all very early stuff. Our beta is coming out in about a week or two, so feel free to sign up at convex.dev. Given this very early stage, I'm going to talk about some features that exist in our beta, some features that haven't been officially released yet, and some things will just stay kind of secret-sauce startup stuff. There's a lot of content to cover here. Most of it will be fairly high level, but also quite technical by necessity, and I'm happy to dig in wherever people want to jump in with questions.

So who am I? Andy's already given a pretty good intro. I'm a friend of Andy's, so perhaps there's some nepotism involved here, but I've got a lot of experience building some big infrastructure. I did my PhD in distributed systems on a system called Granola, on efficient large-scale distributed transaction coordination, and worked on consensus protocols. Then at Dropbox I worked on a lot of really large infrastructure. One of the big projects I led there was the migration of Dropbox's data off of Amazon S3 onto a system called Magic Pocket. It's a system that stores multiple exabytes of data, so it's huge: close to a million hard drives, a million separately addressable nodes, and it has never lost data, a very reliable system. I was the tech lead for metadata at Dropbox, including a system called Edgestore, which is a distributed database that runs millions of queries per second, and then also the tech lead for all persistence systems at Dropbox: caching, big data, data centers, multi-homing the company. So I worked pretty extensively on designing and operating very large, very efficient, and very reliable distributed databases and storage systems, and then I started a startup. So this talk is about how we can make databases that are more usable.
And by more usable, I mean more easily accessible, able to be treated like an appliance. Really, what are the barriers that get in the way of people who just want to build stuff, and how can we make it easier for them to just build stuff? I know these talks are meant to be technical, and there's a lot of tech stuff that's going to come into this presentation, but the number one thing I was lacking when I was in grad school studying distributed systems was an actual, real perspective on what's going on in the outside world and what people's challenges are. So I'm going to start with some of that.

I used to do a lot of advising to other companies on their infrastructure, mostly for free, and I mentor a lot of senior engineers at tech companies, so I got to observe what's going on within infrastructure teams at these big companies. Then I left and started a startup, did a bunch of market research, and talked to smaller startups about what their needs and challenges are. In this process, I tended to encounter two archetypes of engineers. The first archetype I'm terming the enthusiast. I primarily encountered these people because I worked with folks who run database teams in various companies. What characterizes this archetype is that these folks were deeply into tech; they're oftentimes into the tech for tech's sake. So they had technical ambitions that sometimes outstripped their actual workload. Typically I'd chat with a team and they'd say, hey, we're building a distributed database, or we're writing our own consensus protocol, or we're going to migrate off of our existing MySQL instance and deploy a big CockroachDB cluster in our company. Then we'd chat about what their use cases were, and they were dealing with hundreds or thousands of queries per second, or data sets under a terabyte. Mostly my advice was, hey, just run this stuff on a single replicated Postgres instance on RDS. All of us are presumably these people, because we're all at a database talk right now. But we have to be careful to build features that are actually relevant to the customers that we have and the organizations that we work in. My feeling is that these custom systems typically have a very high long-term maintenance cost, and most companies should not be building deeply custom infrastructure at all. They should be using something off the shelf.

But when I started looking around for business ideas post-Dropbox, I encountered a different archetype. These are people who just wanted to get it done. They had an idea for a product, they wanted to build it, and infrastructure was getting in their way. And interestingly, I found myself in that category. I just wanted to build stuff, and it was a real hassle spinning up all this new infrastructure from scratch. This category is growing really fast, really fast. There are more teams than ever making software. They're laser focused on customer value. They don't really care about technology for technology's sake. And there's a growing appetite to just delegate complexity to off-the-shelf services where they can. But when we'd go to these companies and say, you know, what do you need right now, the message we often heard was something like, oh, we just really need someone to manage our Kafka cluster. And I'd be like, well, why are you even running Kafka? I don't know, why do you have to have a Kafka cluster to begin with?
And it was because of something silly, like it was a message queue because the database kept falling over because it had a bad connection pool implementation, a very common thing amongst databases. So what these two groups have in common is that they're typically forced to deal with details that aren't really relevant at all to their business. And they're doing so fairly rationally, because it's not possible to do otherwise. They don't have these clean abstractions that allow them to hide away the underlying complexity of data management, and they end up focusing on building stuff that's really quite tangential to the business they're actually in. So oftentimes I'll hear, you know, I need control over my systems; I want to build a good product; I need to have high control over my database, for example. And, you know, my retort is that this is really because there's a kind of failure of clean abstractions in the database world. Oftentimes I wonder how many users of Amazon S3 know how it stores its files. I'm going to say essentially zero, right? People just trust S3 to do its thing, because it stores your files and then gets out of the way, right? And most customers need databases to store their data and then just get out of the way.

So one of the most promising trends underway is this era of serverless computing that we're entering. And this might be quite a departure from how folks on this call might think about how the next generation of developers are building their products. So what is this serverless computing? I guess we could call it a revolution, but it's kind of underway right now. It's pioneered by companies like Netlify and Vercel, and you can Google this Jamstack architecture. Essentially, developers are writing their web apps in JavaScript, right? They get deployed with these deployment tools to CDNs, and then the CDNs basically host the website, and any function execution happens in lambdas. There's actually no server running that the developer can perceive, right? There's a development workflow, and then a CDN cluster talking to lambdas. You see companies like FaunaDB have pivoted to now be, like, the serverless database; Fauna's website is all about serverless. Systems like Firebase have done a great job in this space.

These systems work great for mostly static content. Netlify, Vercel, et cetera, are really great for fairly static websites, but they're not great for really dynamic content right now. Dynamic content is a hassle. You have to identify what can be generated at build time, and then when you want dynamic content, you have to play a bunch of tricks, like something called SWR. SWR stands for stale-while-revalidate. This is a technique that web developers use when rendering dynamic content. Essentially it means: show stale data to the user, run a background fetch for new data, and if it's different, refresh it later on. This is obviously clumsy and obviously a hassle to deal with in these systems. In my experience, the query languages that exist right now are also really quite hard to use. No shade on, say, Fauna, but I find FQL, the Fauna query language, quite difficult to use, and as a database person, I imagine it's even more difficult for someone with no database experience. And one comment we hear all the time from Firebase customers: Firebase customers tend to love Firebase, but they're always talking about the day they have to move off it. We're really dreading the day we have to move off Firebase.
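For context, the stale-while-revalidate pattern being described looks roughly like this on the client, sketched here with the swr React hook that's commonly paired with Next.js. The endpoint and the polling interval are made up; the point is just the show-stale-data-now, refetch-in-the-background, re-render-if-changed dance:

```tsx
import useSWR from "swr";

// Hypothetical API route; the fetcher just wraps fetch().
const fetcher = (url: string) => fetch(url).then((res) => res.json());

type Message = { id: string; body: string };

export function Messages({ channel }: { channel: string }) {
  // Stale-while-revalidate: render whatever is cached immediately (possibly stale),
  // refetch in the background, and re-render if the server returned something newer.
  // The refreshInterval is the polling workaround being described in the talk.
  const { data, error } = useSWR<Message[]>(
    `/api/messages?channel=${channel}`,
    fetcher,
    { refreshInterval: 2000 }
  );

  if (error) return <div>failed to load</div>;
  if (!data) return <div>loading…</div>;
  return (
    <ul>
      {data.map((m) => (
        <li key={m.id}>{m.body}</li>
      ))}
    </ul>
  );
}
```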
They know they're going to have to do it because there's no real schema support, the indexes are somewhat limited, it doesn't have a full relational model, it doesn't play nice with analytics, et cetera.

So what primitives would we actually want in this ideal world of not having to understand how a database works? And I'm going to get into some tech details of how we actually implement these later on. We need to support relatively complex transactions. We can't just provide a CRUD interface: create, read, update, delete. We need to support scans and filters and chained queries, because if you don't support these relatively complex query primitives, developers are going to have to do so via round trips to the backend, and that will be inefficient and force them to develop hacks to make it faster. It needs to be correct. And I know that "correct" is not a phrase you hear in database circles because it sounds so vague, right? You probably hear things like isolation levels or snapshot anomalies and phantoms, et cetera. But all developers care about is that their system is correct, meaning the stuff that you show on the screen looks consistent, right? So basically we need to provide serializability, or something similar that they don't have to think about. Caching needs to be integrated into the system. If caching is not integrated, developers have to build it themselves, and reasoning about caching is really quite hard: you have to reason about consistency and cache invalidation, and all of a sudden the abstraction is broken. I still find it amusing that as database designers we've spent all this time building consistent systems, and then in practice someone slaps a memcached cluster in front of it and it breaks all the guarantees, right? So you have this strongly consistent system in the background, and because it won't perform in practice, you slap memcached in front and lose all your guarantees. We have to have gradual ramp-up. The system has to be pretty approachable, right? Because people want to get started, they want something that works easily and helps them get their stuff done. But it has to have a full feature set, so we can't have people migrating off the system because it doesn't have schema support or proper indexes or a relational model or schema migrations. And lastly, there needs to be an escape hatch. A database can't be everything to everyone, right? And if you want to be taken seriously, there has to be a way to point your analytics platform at it or to run SQL queries if needed. This slide is a horrendous diagram; it's that thing called the Data and AI Landscape, from September 2020, and there's just an incredible number of systems in it. I'm sure by now there's double the number of systems in there. You can't directly replace or directly integrate with all of these, so we need some common lingua franca, like a column store or SQL, to allow easy integration with all the other platforms a company needs to use.

So that brings us to Convex, my company. Convex is designed to be reactive. We want to support dynamic content and interactive applications, and you can't support that just with caching and CDNs. The queries in Convex are written in JavaScript; I'll talk a lot more about that. We actually execute these queries on a multi-version concurrency control database backend and provide automatic caching, subscriptions, and invalidation, which I'll talk about more today as well.
We also provide incremental schema support, indexes, a fully relational model, basically all the primitives I mentioned, unsurprisingly, because I wrote that list of primitives. So yeah, we support all of them. And a big selling point of this work is the very seamless integration on the client side, but I'm going to focus on the backend in this talk.

So at a very high level, it looks like this; this is just lifted from the documentation. You see queries and transactions written in JavaScript. I mean, maybe this is, these are all stored procedures? Stored procedures, yes, these are basically fancy stored procedures, and I'll go through this in like two slides, and if there are still questions, we'll come back. Yep. So basically the architecture looks roughly like this. You have JavaScript getting served by Node.js, and it's serving a React app. That app talks to the Convex client bindings, which call into the Convex cloud, where it executes functions, which are basically stored procedures written in JavaScript, and runs these against a database.

So let's start with these functions, to address Andy's question, because they're pretty core to the system. The query model here is deterministic JavaScript. Queries look like this, but they can actually be a lot more complex. The queries can be thousands of lines long. They can import libraries. They can have dependencies, right? You write these big queries, and essentially, at the end of the day, they use a DB object to talk to the database. They can issue as many DB requests as they want in one function. They can write loops. They can read from a table, do some processing, and write back to a different table.

So why JavaScript? Well, we're targeting developers who want to build stuff without thinking about a backend, and these developers are writing their code in JavaScript or TypeScript. It's just the most popular language in the world right now, and that's unlikely to change for quite a while. So you want to meet the developers where they're at, in the languages that they use. By using JavaScript, we can express complex queries, right? We can express multi-phase queries. And if you can express complex queries and put in data transformations and processing, you don't need multiple round trips to the backend. Avoiding interactive transactions is a huge win. You see, there's a lot of research about this kind of one-shot deterministic transaction, because it allows us to avoid long-held locks or frequent OCC conflicts. It also allows us to cache results very easily, which is something I'll talk about quite a bit in this talk. As an aside, we can support more languages via Wasm; that's not something I'm going to talk about today, so the focus is going to be more on the JavaScript transactions.

So why determinism? We don't actually support vanilla JavaScript; we support a deterministic subset of JavaScript, and we're putting some constraints on the developer to make these functions work. So why would we do this? Well, on the write side, it's great. Deterministic functions simplify transaction coordination. If you have a deterministic stored procedure, every node is going to do the same thing, given the same serial ordering. You can always reconstruct the state of the database, given the series of functions that were called in a certain order. And the last three points on this slide are quite related. They all involve reads: deterministic functions allow us to freely reuse query results or partial results safely.
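To make the shape of these functions concrete, a listMessages-style query of the kind being described might look roughly like the sketch below. The API shape here, a db object with table(...).filter(...).collect(), is a simplified stand-in pieced together from the talk's description, not the exact Convex API:

```ts
// Hypothetical, simplified typings for the db object the talk describes.
interface Doc {
  _id: string;
  _creationTime: number;
  [field: string]: unknown;
}
interface TableQuery {
  filter(predicate: (doc: Doc) => boolean): TableQuery;
  collect(): Promise<Doc[]>;
}
interface Db {
  table(name: string): TableQuery;
}

// A listMessages-style query function: plain JavaScript/TypeScript that runs
// server-side as one deterministic transaction. Every db call breaks out of the
// V8 isolate to the transaction engine; everything else is ordinary code.
export default async function listMessages(db: Db, channelId: string) {
  const messages = await db
    .table("messages")
    .filter((m) => m.channelId === channelId)
    .collect();

  // Arbitrary post-processing is fine here: loops, imports, helper functions.
  return messages.sort((a, b) => a._creationTime - b._creationTime);
}
```

This is the kind of function that the client later binds with the one-line useQuery call shown in the chat app example.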
And because of that, we can reorder transactions provided there are no conflicts in their read and write sets, and we'll talk more about that when I come to subscriptions. We're definitely not the first people to talk about why deterministic one-shot functions are good. There are tons of papers about this; I think my PhD thesis talks about it, and the Calvin work is all about this kind of stuff.

So how do we actually make these functions deterministic? Well, first, let's talk about how we run these transactions to begin with. You can think of these transactions like fancy stored procedures. The client writes these functions, they take inputs, and you push the functions as source code to the backend. The backend is written in Rust. It runs these queries in a V8 isolate embedded in Rust, V8 being the JavaScript and WebAssembly runtime. So there's a client in the web browser in JavaScript, and it calls into a web server in Rust. Actually, I'll show you a diagram right now as we go through; this might be easier, right? The client in JavaScript here, in this very simplified diagram, wants to run a function. It speaks to the Convex client, which sends an HTTP request to a backend server. That speaks to the transaction engine; the transaction engine assigns a timestamp to the transaction and reads the source code for the function, which happens to be stored in the same database that's being modified. Then it spins up a V8 runtime to actually execute that function. If there are any calls to talk to the database, those speak back to the transaction engine, which communicates with the database and eventually commits it. I'll talk about these in more detail as we go forward.

So if we look at this query again, this is that original query I showed you, the listMessages query. This is a query that will show you basically all the messages in a chat channel. And this query could contain thousands of lines of code, but most importantly, whenever it calls db.whatever, that's a call that breaks out of V8 and issues instructions to our transaction coordinator. Now, obviously JavaScript isn't deterministic, right? You can read the system time in JavaScript. You can access random numbers. There are types like WeakRef, which can externalize garbage collection state. So we have to fix this. Time is pretty easy to fix: we just pick a wall clock time for every transaction and patch any calls to the system time to return that wall clock value. We do a similar thing for random numbers. And then there are some things we just can't support; we don't support WeakRef, for example. But in general, we support a pretty comprehensive subset of JavaScript, and it hasn't limited the customers we work with in terms of what they can express. So I'll pause here real quick before I get into the transaction model, if there are any questions. I'll keep powering through.

As far as the transaction model goes, it's fairly standard multi-version concurrency control, right? This is the kind of stuff you'd see in Calvin or FoundationDB or probably Percolator, and basically any database that's not based on Spanner or two-phase commit and two-phase locking. You have a timestamp oracle that chooses the timestamp for the transaction. That fixes a snapshot of the database, and we run the transaction on that snapshot. And we have copy-on-write multi-version data structures that allow us to do this very efficiently.
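Going back to the determinism point for a second: a very rough sketch of the pin-the-wall-clock-and-random-seed-per-transaction idea might look like the code below. This is purely illustrative, not Convex's implementation, and a real version would have to patch far more surface area inside the isolate (the Date constructor, and so on):

```ts
// A rough sketch (not Convex's actual implementation) of pinning the
// nondeterministic bits of the JavaScript environment before a transaction runs.

interface Patched {
  restore: () => void;
}

function pinEnvironment(txnTimestampMs: number, seed: number): Patched {
  const realNow = Date.now;
  const realRandom = Math.random;

  // Every call to Date.now() inside the transaction sees the same pinned timestamp.
  Date.now = () => txnTimestampMs;

  // Every call to Math.random() draws from a PRNG seeded per transaction,
  // so re-executing the function reproduces the exact same values.
  let state = (seed >>> 0) || 0x9e3779b9; // avoid the all-zero state
  Math.random = () => {
    // xorshift32: deterministic and good enough for illustration.
    state ^= state << 13;
    state >>>= 0;
    state ^= state >>> 17;
    state ^= state << 5;
    state >>>= 0;
    return state / 0x100000000; // map to [0, 1)
  };

  return {
    restore: () => {
      Date.now = realNow;
      Math.random = realRandom;
    },
  };
}

// Usage sketch: pin the environment, run the transaction function, restore.
async function runDeterministically<T>(
  timestampMs: number,
  seed: number,
  fn: () => Promise<T>
): Promise<T> {
  const patched = pinEnvironment(timestampMs, seed);
  try {
    return await fn();
  } finally {
    patched.restore();
  }
}
```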
While a transaction is running against its snapshot, we track the read set of all the data it has observed, including range scans and index predicates, so we don't have any phantom reads. Then, when the transaction is finished, we check whether we can commit it at the latest timestamp. We know a transaction can commit if no concurrent writes have touched anything in its read set. If so, we reassign the timestamp to the latest time, commit the transaction, and return the result; otherwise we retry. I've gone through this very fast. We're not necessarily innovating in this space, and you can read more about it: go read, say, how FoundationDB works or how Percolator works, et cetera.

Probably the most interesting thing about the transaction model is that we rely very heavily on these read sets. Read sets store all the data you've observed in a transaction. They include both point reads and range scans on tables and indexes. If you don't track range scans, you might have phantom reads, where an item gets inserted into the middle of a table scan you've observed. Read sets are the basis for transaction coordination, for deciding whether a transaction can commit, but they're also used as the basis for caching and subscriptions, which we'll talk about soon. So they need to be very fast, and we've designed a couple of data structures to make this efficient. Read sets are stored in a data structure called a range set. A range set is similar to an interval tree: it stores a bunch of ranges and gives you a very fast intersection operation, to see whether a point or range intersects with it. If there's an intersection between two ranges, that's a conflict and we have to abort the transaction. We also need to perform the reverse query quite frequently: given a set of writes, give me all the current ranges they intersect with. We use a structure we're calling a range map for that, but that involves talking about subscriptions.

So I'm going to get into what subscriptions are. One of the stronger features in our system, I think, is the subscription functionality; we've found it really quite powerful in developing apps. Often when you're writing a dynamic web app, you want to re-render elements on the screen when the data changes, when the query has new results to show. So if we look at our old listMessages example, this is a query that would implement something like a Slack clone or a chat app. It's basically saying, give me all the messages in this channel. This is all well and good, but new messages are going to show up all the time, which means typically you need to keep polling this function over and over. That's inefficient, it's slow, it puts excess load on the database, and it's complexity the developer doesn't want to think about. With Convex, you can use subscriptions to do this automatically. You render some elements in the DOM, say all your chat messages, on the basis of a query. Then if any data changes that invalidates that query, meaning any new chat message gets posted, that will invalidate the subscription and we'll redraw those elements on the page dynamically.

So this is the code for a chat app in Convex. This is actual real code, and it's actually mostly one line; it's mostly this useQuery line. What this is doing is binding the listMessages function, that function from before, which is like the stored procedure.
It's binding the output of listMessages to a variable called messages, and any time messages changes, React will re-render the chat view, which will render a whole bunch of new messages for you. We do this over a WebSocket, and I'll show you basically how it works. Subscriptions leverage our read sets, which is why they need to be very efficient. Every time you run a query in the system, we end up with a read set and a timestamp for that query. We serialize and encrypt them into a token, making sure we don't leak any private state. The client library gets a token back for this read transaction. The client can then subscribe to that token: basically, you send the token to the service saying, I want to subscribe on this token, which opens a WebSocket. Then, if anything changes that invalidates the results of that query, we can use this WebSocket to send the new messages down.

Internally, we store all the active subscriptions. Every active subscription, every WebSocket, is stored in the range map. Every time a new write commits in the system, we check all the active subscriptions to see if there's a conflict, and we try to do this very efficiently. A conflict means a write intersects with one of the read sets for an active subscription. If so, we invalidate the subscription, rerun the transaction, and send the new results down. So that will give you all the new chat messages automatically. We actually rerun the full query each time; this is a bit different to, say, Materialize, which does a lot of work with partial results. There's nothing stopping us from using partial results here; it's just not something we've implemented yet. Yeah, go for it, Andy.

Maybe you said this and I missed it. You say you're doing complex transactions, but the example you keep showing is like one query, on one table, right? Yes. How would you be able to take the output of one query and put it through another query, again in the context of a single transaction? Yeah, this is probably a bad example, I just picked a bad example. We have tons of these. Basically, the JavaScript code could say, look up a user. So it takes an email address and says, user equals db.table("users").filter(email equals blah).first(). You get the user ID, then you make your subsequent requests. So basically this is full JavaScript, right? You can go look up a user, look up the channels they're a member of, look up the messages in those channels. Okay, and so basically the function is what's getting shipped over to the server, and that's the transaction, not the db.table call? Yeah, thank you for asking the question. The entire function source code is what runs on the server. All right. Yeah, that's a very, very good point. You might have a very big library, you know; basically, if you're running a more complex app, especially if it has access control, the first thing you typically do is look up a user, look up what they're a member of, et cetera. You have a bunch of calls to the database, and the entire function runs server side in the isolate. Only when it calls DB stuff do we break out of V8 and go back to our Rust code, basically. So you're running all this code, and the DB object is basically like a little escape hatch out to the transaction engine, which talks to the database and then returns the result back to JavaScript. Okay. Yeah, that's a great question. Thanks, Andy.
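To flesh that answer out, a multi-step transaction function of the kind just described might look roughly like this, using the same hypothetical db.table(...) API shape as the earlier sketch (simplified; not the exact Convex API):

```ts
// Hypothetical, simplified typings (same shape as the earlier sketch).
interface Doc {
  _id: string;
  _creationTime: number;
  [field: string]: unknown;
}
interface TableQuery {
  filter(predicate: (doc: Doc) => boolean): TableQuery;
  first(): Promise<Doc | null>;
  collect(): Promise<Doc[]>;
}
interface Db {
  table(name: string): TableQuery;
}

// One transaction: look up a user, find their channels, pull the messages.
// The whole function runs server-side in the isolate; each db call breaks out
// to the transaction engine, so there are no client round trips in between.
export default async function recentMessagesForUser(db: Db, email: string) {
  // 1. Look up the user by email.
  const user = await db.table("users").filter((u) => u.email === email).first();
  if (!user) return [];

  // 2. Find the channels this user is a member of.
  const memberships = await db
    .table("memberships")
    .filter((m) => m.userId === user._id)
    .collect();

  // 3. Pull the messages for each of those channels, with ordinary JavaScript loops.
  const messages: Doc[] = [];
  for (const membership of memberships) {
    const inChannel = await db
      .table("messages")
      .filter((msg) => msg.channelId === membership.channelId)
      .collect();
    messages.push(...inChannel);
  }

  // 4. Post-process in plain code before returning to the client.
  return messages
    .sort((a, b) => b._creationTime - a._creationTime)
    .slice(0, 50);
}
```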
So, caching. I mentioned earlier that it's really important that caching works well, and this is becoming a really big selling point; it actually surprised me how big a selling point it is. If you look at all these services, Netlify, Vercel, unsurprisingly Cloudflare, they're all pushing the selling point of CDNs and edge computing, getting reads close to your users. There's a lot of demand for really high-performance read queries. And fortunately, since we control the whole transaction protocol, we can actually do a great job of consistent caching. Right now we have caching on the backend; it's not a big job to add caching on edge proxies as well, and we can leverage subscriptions to do so.

So basically how caching works is, you know, we don't use memcached, but you can imagine something like this: for every function request that comes in, we run the query and dump the result in an LRU cache, mapping the function name and the function inputs to the result and the token. Then when a query comes in, we look in the cache and see, hey, is there a cached result for that function and the same inputs? We get the token back with the result, and we check whether that token is still valid, that is, whether that read set has been invalidated. If it hasn't been invalidated, the cached result is correct and we can return it to the user. If not, we just run the query.

Now, there are a few complications with caching. For example, we have to expire results that aren't timely from a wall clock perspective. The example query I like to use here is a query to give me all the overdue library books. I didn't write the overdue-library-book query out, but it looks something like: hey, give me all the books in this table where the return date is less than the current system time. If we just cache that query by default, it will never get invalidated, right? Because it's pinned at a certain wall clock time, the time the query was executed. So we have time-based invalidation as well: we throw away old results if they're too old, older than a few seconds. But we only do so if the query actually accessed the system time; most queries don't access system time, so we can cache them forever as long as the read set is valid. Since we're running the transaction ourselves, we can check whether the query accessed system time, and only if it did do we limit how long it stays in the cache.

I wanted to use caching as an example to make a point about composable abstractions; this is going to be a bit philosophical. After a bit of refactoring in the code base, caching took me about a day to implement. Why is that? I'll tell you right now, I'm not a super fast coder, right? And this caching is quite sophisticated, it's transactionally correct caching, and it's the kind of project that would often take a very long time to implement in most databases. The reason I think it was easy is because, I like to think, we've come up with a set of highly composable abstractions. Composable abstractions are ones that fit together cleanly, where you can add new functionality without adding complexity, and oftentimes you're surprised that things just work. One example of a very composable abstraction is that we store the metadata for the database in the database itself. The source code, the JavaScript source code for all these stored procedures, gets stored in the database, in a table.
It's a special system table, but it's in the regular database. So when a function executes, the first thing it has to do is read its source code, which means the source code ends up in the read set for that transaction. Now, one concern you might raise with caching is that you can change these stored procedures all the time. You can upgrade them; they might have a long chain of dependencies and a dependency gets upgraded. So you might worry: what happens if we return a cached result that was generated by an old version of the code? Ordinarily you'd build a lot of checks in there; you'd have code version numbers and you'd make sure you're running the single latest version. But in this case we don't do that at all, right? Because if the code has been upgraded since that cached result was generated, it will be a read set conflict, because the first thing the function did was read its own source code. So an abstraction like this allows us to eliminate a lot of complexity. There are countless examples in the system where just that little decision, to store our own metadata in our own system, has made implementing the system a lot simpler.

Now I see a chat message coming through. Steven, do you want to read out the question yourself? Sure. One of the key things you mention is that the system ships the JavaScript source code to run on the server side. And kind of a recent problem in the npm space is that there's a lot of package squatting, namespace squatting. If one of your customers' packages is infected and they say, oh, shit, a developer snuck in a bad, infected package, of course now they have a security vulnerability, and maybe they have to revoke a package. Like, how much damage could that do in your systems?

Yeah, I mean, there are a few angles to that question. So the context is that packaging in JavaScript is really annoying, and npm is quite a pain to work with. The first step in answering that question is that we have a bundler. When you write a function, you might have all these dependencies, and they might refer to different versions of packages that are pulled in by npm, and sometimes the same code might talk to two different versions of a dependency in the same function somehow. So we have a bundler, a webpack-ish thing, which bundles up the function and ships it to the server. Then we run the code in an isolated environment, so it can't touch data that belongs to other customers and it can't break out of access control. I'm not going to talk about access control in this talk, but we do have access control over this. As to whether you could write bad code and have it do the wrong thing to your database: yes, that can happen, just like you can write a bad SQL query that has a drop-table statement in it. So the key here is having an actual development process, like a staged rollout process where you can actually test these things before they roll out. I don't cover it in this talk, but we also have a development workflow where you push things first to a staging instance, and you can run it on a fork of the data and make sure it's correct. But essentially, yes, if you have a bad query, you can do bad things to your data. This is a tension you bring up between expressivity and safety: the more expressive your query language is, the more things it can do. Hopefully that answers the question. Cool.
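Tying together the caching flow described a couple of minutes ago, the lookup path can be sketched roughly as below. The names and structure here are hypothetical; the real thing is an LRU keyed the same way, guarded by the read-set token machinery:

```ts
// Hypothetical sketch of the consistent-caching flow: results are keyed by
// (function, arguments) and guarded by the read-set token, so a cached result
// is only served while its read set remains un-invalidated.

interface CacheEntry {
  result: unknown;
  token: string;           // serialized (read set, timestamp) from the original run
  usedSystemTime: boolean;  // queries that read the clock also get wall-clock expiry
  cachedAtMs: number;
}

const MAX_WALL_CLOCK_AGE_MS = 5_000;
const cache = new Map<string, CacheEntry>(); // the real thing would be an LRU

async function cachedQuery(
  name: string,
  args: unknown[],
  run: () => Promise<{ result: unknown; token: string; usedSystemTime: boolean }>,
  tokenStillValid: (token: string) => Promise<boolean>
): Promise<unknown> {
  const key = `${name}:${JSON.stringify(args)}`;
  const hit = cache.get(key);

  if (hit) {
    // Wall-clock check, only needed for queries that actually read the system time
    // (think "give me all the overdue library books").
    const fresh =
      !hit.usedSystemTime || Date.now() - hit.cachedAtMs < MAX_WALL_CLOCK_AGE_MS;
    // Transactional check: has any committed write intersected this entry's read set?
    if (fresh && (await tokenStillValid(hit.token))) {
      return hit.result;
    }
    cache.delete(key);
  }

  // Miss or invalidated: execute the function and cache the new result and token.
  const { result, token, usedSystemTime } = await run();
  cache.set(key, { result, token, usedSystemTime, cachedAtMs: Date.now() });
  return result;
}
```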
So one principle we care about a lot in the system is incrementality. This is the ability to start with something simple and approachable and then extend it over time into something more sophisticated, in a way that doesn't feel hacky. So let's talk about schemas. Generally, there are two approaches to schemas. There's the Mongo and Firebase approach, which is to kind of dump a bunch of JSON in a table or a collection. This works great at first; it gets a lot of traction with developers because it's easy to use, but oftentimes later on they regret the lack of schema enforcement. Then you have the more traditional database approach, where you define your schema before you get started. But oftentimes, before you get started, you don't even know what schema you need, and that's not very compatible with early-stage product development.

So we support both in our system. You can just dump your JSON into the system, but everything's actually fully typed internally, and then you can codify the schemas incrementally. When you write data to the system, we track each row you insert and we compute the union type of all the data you have in a given table. We have this phrase, the least upper bound in the type lattice. That sounds very complicated, and I guess it is, but basically we compute the simplest possible union type that can represent the data in the table. Here's a very simple example, a very, very simple example, because schemas can be more complex than this, and this is just basic pseudocode. First, you insert an object into the system. It's an object with one field called name, and it has the value "hello". So the type for that table in the system is an object mapping name to a string. Then you insert another object, and it has a name field which maps to an int, seven. So our automatic schema tracking now says the schema for that table is the union type of int or string. And then if you go and delete the original row, there are no longer any strings in that table, and the type changes to just int. This gives us a lot of flexibility: you start writing data to the system, and you can see at any time the union type that represents what's actually in your table. I'm actually not going to tell you how we do that; I'm going to keep it as, like, magic secret sauce. Maybe I can talk more about it later on.

So, schema migrations. The data is all typed, and if you actually want to enforce schema rules, you can just go and codify the schema. You say, yeah, I actually want this to be the schema, you codify it, and we'll enforce it. But what if you want to change your schema? Schema migrations are a huge pain in most databases. Online schema migration usually doesn't work; in my experience, at least, it didn't in, say, MySQL. Generally what you do, even in very large companies, if you want to make a change to a schema, even something simple like changing a table name or dropping a column, is you make an entire clone of the table, which costs a lot of money because you need hardware to do it. You make an offline change to the schema, you sync it up to date with the latest data, and then you promote the clone to the new primary. This can take months for sharded databases; I've seen it take quarters. Schema migration is a huge problem. It's clearly a problem, because the PlanetScale folks, who have a lot of experience running databases, make it one of their biggest selling points. The PlanetScale onboarding, their onboarding docs, is really quite good.
And one of the big pitches is that you can fork the schema and test schema migrations, et cetera. In Convex, schema migrations are a bit different: you just define a migration function and you're done. You write a function, for example, that takes a union type of int and string and turns it into the equivalent string, and that's actually the schema migration. We run it in real time. There's actually no reason why this has to be run in the background; there's no reason to go over the entire database and process all the data up front. Everything's stored in a consistent serial log, we can track which data has been migrated, and if you try to read data that hasn't been migrated, we can translate it just in time. We can do this in real time. There are complexities there around maintaining multiple indexes, et cetera, but having that type information makes it very easy for us to do schema migrations.

I'm just going to pause; I think there's a question in the chat. No, we're all good. So your schema is just types? You can't do not-nulls, constraints, other things? Not-nulls, schema enforcement? Yeah, so basically the schema will track whether there are nulls. Internally, we have two concepts, shapes and schemas. The union type is actually a shape, so we understand all the data that's been put in there. If you have two objects, one that maps name to a value and another that maps address to a value, name is going to be string or null and address is going to be string or null. Then, when you do the enforcement, yes, you can make it not-null, basically. But what I've been talking about really is shapes, which is type understanding. Once you have that type understanding, you can run predicates against it and enforce predicates.

I guess as a follow-up question: the way you've been describing schemas so far is that you directly infer them from the data. Typically the way we think about schemas is that you define a schema and the database enforces the structure of the data. So is there a reason why you guys are choosing this bottom-up approach? I mean, there's really no reason why we couldn't add that as a feature if you wanted. There's no reason why we couldn't just allow you to define your schema a priori; it's days of work, probably. It's a deliberate design decision to start in an incremental fashion. When you're building on our system, typically what you're doing is writing software in JavaScript; you're writing functions, and your schema, when you develop it, is codified not in the database but in the TypeScript types that you're using. So you have a chat message, and there's a message type in TypeScript. You can write that message type to the server, and we'll basically infer the shape based on what you've uploaded, and you can codify that. There's nothing preventing us from also then saying, if you want to go and actually pre-define the schema a priori, there's no problem. It's just that in our experience the demands of developers are normally the other way around: they just want to define their types locally as TypeScript types and then have the database eventually allow them to codify them. Is that covering everything, Eugene? Thanks. Cool.

So, scaling out the system is something I'm not going to talk about at all today, because there are fairly standard ways of doing it, and again you can go look at FoundationDB or Calvin, Fauna, Percolator, how those systems scale out.
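The real shape-tracking implementation stays secret sauce, but a deliberately naive illustration of the idea, recomputing a table's union-type shape from its live rows, looks something like the sketch below. A real system would do this incrementally and handle nested values, optional fields, deletes, and so on:

```ts
// A deliberately naive illustration (NOT Convex's implementation) of tracking a
// table's "shape" as the union of primitive types seen for each field.

type Primitive = "string" | "number" | "boolean" | "null";

// Shape: field name -> set of types observed for that field across live rows.
type Shape = Map<string, Set<Primitive>>;

function typeOf(v: unknown): Primitive {
  if (v === null) return "null";
  if (typeof v === "string") return "string";
  if (typeof v === "number") return "number";
  if (typeof v === "boolean") return "boolean";
  throw new Error("nested values are left out of this sketch");
}

// Recompute the shape from scratch; the hard part kept secret is doing this
// incrementally and efficiently as rows are inserted and deleted.
function shapeOfRows(rows: Record<string, unknown>[]): Shape {
  const shape: Shape = new Map();
  for (const row of rows) {
    for (const [field, value] of Object.entries(row)) {
      if (!shape.has(field)) shape.set(field, new Set());
      shape.get(field)!.add(typeOf(value));
    }
  }
  return shape;
}

// Mirrors the example from the talk:
shapeOfRows([{ name: "hello" }]);              // name -> { string }
shapeOfRows([{ name: "hello" }, { name: 7 }]); // name -> { string, number }
shapeOfRows([{ name: 7 }]);                    // after deleting the first row: { number }

// And a migration function of the kind described is just ordinary code over the union:
const migrateName = (name: string | number): string => String(name);
```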
The standard approach to scaling out is to have a centralized timestamp oracle: basically, your serialization path has to be semi-centralized. Typically, if all that centralized service is doing is assigning timestamps and checking read sets, you can support almost any customer load on a single replicated machine, and you can scale that as needed. It's extremely rare that a customer can outstrip the timestamp-assignment bandwidth of a single replicated cluster. Once your critical path is doing just the actual transaction coordination, everything else can be farmed out. You can put your caching and indexing servers on SSD hosts. You can move the execution close to the data, and the underlying data can then sit on an object store, with the latest versions cached on SSDs. Ultimately, not too many customers need a giant distributed database, and if they do need one, oftentimes it's for analytical use cases. Which brings us to this last primitive, the escape hatch.

We don't want to replace every analytical tool or business intelligence integration, at least for now. It's far easier to just play nice with these systems and allow them to access your underlying data. It's also generally a pretty bad idea to mix OLTP and OLAP workloads: they have very different access characteristics, and it complicates buffer sizing and resource allocation. And that's okay. We don't have to solve all the problems at once, because there's an object store underneath that serves as the escape hatch. Basically, the bottom of the system is a column store. All the clever fast stuff happens in our implementation, on SSDs and expensive machines, but at the end of the day the data is just dumped on an object store. We're totally happy for folks to point their analytics cluster at it, or use a SQL interface to query it. Now I've got a section here...

This is the live data? Like, the live data is in a column store, or do you have an export function? It's flushed, yeah, asynchronously. And is it using Parquet files, or what format? Yeah, Parquet. We're actually messing around with that right now. In the past we've, you know, used formats other than Parquet as well. I have a question: is your database launching with change data capture support, or are you always going to have to do a batch-based export, or are you just using timestamps to micro-batch the incremental changes back into the object store? We're launching without any of that. And just for context, I mentioned this at the very start, but we have a bunch of features that are going to roll out in stages as time goes on. The initial release will have the JavaScript query engine, will have caching, will have subscriptions, and has type support, but it does not expose our codification of schemas yet, so it has shape support, and it doesn't expose the object store. Those are things we'll stage out as time goes on, because we're a startup and we're gaining experience about what resonates. We want to launch these features incrementally, get feedback on what people really like, and make changes as we go. So the answer is, I don't know. Yeah, I don't know. And when do you think that interface, that feature, will be supported? We had it; we believe that it'll go back in.
In terms of interfaces, we also had a GraphQL interface for this, and that's an interesting aside. When we started working on this, we thought there'd be a lot of demand for GraphQL as a query language, and so we had GraphQL endpoints that allowed you, as an alternative to a JavaScript function, to just use GraphQL. Our experience has been, and probably everyone knows this, that GraphQL is a nice, convenient query language, but it's a pretty clumsy language for writing mutations, and it's a little bit tricky to write efficient queries in GraphQL. And our experience has been that there has not been a lot of customer demand so far for GraphQL, so that stuff has been pulled out for now; we're going to launch without it. But yeah, these are things we can layer back onto the system. The SQL interface you had before, is that, like, homegrown, or did you put a Postgres foreign data wrapper in front of something, or was it native? It was an off-the-shelf wrapper at the time; we had a translation layer, and I forget which one we used, because my co-founder Sujay implemented that part. I totally forget what the layer was. I can follow up; it's not urgent. Cool.

So now we're at kind of a turning point in the talk, because I have a section on testing which I think is quite interesting, but I could give a whole other talk on it, and we've got about ten minutes left, right, Andy? So maybe it makes sense to hold the questions. I could talk more about other things, or I'll actually skip forward to testing. I mean, I'm very curious. Okay. Okay, so let's power through testing, and I'll try to cover what we can. You have, like, 12 minutes. Twelve minutes, no problem. We got this.

I actually put this section in because I know Andy, and I knew Andy likes testing. And it's quite relevant to us, because if you want an abstraction to hold, it has to be sound, it has to be correct. If your system is not correct, all of a sudden the developer has to understand what goes on inside. You know, ask me how many Dropbox developers know the internals of MySQL. It's a lot, right? Because there are weird bugs and stuff, and you have to learn how these things work. We want this to be a very sound abstraction. I've given a lot of talks on testing, and I was heavily involved in it as an engineer, and I typically believe the best approach is multiple layers of testing. I'll talk about all three layers.

The first is algorithmic testing. Algorithmic testing, yes, it's unit testing, but there are also some more sophisticated techniques for evaluating the correctness of algorithms and data structures within your code. Human beings are pretty bad at writing test cases, partly because they often bake their own oversights into the tests themselves. So the tests will pass, but they're only testing the things that you expected. There's a system called QuickCheck. QuickCheck, can you hear me still, Andy? QuickCheck came out of the Haskell community, I think back in 2000. A lot of our work is actually influenced by Haskell and determinism. QuickCheck is basically fuzz testing: you define properties for a function that have to be upheld, and you tell the testing library how to construct random inputs for that function. Then it runs the function over millions of inputs to see if it can trigger any bugs. So here's a simple but very powerful example.
This is copied and pasted out of our code base, so this is real code. The value type in our system is this type structure that can be very nested and quite complex. This is a test that a value can be serialized to JSON and then deserialized back from JSON, and that it gives the same result back. So this test says: take the value, serialize it, deserialize it, and make sure it equals the original value. This might seem like a very straightforward test, but we've actually used it to find bugs, because serialization is quite a complicated task; especially with stuff like floating point numbers and NaNs, it gets quite difficult. So this has been a very useful test, and it's basically one line of code.

I'm glossing over an important detail here, which is that you can tell QuickCheck, hey, run this on a whole bunch of inputs, and it will tell you whether it works or not, but you have to tell QuickCheck how to construct a random input. That's this function called arbitrary here; this is the function signature for arbitrary. And there's quite an art to implementing arbitrary, as you might expect. Arbitrary has to implement some kind of random distribution of values for your inputs, but that random distribution should focus more on the corner cases. If you're testing floating point numbers, you probably want to test more numbers around zero, or negative numbers, or NaNs, or numbers at the extremes. So there's a bit of art and thought that has to go into designing arbitrary implementations.

Here's kind of a pro tip we've found: when you're implementing a structure like, say, the range set, or like our really clever, efficient lexicographical ordering of types, which is very fast but is actually quite a complex algorithm, what we actually do is first write a really brain-dead version of the algorithm, a very simple version. We test this and make sure we believe it works. Then we use QuickCheck to validate that the simple version of the algorithm and the complex algorithm give the same result. That has been a very powerful technique. The third thing I'll say is that these libraries give support for minimization: generally, if you have a complex failure case, they'll automatically whittle the input down to try to find the minimum possible reproduction of the error case. It's an exponential-time process, but we find it yields good results.

So this is maybe the more interesting part, an area where we're innovating a bit more. We have a system called Pedent, which is basically the integration-test version of QuickCheck. You need to do end-to-end testing of your system, so Pedent runs pathological workloads on the database and ensures the execution of those workloads is correct. But life is a lot harder in integration testing with distributed systems, because you have to introduce errors, such as a node failure or a network drop, and you also have to examine all the interleavings of execution threads and RPCs in your system, which is something that almost no one does. The FoundationDB team did a great job of this, and they wrote a lot about it; they actually have a startup on testing right now. At Dropbox, we did a lot of work on this in the desktop client, and there's a great blog post you can read called Testing Sync at Dropbox.
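Their property tests are in Rust, but the same style can be sketched in TypeScript with a library like fast-check. This example mirrors the brain-dead-version-versus-clever-version pro tip, using a toy range-intersection check of the sort a range set needs (hypothetical code, not from the Convex code base):

```ts
import fc from "fast-check";

// Half-open ranges [lo, hi), the kind of thing a read-set range structure stores.
type Range = { lo: number; hi: number };

// Brain-dead reference implementation: linear scan.
function intersectsNaive(ranges: Range[], lo: number, hi: number): boolean {
  return ranges.some((r) => r.lo < hi && lo < r.hi);
}

// "Clever" implementation: sort by lo and stop as soon as ranges start past the query.
// (A stand-in for the real interval-tree-style structure, which is more involved.)
function intersectsSorted(ranges: Range[], lo: number, hi: number): boolean {
  const sorted = [...ranges].sort((a, b) => a.lo - b.lo);
  for (const r of sorted) {
    if (r.lo >= hi) return false; // everything after this also starts past the query
    if (r.hi > lo) return true;   // overlap found
  }
  return false;
}

// Arbitrary that produces random non-empty ranges.
const range = fc
  .tuple(fc.integer({ min: 0, max: 1000 }), fc.integer({ min: 1, max: 50 }))
  .map(([lo, len]) => ({ lo, hi: lo + len }));

// Property: the two implementations always agree, across many random inputs.
fc.assert(
  fc.property(fc.array(range, { maxLength: 20 }), range, (ranges, q) =>
    intersectsNaive(ranges, q.lo, q.hi) === intersectsSorted(ranges, q.lo, q.hi)
  )
);
```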
The reason this kind of testing is both so powerful and why almost no one does it is that it's very hard to do, almost impossible, unless you've built your implementation around the technique. What you need is what we're calling a virtualized runtime. What we want to do is basically allow threads and tasks to interleave deterministically in tests. The runtime is a structure that we pass around our code base. Any time you want to access the system time, you have to ask the runtime for it. In production, it will give you the real system time; in tests, it will give you a fake time that you can manually increment. Every time you want to spawn an execution task, you have to call the runtime to do so. In production, it will spawn a future and run it; in tests, it basically allows you to manually adjust the schedule of these tasks. You run into a lot of complexities here. V8, for example, is one complexity, because V8 expects to run in its own execution thread, and you have to have time slices to give V8 to run within. But this is a really powerful technique. It allows us to validate the correctness of the distributed system, basically, but it has to be built into the design of the system. You have to invest in it a priori; you can't take a database and slap this on afterwards. It's too difficult.

The last thing in testing is production validation, something I'm kind of most passionate about and that I think a lot of people don't care about that much. At the end of the day, we're selling an abstraction, and that abstraction is: if you put data in your system and you query it back, you get the correct results. That abstraction breaks down if the data is not correct. Anyone who's spent a lot of time with large databases will tell you that there's a lot of complexity to deal with, a lot of anomalies, a lot of incorrect data. The real dirty truth is that most systems you interact with on the internet have bad data in them. They have bad data because of bit flips or CPU logic errors or bugs or some old migration from half a decade ago. And there are some systems that don't. I can tell you Magic Pocket, the storage system at Dropbox, does not have any corrupt data in it whatsoever. And the systems that don't are the ones where engineers have invested in a very, very thorough validation process that continually scans over the data in production and makes sure that it's correct. This is interesting from an operational perspective, but also from a database design perspective, because it means it's important to design systems that are easily validated. Systems built on a linearizable, serialized log, a write-ahead log, are a good fit for validation, because you can walk over the log and make sure the data is correct, that it looks like what you expect it to be. Systems based on eventual consistency or a distributed hash table are typically a poor fit for production validation, because it's very hard to determine a basis for what is correct in the system. This is a whole hour-long talk I could give on its own; there are three things I've listed here, a blog post, a talk, and a book chapter, and if you're an operational person who runs databases in production, I'd encourage you to read them.

So that's Convex. It's a database for building stuff, and it's launching very soon, in a couple of weeks.
It supports native JavaScript transactions, built-in caching, automatic subscriptions, schema migrations, and an incrementally deployable feature set. But most importantly, we believe it makes databases more usable for developers, because it provides them the abstractions that we think they need in order to ignore what's going on inside the database. And that's that.

I will clap on behalf of everyone else. We have time for a few questions. So, Steven has a question in the chat, but let me open it up to anybody else before we go to Steven's question. Anybody else have a question? Okay, Steven, go for it. Is Convex ready for the Jepsen test yet? Not yet. I mean, we'd be totally happy to do so. I think it's too soon; it's not relevant yet because we have a lot more features to come, and, you know, we could have a third party run tests, but we'd have to do it again anyway as we add more features. But yeah, I think some of the testing we do is a little bit similar, and maybe it's more comprehensive, because it's much easier to test a system that's built around testing than a system that's not.

What does the backend data store look like for this one? You mentioned Rust; like, is it 100% from scratch? I mean, did you write your own storage manager? And what does the backend server sort of look like? Most of the work is on the transaction coordinator, and that's in Rust. The prototype version is actually just using Postgres on the backend, but we're replacing that eventually. But yeah, you can think of it this way: the secret sauce, the magic, is mostly in the transaction side, the read sets, the transaction coordination, the multi-version concurrency control. We've all built this stuff before, meaning we've built object stores and things like that before, and we don't really want to do that again if we can avoid it. So it's unlikely we're going to build a distributed storage system, right? We will build a distributed caching system and a correct transaction coordination system. But yes, ultimately we plan on letting the customer decide whether the object store goes in S3 or Google Cloud Storage, and we'll use as much off-the-shelf database table management stuff as we can.

And does that also mean, like, again, it's serverless, so the customer doesn't see that they're provisioning machines, but you have to run it somewhere, so, like... Yes, we do provision live. Right now we can get a backend up in a couple of seconds, and I'm sure we can make it quite a bit faster. In fact, what happens is we have a development flow; we have a CLI called the Convex client. You run npx convex init to create a new database instance and convex push to push your source code to the backend. When you call convex init, it provisions the database live, and we can do that really quite fast. Most of the delay is in stuff like setting up DNS records; spinning up the backend itself is actually really quite fast. It sounds like right now you assume that one copy equals one node, and therefore you don't have to run distributed transactions spanning nodes. I mean, you're set up to support it, but it sounds like you're not doing that now. Yes, yeah, exactly. Okay, very good. And my last question: so, with Rust, and again, you're only talking about the transaction layer in Rust, but you didn't choose Go, and a lot of, like, the CockroachDB folks, those guys are using Go.
So I guess, can you talk a little bit about what went into that decision? I actually wrote a blog post on this, Andy, like a couple of days ago: why Rust. We might have talked about this in the past, but interestingly I'm more of a Go developer. Most of Magic Pocket was written in Go; the storage engine was written in Rust, and the desktop client rewrite was written in Rust. My co-founders are kind of the Rust pros. I think what has been really powerful about Rust is a few things. One is very tight control over runtime overhead, so knowing exactly how many resources we're allocating has been a huge win. The type system has been a massive, massive assistance. One thing I wrote in this blog post is that what we've found in the past is that when we've prototyped systems in Go, we've often written a prototype and thrown it away and started again once we knew what we were doing. In Rust, we've found that we're able to refactor our prototype into the system that we want, because refactoring in Rust is just so much easier than in Go. Refactoring in Rust is off-the-charts powerful compared to a language without that strong a type system. Typically in Rust, you make a change and then you just follow the compile errors, and once the errors are fixed, it generally just works. Rust is also far more conducive to some of the testing strategies we talked about. It can be done in other languages, but it helps to have much more control, without a heavyweight built-in runtime; we have our own virtualized runtime. So, yes, we could have written this in Go, absolutely. I am a bit faster in Go, I would say, but the other folks are faster in Rust, and I certainly have no regrets; I think it's been great. The other thing I've found is that the libraries in Rust are really quite mature. Because it's a smaller community, they tend to be high quality, a more focused set of libraries, crates, as they call them in the Rust community. So Rust has been a great choice. If people are thinking about switching languages in their company, like switching from Go to Rust, you've got to be careful: there's a lot of cost that comes with that. You can go to the convex.dev blog, where I have a blog post about times you may not actually want to switch languages, and whether the overhead of doing so is too onerous.