The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org. Hi guys, let's get started. We're excited today to have Jake Moshenko. He's the co-founder of Authzed, the makers of SpiceDB. Part of the reason that I wanted to have this talk is because SpiceDB is a permissions database, which I know nothing about, but they care about linearizability and point-in-time queries, which I do care about. So we're super excited to have him talk about what they're working on. As always, if you have any questions for Jake as he's giving the talk, please unmute yourself, say who you are, and fire your question at him. And please feel free to do this at any time, because otherwise he's just talking to a vacuum by himself, and that sucks. So Jake, thanks so much for being with us.

Great. Thanks for having me. Hi everybody. I'm Jake, as you mentioned, and we are SpiceDB: a flexible permissions database for the internet era. A little bit about me. I am the CEO and co-founder of Authzed. We're building an implementation of Google Zanzibar; I'll cover what that is in a little bit. Before that, I worked at a bunch of cool places doing some pretty cool things. Most recently I was at Red Hat; I came to Red Hat via the CoreOS acquisition, and came to CoreOS via the acquisition of our last company, which was called Quay. I've been building developer tools and distributed systems for the past 15 years, and this is what we're working on now. So you may be asking yourself: why do I even need a specialized permissions database? Fair question. Prior to thinking about this problem more formally, this is kind of what people consider the state of the art.
So you've got a simple table, and it binds a user to some role, often scoped down to a document or an organization or some lower level. And then in your code, you'll just query that table and interpret it in the source code itself. So in the code example, I say: if this person is an owner or a reader, then allow them to load the document from storage. Some pitfalls of this, obviously: if the list of roles which allows you to read changes, you need to change your code. It's also harder to do things like bring in higher-level groupings or higher-level ordering.

Most people actually start with an ad hoc authorization system. They start with something like this: they throw a few relationships in the database and interpret them in their source code. As time goes on, they realize that they're doing this over and over again; they want to interpret the same relationships the same way in multiple places, or from multiple different pieces of code. So they extract that into a library. Once the library is built, you can compile it into any of the services that need to do the same type of work. After that, people come to the realization that they need to be able to ask questions about other things, potentially, or ask questions from multiple different microservices that might be built in different languages or might not have the data readily available, because in the library and ad hoc solutions, you need to be able to query the source of truth itself in order to make those decisions. So people end up with a network service. There are a couple of different examples of these out there today, but I'll leave that as an exercise for the reader.

Once you decide to make it a network service, you start collecting requirements. Well, what should this service do? It should be multi-model: it should be able to do RBAC just as easily as it can do point-to-point sharing, something you might see in Google Docs, Twitter, or YouTube.
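The ad hoc approach described above might look like this minimal sketch; the table contents, role names, and function names are all illustrative, not from SpiceDB:

```python
# A minimal sketch of the ad hoc approach: a role-binding table queried
# directly, with the role list interpreted in application code.

ROLE_BINDINGS = [
    # (user, role, document_id) -- stands in for a SQL table
    ("jill", "owner", "doc123"),
    ("joe", "reader", "doc123"),
]

def can_read(user: str, doc_id: str) -> bool:
    # The set of roles that implies "read" is hard-coded: if it ever
    # changes, this source code has to change too -- the pitfall noted above.
    allowed_roles = {"owner", "reader"}
    return any(u == user and d == doc_id and r in allowed_roles
               for (u, r, d) in ROLE_BINDINGS)
```

Every service that needs this decision either duplicates the query and the role list, or shares a library, which is exactly the progression the talk describes.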
You want it to work across multiple different applications. An example of this from Google: when you send a link to a document in an email and the recipient doesn't have access to that document, it will warn you and say the recipient doesn't have access, you might want to share with them first. You want to have visibility into who can do what on what kinds of resources. And of course you want it to be correct; any time you make a mistake in calculating permissions, it's obviously a security flaw. And you'd like it to be consistent, and I'll talk about consistency quite a lot.

I also identified that these are hyperscale requirements, or you may have heard the term enterprise requirements. Google themselves have billions of customers. They're doing millions of requests per second that all need to be protected with a permission system, over trillions of objects. They need to do it quickly; I'll get into just how quickly they do that. And they need to do it reliably. People think of 99.5% reliability as quote-unquote good, or table-stakes reliability, but at the scale of millions of requests per second, even 99.5% is 4 billion failures per day, which is obviously not an ideal user experience. And it's the same data replicated everywhere in the world. By default, if you look at cloud-provider IAM solutions, they're not regionalized. In AWS, for example, IAM is the only service where you don't pick a region before you set up your users and groups and roles and things.

So with these requirements in hand, Google set out to write a system called Zanzibar. In 2019 they published a paper about their work, and it lays out a blueprint for how to build a system that meets those requirements, both in scale and also in being multi-model and handling complex querying. So we're building SpiceDB.
Fundamentally, SpiceDB is an open source implementation of Zanzibar. Coming from a databases background, you might want to think of it as the Vitess for permissions: we're not actually doing the storage or indexing or querying per se that you might with a Postgres or MySQL, but we are building a layer that gives you this complex, rich data model on top and makes it easy and efficient to access the underlying data store.

So, diving right in, here's our data model. There's an overarching schema, which defines the types of objects that you can read and write, the relations (relations are how data can relate to other data), and permissions, which are how we interpret that data. And then the data itself includes objects; objects are like nouns, everything is an object. We have resources, which are nouns used on the left-hand side of a relationship (I'll give an example of that), subjects, which are objects that are usually users, and then we have the relationships themselves, which are how these things relate to one another. The schema you can think of like a relational database schema: it defines the table structure, the holes where you can define these relationships, and then the data is all relationships itself.

Objects, like I mentioned, are any entity in the system. They have a unique identifier prefixed with an object type. So users are objects, documents can be objects, groups can be objects, videos can be objects, et cetera. You can think of an object as a node in a graph. In our schema, this is how you lay out your different object types. So this schema has users, organizations, and the fundamental thing that we're trying to protect, which would be documents. Relations are the way that objects relate to one another in a relationship. An example of a relation: someone could be a member of a group, or you could be an editor of a document.
You could keep track of who the uploader of a video was, and all of those relationships can combine to give you a sense of who is allowed to have permission to what. This is how we add the relations to our schema. In this particular case we have an administrator of the organization; a document has an organization that it's a part of; it has an owner, who is a user; and it has a reader, who is also of type user.

And then finally we get to the relationships themselves. Relationships are single pieces of data which express the way one object relates to another object. In Zanzibar, and in our implementation, they're written as "resource has relation to subject", but a human might read it as: user Jill is an owner of the document with identifier 123. A subject is when an object is used on the right-hand side of a relationship. It's usually used to represent a user, but it can also represent things like a group, or a server, or an individual machine if you're trying to do something like CI workloads: anything that you're pointing to with these relationships. The IDs that you bind your subjects to will usually come from an identity provider. The user identifier below is actually my Gaia ID within Google. It may or may not be the real one; I don't know if I went back and changed it.

Resources can also relate to other resources. This is where you get the idea of nested groups, or folders within folders, those kinds of things. In this particular example we have a group called engineering, and group security is a member of group engineering. But we can also point to relations on other resources. This is a similar expression to the one we just had, but in this one we're saying that members of the group leadership (and members will often be users) are considered managers of the group engineering. Okay, this is all just kind of a primer.
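The "resource has relation to subject" shape described above can be modeled as plain tuples; this is an illustrative sketch, not SpiceDB's wire format, and the type and ID names are made up for the example:

```python
# Relationships as (resource, relation, subject) tuples. A subject can be
# a plain object, or a relation on another object, written here in the
# "type:id#relation" style the talk describes.

def parse_subject(s: str):
    # "group:leadership#member" -> ("group", "leadership", "member")
    # "user:jill"               -> ("user", "jill", None)
    obj, _, relation = s.partition("#")
    typ, _, obj_id = obj.partition(":")
    return (typ, obj_id, relation or None)

relationships = [
    ("document:123", "owner", "user:jill"),             # Jill owns document 123
    ("group:engineering", "member", "group:security"),  # a nested group
    ("group:engineering", "manager", "group:leadership#member"),
]
```

The last tuple is the "point at a relation on another resource" case: members of group leadership are managers of group engineering.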
What are we really doing here with all of these objects and relationships? We're building a directed acyclic graph. This is just a graph expression of the relationships between people and documents, and people and organizations: a graphical version of the table that I showed back on slide five or so.

And finally we need to talk about permissions, which is the real reason we're all here, right? Everything before this was just building up to being able to compute permissions. Permissions interpret the graph in order to make access control decisions. I added two permissions to the schema that we've been building up: the edit permission and the view permission. The edit permission says that anyone who is an owner of a document, or anyone who is an administrator of the organization that the document belongs to, should have the edit permission. The view permission is computed by saying: anyone who is a reader of the document, and also anyone who already has the edit permission. So in that case we inherit the edit permission's downstream relationships when computing the view permission. If we want to visualize that as a graph, I have these sort of synthetic nodes, edit and view, where edit and view point at these other things, and we can use this to visualize the traversal of the graph.

Our schema also supports intersection and exclusion, which I thought I would point out because all of our other examples involve unions, since they're easy to visualize and reason about. But exclusions also give you some very powerful primitives for expressing permissions in an interesting way. In the top example I've added users who are banned, and you can see that even if you were already a reader and would otherwise have the view permission, if you've been banned, that permission has been removed.
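A toy evaluation of the permissions just described, over an in-memory set of relationships, might look like this. This is an illustrative sketch in Python, not SpiceDB's schema language or engine; the object names and data are made up:

```python
# edit = owner + admin-of-the-document's-org
# view = (reader + edit) - banned        (a union plus an exclusion)

REL = {
    ("document:plans", "owner"):  {"user:jill"},
    ("document:plans", "reader"): {"user:joe", "user:lex"},
    ("document:plans", "org"):    {"org:acme"},
    ("org:acme", "admin"):        {"user:kara"},
    ("document:plans", "banned"): {"user:lex"},
}

def rel(obj, r):
    # Direct subjects for one relation on one object.
    return REL.get((obj, r), set())

def edit(doc, user):
    # Union: owner of the document, or admin of the document's organization.
    in_org_admin = any(user in rel(org, "admin") for org in rel(doc, "org"))
    return user in rel(doc, "owner") or in_org_admin

def view(doc, user):
    # Union of reader and edit, then the banned exclusion is applied.
    allowed = user in rel(doc, "reader") or edit(doc, user)
    return allowed and user not in rel(doc, "banned")
```

Note how Lex, who is a reader, still loses view through the exclusion, which is exactly the banned-users example above.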
And then for intersection, this is kind of a contrived example, but an intersection would be something like: all users who have signed the license agreement, and who otherwise have view, are allowed to fork this document, whatever that happens to mean in this particular permission system.

So when people come to you, I mean, this sort of thing up here, where users have permissions like read, they imagine there's some common pattern. Can you always give someone a starter template that they fill in with additional things? Is that how people use this, or do they have to write all of it from scratch?

Yep. If you go to our playground, which is at play.authzed.com, we have a couple of examples that are built in. We have dedicated blog posts on common patterns that we've found from different users who are integrating this. One common pattern would be having a super-user admin: you bind all objects to a platform-level object, make an admin on that platform object, and then you can include that transitively in all of the downstream permissions. So yes, there are definitely common patterns. A lot of people ask: why don't I just get a user model for free, why do I have to define that? And the answer is that not all permission systems even have users, right? You could imagine something that is completely machine-to-machine, or something where users are not the terminus of the permissions computation. For example, we have a permission system that we use that terminates at access tokens instead of users, so users are just an intermediate. It's fundamentally up to the user to define how all of these things will work for their particular application, for their particular permission system.

Another question, and I don't know much about the space.
Is this definition language you have here completely unique to Zanzibar and what you guys are building, or do other systems barking up the same tree have their own version of something like this, maybe slightly simpler? I'm trying to understand the novelty of this.

Yeah. The Zanzibar paper lays out a raw version of this, which I'll get to, or I guess I won't get to it because it's very hard to understand. But our schema is a DSL that we wrote, and we have a compiler for it, in order to have a language specifically for expressing permissions. Other systems in the space don't use something like this, because they're not fundamentally about providing guidance on how to interpret a graph; they're usually policy languages. For example, OPA uses Rego, which I believe is a dialect of Datalog, for expressing constraints and then evaluating whether those constraints all come out to be true or false. That's a completely different paradigm from something that's relationship- and graph-based.

Okay, cool.

Yep. So, like I said, our schema is a DSL with a custom compiler. What it does is basically add type restrictions on the relationships, and then compile down the permissions to something that gets stored with each object type. The goal, because speed and scalability are among our primary concerns, is to be able to deserialize the object type and go: we want to just deserialize it and start interpreting right away. The place where we haven't invested yet is compiler optimizations. There are probably opportunities to find common patterns of usage where we can compile these permissions computations down to something either more cacheable or optimized on some other metric. Okay, so that's our schema. Next up we have the API and execution.
The API is fundamentally just CRUD operations on the relationships, plus a few specialized operations that are tailored for permissions. We have Expand, which gives you the direct objects one level removed from the object that you've queried. We have CheckPermission, which is the most important: that's the one making yes-or-no decisions about whether someone is allowed to access something. And then there's Lookup, which is the most novel API we've added versus the Zanzibar paper: Lookup is a way to start at the subject and walk the graph backward to get back to the resources on which you have a particular permission. Down below I have a link to our API. It's a gRPC API, and you can see it on the Buf Schema Registry.

Here's a quick example of how we traverse the graph to make a permissions decision. We're trying to see if user Jill has the view permission on some document, using the same graph from before. What we do is start at view, and we walk to all of the nodes that are downstream of view to see if user Jill appears in any of them. First we walk to reader, and we see that user Jill is not a reader. Then we walk down to edit, and from edit we go to owner, and we find that Jill is an owner. From that point we can already make our determination that she's allowed to have view on this document. We'll also check the admin path, to see if Jill is an admin on the organization to which this document belongs.

Okay. The next, and super important, part of our API is this concept called Zedtokens. Zedtokens represent a specific point in time. As you can imagine, mutations are being made to the underlying data and mutations are being made to the permissions themselves, and all of this is happening in some externally observable order.
We want to make sure that we respect that order, in order to get the consistency guarantee we talked about earlier. What Zedtokens allow us to do is make requests with slightly older data, but data that still has all of the relevant mutations that we're concerned with. They're opaque to the caller. They're fundamentally just a serialized protobuf, so they're not very opaque, but they are treated as opaque. The intention is that they live with the data and are updated when the data is. So if the system whose data you're protecting is eventually consistent as well, then the consistency guarantees that you need are only as strong as the replication of the underlying data. I know that's a lot to swallow; I have an example in just a second.

The most important thing is that Zedtokens solve the new enemy problem. What is the new enemy problem? The new enemy problem is when you apply old ACLs to new content, or when you evaluate changes to ACLs out of order. You can imagine: if I add someone to a ban list, and then I add them to a reader list, in that order, I never want them to actually be able to read. But if I evaluate those out of order, they'll be able to get access to data that they shouldn't have.

As a quick example, we have a document here called plans. We have two users: one user is called Lex and one user is called Kara. Lex has read access and Kara has admin access. At time t1, Kara removes Lex's reader permission, so Lex should no longer be able to read the document. At time t2, Kara uses her admin permission to add secrets to the plans. So now the plans contain data that Lex should never have been able to see. At time t3, Lex attempts to read, but a stale ACL is used, because the data wasn't replicated and we didn't have a mechanism for bringing that synchronization in line. So Lex is actually allowed to see the secret plans. And then obviously he uses the secret plans to cause chaos, as Lexes are wont to do.
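The ordering bug above can be modeled in a few lines. This is a toy, single-process model of a revision token that travels with the content, in the spirit of Zedtokens; it is not SpiceDB's implementation, and all names are illustrative. The stale check replays the ACL log only up to a lagging replica's revision; the token-aware check evaluates at a revision at least as fresh as the token:

```python
class AclStore:
    def __init__(self):
        self.rev = 0
        self.log = []  # (rev, op, (resource, relation, subject))

    def write(self, op, resource, relation, subject):
        self.rev += 1
        self.log.append((self.rev, op, (resource, relation, subject)))
        return self.rev  # revision handed back, playing the token's role

    def _check_at(self, triple, rev):
        # Replay the log up to `rev`; last add/remove for the triple wins.
        allowed = False
        for r, op, t in self.log:
            if r <= rev and t == triple:
                allowed = (op == "add")
        return allowed

    def check_stale(self, resource, relation, subject, replica_rev):
        # What a lagging replica would answer: the new enemy bug.
        return self._check_at((resource, relation, subject), replica_rev)

    def check(self, resource, relation, subject, zedtoken):
        # Evaluate at a revision at least as fresh as the token stored
        # alongside the content; a real system would wait for replication
        # rather than having the head revision available locally.
        return self._check_at((resource, relation, subject),
                              max(self.rev, zedtoken))
```

Replaying the Lex and Kara timeline: the stale replica at revision 1 still lets Lex read, while the token-aware check, carrying the revision serialized with the updated plans, denies him.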
So how would this change with Zedtokens? At time t1, Kara removes Lex's reader permission, just like before. At time t2, Kara adds the secrets and gets a Zedtoken back from SpiceDB. As we mentioned, a Zedtoken is an opaque cookie that represents a point in time. At time t3, Lex attempts to read, and a few different things happen. The t2 Zedtoken is sent along with the check for the data that Lex is trying to read, because remember, that token is serialized alongside the data. SpiceDB then uses a fresh-enough ACL: one that includes any changes made before t2, which happens to include Kara removing Lex's reader permission. So Lex is denied access, because we've seen that the reader permission was removed. And everybody's happy, because Lex isn't allowed to go in and do whatever he did to the building.

Okay. So the Zedtokens are comparable, like one comes before another?

Within SpiceDB, yes. But they're not comparable externally.

Alan, you have a question?

Yes, thanks very much. Could you just go back to that example with the Zedtoken, please? How do you protect against Lex essentially keeping an old Zedtoken and using that with his request?

Yeah, so Lex isn't the one who's responsible for passing Zedtokens. The Zedtokens are being stored and sent back to the permission service by the service that's serving the secret plans. I imagine you'd have something like a plans backend service with a get API, and that service is the one sending the Zedtokens.

Okay, thank you.

You're welcome. Yeah, we went back and forth on whether to have any association with cookies, because cookies obviously have a meaning on the web: they're something you give to a user and the user gives back to you. But in this case they're slightly different; they're for backend-to-permission-service synchronization, or causal ordering. Okay. So, diving into the architecture and implementation.
This is going to be a little bit of an eye test, but what this shows is how all of the major source code components relate to one another. Over on the left is where the requests come in. We support a REST API, but the primary API is actually a gRPC API. That gives you access to the CRUD operations against relationships and schema. Those CRUD operations go through a bunch of validation and compilation machinations, but fundamentally they talk directly to the data store; they don't go through any kind of interpretation the way the permission-specific ones do.

At the top, in the red, blue, and green boxes, we have the permission-specific APIs: CheckPermission, which is where you find out if someone is allowed to do a thing to a resource; Expand, which loads the next level of the graph; and Lookup, which is the reverse graph walk. Those get sent to a dispatcher. The dispatcher determines how to actually solve the queries that the user has made. It sends them out through a specialized version of the dispatch API, which will often forward the request to another SpiceDB node that we hope will do a quote-unquote better job of answering the question; we'll talk about what exactly makes it a better job in a few slides. This diagram shows it talking directly to CockroachDB, but that is an interface with multiple implementations, and a few others under active development. Like I said, we can sit on top of CockroachDB; that's our preferred open source default because it's globally replicated. But we also support Postgres, we have an in-memory driver, and then a few more in development.

Which component in this diagram is giving out Zedtokens?

The API itself, the gray box on the far left. There's a middleware which basically calculates what timestamp this request is being served at, and then gives that back to the user.
But for a given organization, do you have to go through a single gray box?

No, no. It relies on the downstream data store implementation to give us a sense of absolute time. In Spanner, absolute time comes from the atomic clocks. CockroachDB has a hybrid logical clock, which untangles it all. In Postgres we have a custom MVCC implementation with an increasing transaction counter, so that one is the most single-point-of-failure of them all. But any of the gray boxes can talk to the data store and get its sense of now. And often they don't even have to talk to the data store to pick a timestamp; I have a whole example of how timestamps get picked in just a few slides.

Just a few more implementation details. Like I said, we are primarily a gRPC API. We do that for the HTTP/2 parallelism and pipelining, and for the type safety. That also gives us request validation from a package called protoc-gen-validate; that's the syntactic request validation, and then we have a whole semantic request validation layer on top of that. SpiceDB itself is written in Go. We chose Go because we're familiar with it, first and foremost; goroutines make parallel execution and parallel computation really fun and elegant; and it's fast enough. And obviously we did it for the generics support, which is landing in two-ish weeks. Just kidding, we had no idea we were going to need that so badly. For the data store implementations, we have CockroachDB, Postgres, in-memory, and then a couple more under development.

In the beginning I mentioned that this is a globally distributed service. In the Zanzibar paper, they talk about basing Zanzibar on top of Google Spanner. If anybody's unfamiliar with Spanner: Spanner is a globally distributed ACID database that can do global atomic transactions. It's fully linearizable, and it uses atomic clocks for getting perfect, externally observable, linearizable ordering.
We use CockroachDB, like I mentioned. CockroachDB is slightly less linearizable than Spanner; I could give a whole talk just on that if we wanted. It's not linearizable; their mantra is "no stale reads", and you can force things to be linearizable if you make the transactions overlap, which is essentially what we have to do. Because CockroachDB and Spanner are derived from the same underlying principles for how they distribute data, we can use Spanner performance as a proxy for what we can expect out of CockroachDB.

One of the things in the actual Spanner paper is that one of Spanner's customers at Google, called F1, kept metrics on the response times they got back from Spanner. The 8.7 millisecond mean read latency is very good. The 376.4 millisecond standard deviation on read latency is very bad. The reason this happens is that if the data isn't already replicated to the node you're trying to read from, Spanner will go and get the data for you, regardless of where that data is. This could mean dragging the data across an undersea fiber cable, which is where some of these high tail latencies come from. And obviously super-high tail latencies make for super-poor user experiences.

So the solution that the Zanzibar team had, and the solution that we've copied, is to make as few calls to storage as possible. How do we accomplish that? Well, first (this doesn't directly address it, but) we break the problem down into parallel sub-problems. Then we can cache each and every sub-problem, and reuse the cache as much as possible; I'll go into how we do this by intelligently picking timestamps. And we de-duplicate requests, because a lot of things in the graph will converge down to a few clusters of nodes.
For example, if you have a really popular group or a really popular user or a really popular object, like a YouTube video, you'll end up redoing the same requests from a lot of different places, a lot of different times. We also want to try to batch our reads to the data store, since it's often just as efficient to load 100 relationships as it is to load one. And we want to pre-compute the transitive closure of some of the objects. We do that through a system under development called the Tiger cache; it's not available yet. The Zanzibar paper does it through a system called Leopard, which, as far as I understand, they only use for groups: groups of groups, groups of groups of groups, things like that.

So I'm going to walk through an example of how we break a problem down into sub-problems. This is pretty straightforward. The top-level question that we asked before was: can Jill view some document? This is fundamentally a union of a few sub-problems. One of the sub-problems is: is Jill directly a reader of the document? We can evaluate that and say that no, Jill is not. We can evaluate another sub-problem, which is: does Jill have edit on the document? That itself is another union: a union of whether Jill is an owner of the document, which of course she is, and the other branch of edit, which is whether Jill is an admin of the organization that the document belongs to, which of course she's not. So there are a few different sub-problems there, and each sub-problem can be evaluated independently, with the results combined to make the top-level determination. As we saw, two of the branches that we explored did not yield that Jill had permission, and one of the branches did.

One of the ways we can improve our cache hit rate (remember, we talked about nodes distributing requests and sub-requests to other nodes) is to pick nodes that are already likely to have computed the answer.
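The decomposition just described, with each sub-problem cached independently, can be sketched like this. This is an illustrative toy, not SpiceDB's dispatcher: the data is made up, the organization is hard-coded rather than resolved through the document's org relation, and real cache keys would also carry the evaluation timestamp, since (as the talk discusses) answers at a fixed point in time are immutable:

```python
from functools import lru_cache

# Illustrative direct-relationship data, per (resource, relation).
DIRECT = {
    ("document:plans", "reader"): set(),
    ("document:plans", "owner"):  {"user:jill"},
    ("org:acme", "admin"):        set(),
}

@lru_cache(maxsize=None)
def check(resource, relation, subject, ts):
    # Each call is an independent sub-problem; memoizing on
    # (resource, relation, subject, ts) stands in for the sub-problem cache.
    if relation == "view":
        # view = union(reader, edit); any() short-circuits on the first hit.
        return (check(resource, "reader", subject, ts)
                or check(resource, "edit", subject, ts))
    if relation == "edit":
        # edit = union(owner, admin of the org); org hard-coded for brevity.
        return (check(resource, "owner", subject, ts)
                or check("org:acme", "admin", subject, ts))
    # Leaf sub-problem: a direct relationship lookup.
    return subject in DIRECT.get((resource, relation), set())
```

Asking whether Jill can view the plans fans out into the reader, owner, and admin leaves, exactly the three branches walked through above.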
We do this by putting all instances of SpiceDB in a consistent hash ring. If you're not familiar with a consistent hash ring, it is a way to subdivide an address space. In this case, we're pretending that the hash yields an integer; every request that we hash will yield an integer that's as unique as possible for that particular request. The address space gets subdivided, and things that map to portions of that address space get sent to the nodes that have claimed responsibility for them. One of the really cool things about a consistent hash ring is that every node can independently come up with the same ring without the nodes having to coordinate amongst themselves. The way you do that is you hash the participants in the ring themselves and assign each a position, or usually many, many positions, on the ring. In that way, all of the nodes can follow the same algorithm for populating the ring and get the same consistent view.

Okay. So in this diagram, we have one top-level problem, which is whether Jill can view some document, but we also have a sub-problem, which is whether Jill can edit that document. In this case, the top-level problem went to node 2; node 2 figured out that edit on the document was a sub-problem, and then sent that on to node 1, which is responsible for that set of sub-problems. I'll pause here in case there are any questions; I imagine a lot of people are familiar with consistent hash rings, but just in case.

All right. So that's one of the ways we improve cache hit rate. The next way is to recognize that, at any given point in time, decisions are actually immutable. If at time t1 we computed that Lex had permission to read the document, then any evaluation of that permission at time t1 will always return that Lex has permission.
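The consistent hash ring described above can be sketched in a few lines. This is a generic, minimal implementation for illustration, not SpiceDB's actual ring; the node names and key format are made up:

```python
import bisect
import hashlib

def _h(key: str) -> int:
    # Deterministic hash into a large integer address space.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, members, vnodes=100):
        # Every node hashes the same member list to many virtual positions,
        # so all nodes derive an identical ring with no coordination.
        self.points = sorted((_h(f"{m}-{i}"), m)
                             for m in members for i in range(vnodes))
        self.keys = [p for p, _ in self.points]

    def owner(self, request_key: str) -> str:
        # Walk clockwise to the first virtual node at or after the hash.
        i = bisect.bisect(self.keys, _h(request_key)) % len(self.points)
        return self.points[i][1]
```

Two nodes that each build `Ring(["n1", "n2", "n3"])` independently will route the same sub-problem key to the same owner, which is how a check and its sub-checks land on nodes likely to have the answer cached.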
Next, we want to pick timestamps that already have globally replicated data. This goes back to that Spanner standard deviation of 370-some milliseconds that we saw: we want to pick timestamps where Spanner is confident that the data has already been replicated everywhere it needs to be. And finally, we want to pick timestamps that can be shared, because if everybody just randomly picked times, they would be very unlikely to pick the same time, and therefore we would be unable to reuse the same decisions.

So here's a diagram of how we do that. This is a timeline, and it reads like many that you're familiar with: things on the left happened earlier than things on the right. In this case, we have an ACL that was updated; this is similar to Kara removing Lex from the readers of the document. Then, at some point later, the document was updated and a Zedtoken was issued; that's the second box from the left. The third box represents the trailing point up to which the sort of data log has already been replicated globally. If we read data from any point to the left of that third box, we expect that we can read it locally; if we read from the right of that third box, we expect that we may have to go get the data from a remote server somewhere. And the last box is a request that was made with the Zedtoken: a request made on behalf of the user that includes the Zedtoken issued in box two.

First, because the Zedtoken forces us to evaluate at a timestamp at least as fresh as the Zedtoken itself, we can eliminate everything to the left of the first box: we cannot pick a time there to evaluate our permissions request. Next, we want to pick a timestamp for which the data has already been replicated globally, and again, we do that because we want to make a local read as often as possible.
One thing to note: if the ZedToken time were to the right of the globally replicated time, we would have to essentially just pick the ZedToken time, because there is no globally replicated time we could use. You'll often see this when a permission is updated and then checked immediately, like to drive a UI for viewing those permissions. And then the last thing we do is quantize the timeline, to create those timestamps that everybody can rally around. These timestamps are more likely to be cacheable, because you're more likely to get many different requests requesting evaluations at the same timestamp. This is a tunable, so the quantization period is something that you, the user of SpiceDB, can set. But it's directly related to cache hit rate, and also to whether you're able to find a timestamp that meets all of the other requirements. So in this case, we do happen to have three candidate timestamps that we can use that are both after the ZedToken was issued and at a time when we expect the data to be globally replicated. Okay, and that's all the magic behind timestamp selection. So again, you're basically going to the data servers to get your notion of time, and multiple servers are doing this; are you then just rounding up, saying give me the next nearest 20 seconds or minute or something like that? We're usually rounding down. We use five-second quantization periods. That's pretty aggressive; we may change that in the future to get a better cache hit ratio. But yeah, we're usually rounding down. And you don't always have to go to the data store. For example, if you know that the ZedToken was requested at a time that is sufficiently old, you don't need to talk to the data store at all; you can just directly quantize it.
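A rough sketch of these selection rules might look like the following. The function name, the second-granularity timestamps, and the fallback behavior are my own illustration of the rules described above, not SpiceDB's actual implementation:

```python
def choose_evaluation_timestamp(now, zedtoken_ts, replicated_ts, quantum=5):
    """Pick a shared, cache-friendly evaluation timestamp (seconds).

    Rules sketched from the talk:
    - never evaluate at a time older than the ZedToken (consistency floor),
    - prefer a time at or before `replicated_ts`, so reads can be local,
    - round *down* to a quantization boundary so that concurrent
      requests agree on the same timestamp and share cache entries.
    """
    candidate = min(now, replicated_ts)
    quantized = candidate - (candidate % quantum)  # round down
    if quantized < zedtoken_ts:
        # No globally replicated quantized time is fresh enough;
        # fall back to the ZedToken's own timestamp.
        return zedtoken_ts
    return quantized

# With replication trailing "now" and an older ZedToken, we land on a
# quantized, globally replicated timestamp:
ts = choose_evaluation_timestamp(now=107, zedtoken_ts=94, replicated_ts=103)
```

In the fallback case (ZedToken newer than the replicated frontier), this mirrors the "permission updated, then immediately checked" scenario: the ZedToken time itself is used, even though the read may not be local.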
So a client comes in trying to read later on; he's got an old ZedToken, and locally you know that it's within, say, two minutes of the last refresh or something like that, so you can just answer right then and there? Yeah, basically: if the ZedToken is older than the consistency window of our data store's clocks, then we can say it's sufficiently old that we can quantize from a period where we don't have to go to the data store and ask it what time it thinks now is, because now is totally irrelevant. Yes. So every five seconds you're asking the data store, what's your time, what's your time, what's your time? More frequently than five seconds, actually. But yeah, we're keeping track of what time the data store thinks it is, what time we think it is locally, and what the uncertainty window is that the data store has promised us, and then making decisions about what we have to do with time based on that. So for example, Spanner's uncertainty window is seven milliseconds. And CockroachDB's, in the Cockroach Cloud data store that we're using, their uncertainty window is, I believe, 500 milliseconds. So we take that into account when we pick our timestamps. And you have to do this because Cockroach is not linearizable, it doesn't have that strong consistency, right? So you're grafting this on top of it. But Spanner does, so with Spanner you don't need the ZedToken? Not for that, but it helps our caching. That's where the win is. And you still want to pick an older time to evaluate at, in order to be using data that's already cached or data that's already been replicated. So Alan Fekete on this call is said to be the number one database researcher in all of Australia. Alan, are you okay with this scheme? I want to think it through. It sounds reasonable, but yeah, these things get tricky.
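To make the clock-uncertainty bookkeeping above concrete, here is a hypothetical helper. The function name, the parameters, and the decision rule are my own sketch of the idea described in this exchange, not SpiceDB code:

```python
def needs_datastore_time(zedtoken_ts, last_known_store_ts, uncertainty_window):
    """Decide whether we must ask the data store what time it thinks
    "now" is before quantizing.

    If the ZedToken is older than the data store's clock-uncertainty
    window relative to the last store time we learned, "now" is
    irrelevant and we can quantize locally without a round trip.
    All arguments are in seconds.
    """
    return zedtoken_ts > last_known_store_ts - uncertainty_window

# With CockroachDB's roughly 500 ms uncertainty window (per the talk),
# a ZedToken from well in the past needs no round trip:
round_trip = needs_datastore_time(100.0, 200.0, 0.5)
```

The same rule with Spanner's 7 ms window would almost never require a round trip for tokens older than a few milliseconds, which is one reason the uncertainty window matters for latency.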
I'm particularly interested in the idea of having the ZedToken be essentially a restriction on the staleness. That's a very good idea in a lot of cases, but you have to make sure that they're propagated carefully, and especially I'd be interested in how it interacts with causal ordering, because with the new-enemy issue that you mentioned, a lot can happen with causal transmission of information. So you have to make sure that the times here respect all of that. I want to think it through; it sounds like a very good idea and worth a lot of research. So the thing that's sort of novel here is that a lot of distributed systems start with something that's eventually consistent, and then they layer on consistency: they layer on things like cookies and tokens to raise their consistency, or they do quorum reads, for example. Zanzibar starts from Spanner, which is already linearizable, and adds this mechanism as a way to improve performance. So it's starting from a much safer posture, and then relaxing that to get better performance. It's not saying, I'm going to try and make a determination on my own; it's saying, I'm going to use a consistent snapshot view of the database at a time where there are some external guarantees. So, Alan is the number one transaction expert in the southern hemisphere. That's Australia. Okay, that's definitely a small set. You're underselling him. Sorry, take it away. All right. We should talk about how we've done with this whole system. So we've got some real-world performance measurements that we've made against offset.com, which is running SpiceDB on the backend. Our check API response duration at the 95th percentile is 22 milliseconds; 20 milliseconds seems to be a good goal for these things, which are parallelizable by default since you can make many check requests at once.
So 20 milliseconds is a nice user sweet spot, because that's going to get added on to any other latency. Our API availability says 100%, but obviously nothing's ever 100%; you'll see there are some little dips in the graph where we go down to like 99.993%, things like that. These are all publicly visible at our dashboard, status.offset.com. We've talked a lot about cache hit ratio. We track it separately between the client and server caches. So when you're dispatching from one node to another, first you check whether you have it cached on the client side, so you don't even have to make the network request to the other node. But then we're also tracking it on the server side, which is the cache we expect to hit because of that consistent hash ring. Our cache hit ratio is around 65%. We think we can make this better by getting a little bit less aggressive with our quantization periods and through some other tricks involving compilation of our schema. But we don't have to evaluate these metrics in a vacuum; Zanzibar was nice enough to publish some metrics in their paper. First is the 99th percentile on checks at safe timestamps. Safe timestamps, in Zanzibar parlance, are those that only include data that's already been replicated globally. They're doing 15 milliseconds at the 99th percentile, which is unarguably better than our 22 milliseconds at the 95th percentile. Zanzibar's availability over the past three years has remained above 99.999%. We're not there yet. Part of the reason they're able to do that so well is because Google actually controls their own network. But we are doing pretty well, I think, in approaching that. And then, unfortunately, Google didn't publish cache hit rates for Zanzibar. But Airbnb did. They made an internal implementation of Zanzibar called Himeji for calculating permissions across the Airbnb platform, and they've achieved a 98% cache hit ratio. They're doing things a little bit differently.
They don't have ZedTokens or zookies, but it's a nice aspirational goal for us to get to something in the 90%-plus hit rate. This slide is just for fun, in terms of how far a system like this, with no single point of failure and wide horizontal scalability, can go: Zanzibar serves more than 10 million QPS, and it does it across more than 10,000 servers in several dozen clusters around the world. We are not there yet, but we hope to build a system that can scale to those kinds of heights. All right, and that's SpiceDB. Just as a quick recap: we're ushering in a new authorization paradigm for people outside of Google. It's relationship-based rather than role-based. We put the data and the schema together to give you that consistent network view that multiple services and multiple applications can all rally around. We want scalability before expressiveness. So if you go with a system like the policy evaluators that are built on top of Datalog, that's great, and you can express some really amazing things in those systems, but you're probably not going to get to 10 million QPS, at least not with the same sort of ordering guarantees that you get with Zanzibar and with SpiceDB. And yeah, there's always room for performance improvement; we've seen how some of these other implementations are doing out in the world, and we want to get there. And that's all I've prepared, so thanks for watching. This is a link to our Discord. Our Discord is where we talk about development with the community, where we do some of our planning, and where users help each other out, things like that. I encourage you to join if this is interesting from an implementation or a contribution perspective; I would applaud that. We have plenty of time for questions. Actually, I have a question not connected with the transactional-consistency-type topics.
In the part where you were indicating how you evaluate, for example, whether Jill has access, essentially by looking at the downstream of various nodes and intersecting those things: what happens when you have a query whose evaluation requires negation? You mentioned the one where, if somebody is banned, they must not have access no matter what. How do you do this sort of evaluation? Because simply looking downstream and finding the union of all the downstream results isn't good enough. You wait: you wait for all of the other branches that could possibly negate it to come back with their answers before you make your final decision, and that can happen at any point in the tree. You might have to wait, but often you can still make your decision before you get all of the downstream answers back. For example, if it's an exclusion and the banned branch shows up right away, and you find out that this user is banned and there's nothing that could reverse that decision anywhere else in the schema, you can say nope, not happening. Similarly for intersection: as soon as one branch completes and the user is not on that side of the branch, you can return, because there's no possible way the other branch could make this evaluate to true. So we return no. Doesn't that require a lot of clever, essentially compiler analysis of the structure of the formulae? Not really. The way this gets compiled down, the permission would essentially say: this is the exclusion, which is what we call it when it's a set difference, of this base set and these other operands. And when you have that exclusion, you just send out the sub-problems for the base set and each individual operand, and as the data comes back in, you make your determination upstream or not.
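The short-circuiting described in this answer can be sketched as follows. In SpiceDB these branches are dispatched concurrently; this sequential Python sketch, with made-up names and set-valued branches, only shows the early-termination logic:

```python
def check_exclusion(subject, base_branches, excluded_branches):
    """Evaluate `union(base) - union(excluded)` with early termination:
    a hit on any excluding branch (e.g. "banned") decides the answer
    immediately, since nothing elsewhere can reverse it."""
    for branch in excluded_branches:
        if subject in branch():
            return False  # banned: no need to wait for other branches
    return any(subject in branch() for branch in base_branches)

def check_intersection(subject, branches):
    """As soon as any branch excludes the subject, no other branch can
    make the intersection true, so stop early."""
    for branch in branches:
        if subject not in branch():
            return False
    return True

# Branches are modeled as callables returning membership sets:
readers = lambda: {"jill", "lex"}
banned = lambda: {"lex"}
jill_can_read = check_exclusion("jill", [readers], [banned])  # True
lex_can_read = check_exclusion("lex", [readers], [banned])    # False
```

In the real system each `branch()` would be a dispatched (and hopefully cached) sub-problem rather than a local set lookup.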
And the hope is that all of those things, or many of those things, will be cached, because the sub-problems themselves don't change; they're immutable. Thank you. You can go for it. Hi, my name is Steven. Thank you for the talk. My question is about resolving, for example, the Google Docs case where someone can access a document, let's say with the viewer permission. Oftentimes in Google Docs there's the concept of groups, so you can be a member of a group, and a group can be a member of a group. With this arbitrarily deep tree, does your system actually bound the resolution time of whether I'm a member of a group that has access to the document, as in the Google Docs example? Or do you pre-render some type of bitmask so you can control the bound on your resolution time? Great question. As of right now, we do not bound the resolution time; we want to give you a correct answer. Actually, I take that back: there is a bound in our offset.com version of one second on all requests. After one second we'll tell you we just don't know, it's too complicated. But we will never say, oh, maybe, or, yeah, go ahead and go for it. We'll tell you we don't know before we tell you a wrong answer. However, the Zanzibar paper, and a thing that we're working on, does actually handle the nested-group problem; that was the pre-computing of the transitive closure of objects. When groups belong to other groups, and those groups themselves belong to groups, you can go and say, for this top-level group, what are all of the leaf objects that belong to this group transitively? And you store that denormalization, to turn this sort of nested, serial back-and-forth with the data store into something that is denormalized ahead of time, so you just have access to that data when you need to make the query. The thing that we're working on is a system called Tiger Cache.
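The transitive-closure precomputation described here can be sketched with a simple graph walk. The data layout (a dict of group names to direct members) and the function name are illustrative assumptions, not how SpiceDB or Leopard actually store this:

```python
def transitive_members(group, direct_members):
    """Flatten nested group membership ahead of time.

    `direct_members` maps a group to its direct members, which may
    themselves be groups. Returns the set of leaf objects (e.g. users)
    reachable transitively, so membership checks become a set lookup.
    """
    leaves, stack, seen = set(), [group], set()
    while stack:
        g = stack.pop()
        if g in seen:  # guard against membership cycles
            continue
        seen.add(g)
        for m in direct_members.get(g, ()):
            if m in direct_members:  # nested group: walk into it
                stack.append(m)
            else:
                leaves.add(m)        # leaf object
    return leaves

groups = {
    "eng": ["backend", "frontend"],
    "backend": ["jake", "jill"],
    "frontend": ["lex"],
}
# Denormalize once; later checks avoid the serial back-and-forth:
eng_members = transitive_members("eng", groups)  # {"jake", "jill", "lex"}
```

The trade-off, as the next exchange notes, is write amplification: one relationship write can fan out into many updates of this denormalized structure.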
And the Tiger Cache will allow us to compute that, and it will also allow us to yield a roaring bitmap, which, I probably don't need to explain it to this group, but it's a way to export essentially a database index from one database, put it into another, and use it as sort of a condition on your query. I see. For example, in the Google case, because there are so many people-ops operations, you have people joining Google, leaving Google, and you have, let's say, some entity document that is controlled by "you have to be a Google-internal employee." Won't that be very busy for your cache? Almost like, Monday morning every week there's a bunch of activity, and you have to keep recomputing the pre-rendered bitmask so that you can resolve the membership correctly. Yeah, so it's actually much worse than that, because this isn't just for Google-internal systems; this is also for, you know, Google Docs, docs.google.com, right? So billions of people in the world interacting with Google Docs. Or at least YouTube, maybe not Google Docs. But yeah, you're right. What they talk about in the Zanzibar paper is the Leopard system, which does this transitive closure, and they say that a single write, a single relationship write to Zanzibar, will often manifest itself as tens of thousands of writes against that denormalized cache. So yeah, it's a Google-scale problem. Cool, thank you. Yeah, I guess a few questions. Thanks for this talk, it's quite interesting; we don't see papers on this topic in database conferences as much. So I guess one question I had was similar to what Alan said: what happens if you end up in a conflict? Like, with the exclusion, under the reader role, for instance, you end up with a negative constraint or negative rule.
And then under admin, there are two paths to that same permission, right? Either through a direct association with the document, or through a group. Can such a situation arise? I'm not quite sure how you would resolve conflicts like that. Yeah, absolutely, that happens all the time. The way it's resolved is, as you see in the schema example, these things have an order, right? These operations have an order. In this example, you have a pattern like an exclusion: you can be banned, or you can be a reader. The way this gets compiled, the problem of whether you're a reader gets evaluated, then the sub-problem of whether you're banned gets evaluated, and if you are banned, you're subtracted from the group of readers. That's the way to read this. So similarly, there needs to be a convergence point for any permission, where your access gets aggregated under a single named permission. That's my view. But in the admin case, if you were an admin but you were also banned, it would just depend on where admin was brought in: whether it was on the left-hand side of the exclusion or on the right-hand side of the exclusion. I'm wondering about a situation where, through a direct relationship to the document, I'm in the banned role, but I also have admin permission through a group that has the same permissions on the document. That's fine. I guess a related, sort of separate way to look at it is: do you have a formalism where, given this graph, you're able to consistently show what those permissions resolve to? Is that something that's written up somewhere? One of the interesting things that maybe was missed: see here, in the edit permission, where we bring in the organization's administrators. Can you see that? Yeah.
So in this case, let's pretend there was a minus sign to the right of that, which subtracted out the banned people. In that case, the owner and the org admin would be evaluated before the banned people were subtracted out. So in that particular manifestation of the permission expression, being banned would take precedence over whether you were an org admin or a member of a group or not. Right. But in the case where maybe it was "owner minus banned" in parentheses, and then org admin was unioned with that after the fact, you still have a deterministic order, because the org admin now takes precedence over whether you've been banned from being an owner. Does that make sense? I think it makes sense. In this case maybe I just need to think it through a little bit, because I don't see how this maps onto the actual graph that you have and what it looks like on that graph. Okay, but maybe I'll think about it and reach out to you. Happy to discuss it. We also have our playground, which is where I would usually go and live-model these concepts and show how it all fits together. And I guess a sort of separate question is: how complex do these graphs get? Some of the things I was thinking of probably don't make sense in the real world, but I'm just curious, how complex do these things get, how deep do they get? I wish I could show you the permissions graph from one of our users, because it looks like a Hubble deep-field image of a far-off galaxy. They can get very, very complicated. Interesting. In the Zanzibar paper they actually talk about this at Google; I think the average length of a policy document is like 1,500 lines, something along those lines.
I suspect those get messy if you have, say, an ACL change and then have to invalidate those transitive closures, and I suspect that gets messier with deeper graphs rather than shallower ones. Absolutely. There's definitely a critical path that arises in most of the graphs. Often you can prune some of the more complex branches if the permission was already granted through a union with something simpler, but sometimes you do have to go all the way to the bitter end of one of the most complicated relationships. Thanks. Next up is the premier database professor at the University of Maryland, and his beard looks amazing. I have a question. So what are the SQL queries that your system emits? What do they actually look like? Are they real simple, like get this, get that, or are they more complex joins? And then you mentioned batching to reduce the number of lookups against the database itself. Are you batching within a single request, like a JDBC-style batch of individual queries, or are you rewriting them, like multiple selects into a single select statement? And if it's the latter, how sophisticated is that sort of optimization stuff that you have now? Yeah, so the first part was what the SQL queries look like: we're basically treating these things as triple stores, so they're very simple. It's "give me the list of relationships that match this set of criteria," and the set of criteria is determined by where you are in the graph and what you're directly trying to evaluate. As for batching, we do some very naive batching, where we put separate SQL queries together in a single network call. We're not using sqlx; we're using pgx, which is the Postgres library we use for talking to CockroachDB, and pgx has its own concept of batches, where you can basically make a single network round trip. That didn't actually save us very much.
The batching that we do is more like: we will load more relationships than we need, if we think they might be useful for a different part of the evaluation. But this is all very naive at this point, and there's probably a lot of room for improvement. So using the driver to do batch calls of multiple single-statement queries didn't win anything, but you are doing some query rewriting to go get more data that could service multiple requests? It's more like we'll load a few extra rows. It could be that one single row could answer our request already, like if a specific object ID is there, but if there's something more broad we can use, we'll just load the extra rows. We're also using type information from the graph itself, or type information from our schema, to make optimizations about what we go and fetch, but that's not batching per se.
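The triple-store-style lookups described in this answer might look roughly like the following. The table and column names here are hypothetical, made up for illustration; SpiceDB's actual schema differs:

```python
def relationship_filter_query(resource_type, resource_id, relation):
    """Build one simple, single-table lookup: all relationship tuples
    matching a filter. No joins; the traversal logic lives in the
    application, which issues one such query per step in the graph.

    Table/column names are illustrative, not SpiceDB's real schema.
    """
    sql = (
        "SELECT subject_type, subject_id, subject_relation "
        "FROM relation_tuples "
        "WHERE resource_type = %s AND resource_id = %s AND relation = %s"
    )
    return sql, (resource_type, resource_id, relation)

# e.g. "who are the readers of document:somedoc?"
sql, params = relationship_filter_query("document", "somedoc", "reader")
```

The over-fetching described above would correspond to relaxing one of these predicates (say, dropping the `resource_id` filter) so the extra rows can answer sibling sub-problems without another round trip.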