My name is Kyle Brown. I'm an engineer at SingleStore, and today I'm going to talk to you about package transparency and WebAssembly registries. To give a brief overview of the talk structure, we're going to begin by explaining what WebAssembly, commonly abbreviated as WASM, even is. Then we're going to step back and ask the question: what is a package registry, and what kind of package registry does WebAssembly need? I want to propose a simple, if maybe somewhat surprising, answer. Then we're going to talk about applying some ideas from certificate transparency to package registries, talk about package transparency itself, a term created by combining the two, and Warg, a protocol implementing it, and wrap up by showing how the properties of package transparency help us mitigate different kinds of attacks. So jumping in: what is WebAssembly? Well, the simplest way we can really explain it is that WebAssembly is a platform-agnostic compile target. Or in other words, it's something that you compile programs to. You could write it by hand, but you're not likely to write a lot of it by hand, for the same reason you don't write a lot of machine code by hand: it's not very ergonomic, and it's very low level. Instead, at the moment, you're likely to create WebAssembly by compiling well-supported languages like Rust and C++ to WASM. These got support earlier than a lot of other languages because they do their own memory management, they're very low level, and they don't have to bring along an interpreter; they're the simpler side of the problem of building WebAssembly modules. With that said, work is in progress for a variety of other languages. We are working on solving the interpreter problem, and there are different options for handling garbage collection, whether WASM modules bring garbage collection with them or use a runtime facility for it. For a little bit of history, WASM was a web technology. It was created in the web.
That's why it's called WebAssembly, after all. It became a W3C standard in 2019, and even before then, back in 2017, it was supported in all the major browsers. So it's been out there for a few years now. But despite being something that came from the web and is supported in the browsers, it isn't just a web technology. It's not just for the web. It has these valuable properties that were created for the web that everybody else kinda wants too, believe it or not. The portability that it needed to run in all the major browsers on a wide variety of systems is useful no matter where you're trying to deploy your applications. Portability's great. It needed to be fast to run in the web: we expect really low latencies out of a web page, for it to load quickly. Really, we should expect that from more of our software. And so as a result, WASM has a really low startup latency. In fact, you can actually start compiling WASM before you've even finished downloading all of it, using streaming compilation; code generation and validation happen sequentially as the bytes arrive. It's also capable of near-native performance, think within about 20 to 30% of native. So there's a wide variety of applications that are capable of running in this kind of environment that maybe you wouldn't want to run in, say, JavaScript. On the security side, to run in the browser, it couldn't have ambient authority over all of the user's permissions, right? These modules needed to be sandboxed in some way. WASM does that by being capability safe, which you can just think of as: WebAssembly modules can only do the things they've been given permission to do. You have to actually satisfy an import, bind some capability to them, for them to be able to call something. And additionally, they have sandboxing.
Each WASM module's linear memory is independent from every other one, and a WebAssembly module can't reach out and access a linear memory that it hasn't been given. And so for all these different reasons, you can sort of imagine why people would want to use WebAssembly outside the browser. The company I work for, SingleStore, uses WebAssembly for database extensibility: we let customers compile their code to WebAssembly and run it in-process in the database as a user-defined function. Companies like Cosmonic are using it to create distributed application systems. Fastly is using it for edge computing. Fermyon is using it for serverless runtimes and platforms. Shopify is using it for plugins and extensibility. And Microsoft, in addition to the Blazor project, which is about running WebAssembly in the browser .NET-style, also supports running WASM modules in its Azure Kubernetes Service. All six of these companies and many others are part of the Bytecode Alliance, a non-profit foundation working on implementations of these open standards to help mature and advance the ecosystem. And the sort of WebAssembly that we're all creating that's gonna run outside the browser needs to be composed, shared, and deployed in a bunch of different environments. To do that, we really need a package registry: somewhere to publish your WASM modules and to pull them down, something that you can interact with at build time as you're compiling and also maybe at deploy time. And when we're creating this registry, we really want it to be as secure as WebAssembly itself. If WebAssembly has these really nice isolation properties and capability safety, but you deploy it insecurely, then you sort of lose all the advantages. You wouldn't want to secure a vault with a Cheeto, so to speak. So stepping back: we said that WebAssembly needs a package registry. What is a package registry? What's this thing that WebAssembly needs, at a more abstract level?
The main role of a package registry, or at least the kind of package registry we're talking about, because there are diverse kinds of registries out there, is really to delegate names. Like: you're now the owner of left-pad, I've given you permission over left-pad. It delegates names to package owners, who then use that permission to publish versioned releases of those packages. Put a different way, a registry is really just an index. It's a mapping from the name and version of a package to what's there, to the digest of the package contents, with some special rules about how it gets updated. So fundamentally, this kind of registry, this mapping of names and versions to contents, has to own the metadata of packages. It has to be the authoritative place where that info lives, where you establish the mapping between things and their contents. But fundamentally, it doesn't actually have to be the thing that does the content storing. You can store the content in a self-hosted location along with the registry, but you could also defer to a CDN, an existing OCI registry, or any sort of third-party content mirror to deliver the content. And since you already know what hash you're expecting, you can validate that you got the right thing no matter where you chose to get it from. And so now, in building a secure registry for WebAssembly, one that is an index and solves the problem we need to solve, we can apply ideas from certificate transparency to package registries. The rough analogy, and bear with me a little bit, is that with certificate transparency, people are able to detect when CAs misissue certificates, right? Whether it's accidental or malicious, you can tell that something was done incorrectly, that something was done wrong. And with registries, wouldn't it be kind of nice if clients could detect when registries accept invalid package updates?
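To make the index idea concrete, here's a minimal sketch in Python. The package name, the bytes, and the `fetch` callables are all invented for illustration: the point is that a registry only has to map name and version to a digest, so the content itself can come from any untrusted mirror and still be checked.

```python
import hashlib

# Hypothetical in-memory registry index: (name, version) -> content digest.
index = {
    ("left-pad", "1.0.0"): hashlib.sha256(b"left-pad wasm bytes").hexdigest(),
}

def fetch_and_verify(name, version, fetch):
    """Fetch content from ANY source (CDN, mirror, OCI registry) and
    check it against the digest the registry committed to."""
    expected = index[(name, version)]
    content = fetch(name, version)
    if hashlib.sha256(content).hexdigest() != expected:
        raise ValueError("content does not match registry digest")
    return content

# A mirror only has to return the right bytes; it is never trusted.
mirror = lambda name, version: b"left-pad wasm bytes"
print(fetch_and_verify("left-pad", "1.0.0", mirror))  # b'left-pad wasm bytes'
```

A mirror that returns different bytes fails the digest check, which is why the storage location doesn't matter for security.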
So for example, if somebody else tries to publish an update to your package, a new release, but they're not the actual owners of that package, they're not the ones who are supposed to be controlling that name. Package transparency does this. It applies some tools and ideas from certificate transparency to package registries to give us these properties. Package transparency, as defined by Lann Martin, who coined the term in our project, is publishing cryptographically verifiable commitments to the state of a package registry, and bear with me on this part, to allow auditing of the actions of package authors and the registry itself over time. That's a mouthful, so we're gonna break it into three parts. The first of which is publicly available registry state: everybody needs to be able to download and access the fundamental data of the registry. And we need the registry to make cryptographically verifiable commitments to that state and that data. That then allows us to audit it in various different ways. So this definition captures what we're trying to do, and we've created a protocol that is an implementation of these ideas. Warg, which stands for WebAssembly Registry, is this implementation of package transparency. It implements the three different steps we were just talking about, and we're gonna try to burn through explaining them. The first part: it needs to have publicly available state. People need to be able to download information about what the registry knew at different points in time. And it does that by representing every package as an append-only log of signed records. They all begin with an initial record, and that record says who the original owner of that package is. If a registry accepts this record, that means it's saying that you now own this new package. Every subsequent record contains the hash of the one before it.
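That record chain can be sketched as follows. The record fields and the JSON encoding here are invented for illustration, not Warg's actual wire format; the essential part is that each record embeds the hash of its predecessor.

```python
import hashlib
import json

def record_hash(record):
    # Hash a record over a canonical-ish encoding (illustrative only).
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def append(log, entry):
    """Append a record that embeds the hash of the previous record."""
    prev = record_hash(log[-1]) if log else None
    record = dict(entry, prev=prev)
    log.append(record)
    return record

log = []
append(log, {"type": "init", "owner": "alice-key"})  # establishes ownership
append(log, {"type": "release", "version": "1.0.0", "digest": "sha256:aaaa"})
append(log, {"type": "release", "version": "1.1.0", "digest": "sha256:bbbb"})

def verify_linkage(log):
    # Any modification to an earlier record breaks every later prev-hash.
    return all(log[i]["prev"] == record_hash(log[i - 1]) for i in range(1, len(log)))
```

Changing any field of an old record changes its hash, so `verify_linkage` fails for every chain that was tampered with after the fact.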
That's what that right-to-left arrow is. Records can do things that affect the state of the package, like creating releases, but also granting the permission to do releases and to manage authorization. Now that Bob has been given the ability to do a release, he subsequently can do that: he can append records to the log and submit them to the registry. And when they're accepted, these releases now exist and they're part of the state of this package. Since it's an append-only log, you might ask: how do I get rid of those releases? I can't delete records from it. It's immutable, it's append-only. The answer to how you would delete something is that you mark it as yanked, right? There's a second kind of record that you can append that says: pretend that didn't exist. If you've still got it, you still know the hash, that's fine, but for most clients, for discoverability and things, treat it as having been deleted. And so really a registry can be made up of a bunch of these package logs, a collection of package logs. That, then, is the publicly available state of the registry: the records that make up the logs for each package. Now we need to make cryptographically verifiable commitments to these data. We're gonna start by making the registry commit to what's happened in the past, like which records are part of its accepted history and in what order they happened. We do this by taking a data structure from certificate transparency that maybe some of you are familiar with, called a verifiable log, which is based on a Merkle tree. So what do we get out of the abstract data type of a verifiable log? A verifiable log provides a total ordering, describes the state at any given point with a unique checkpoint, and lets us verify that a record is in the log by comparing the record and the checkpoint. That's what we're gonna get out of it. And here's how we're gonna do it.
There's a sequence of records, and unlike the other kind of log we had, they don't contain the hash of the previous one. They're all independent records, and they all have their own independent hash that we can compute. If we want to know what the representative checkpoint is for this whole log, we hash together these leaves of a tree to form branches, and then the root of that tree, in a particular way. And as we add more records to this log, what we find is that the root hash that represents it changes. Sometimes the old root is a subtree of the future tree, and sometimes it's not; it has to do with whether or not the length is a power of two. But we'll show later on that that's not an issue, that's just part of how it works. And since every single leaf contributed to the value at the top, we can show that a given value is in the root by reconstructing the root from the leaf. We do that simply by hashing values together with their siblings to get higher and higher nodes in the tree until we get to the top. And if that matches what we were expecting, we've shown it's been included. There's also something called a consistency proof. You're generally gonna want to know that this log of the state of the registry only fast-forwards, that the history doesn't get changed. You do that by showing that the new state is consistent with the old state, right? That only records six and seven have been added, and nothing has changed. You do that essentially just by showing that all the stuff that used to be in the log is still included. We don't need an inclusion proof for zero through five individually; we just need inclusion proofs showing that their parent branches are included, and that you can recompute root six from them. So what does the registry claim has happened? The answer is: the sequence of values that are in its verifiable log. And each of these leaves has both the hash of the record itself and the name of the package hashed into it.
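Here's a compact sketch of that construction in the style of RFC 6962, the certificate transparency log format. The leaf/node domain-separation prefixes and the helper names are my own; the split at the largest power of two matches how such logs are typically defined.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Domain separation between leaves and interior nodes, RFC 6962-style.
leaf = lambda data: h(b"\x00" + data)
node = lambda left, right: h(b"\x01" + left + right)

def merkle_root(leaves):
    """Root hash over a sequence of records (split at the largest power of two)."""
    if len(leaves) == 1:
        return leaf(leaves[0])
    k = 1
    while k * 2 < len(leaves):
        k *= 2
    return node(merkle_root(leaves[:k]), merkle_root(leaves[k:]))

def inclusion_proof(leaves, i):
    """Audit path for leaf i: the sibling subtree roots needed to rebuild the root."""
    if len(leaves) == 1:
        return []
    k = 1
    while k * 2 < len(leaves):
        k *= 2
    if i < k:
        return inclusion_proof(leaves[:k], i) + [("R", merkle_root(leaves[k:]))]
    return inclusion_proof(leaves[k:], i - k) + [("L", merkle_root(leaves[:k]))]

def verify_inclusion(data, proof, root):
    # Rebuild the root from the leaf by hashing with each sibling in turn.
    acc = leaf(data)
    for side, sibling in proof:
        acc = node(acc, sibling) if side == "R" else node(sibling, acc)
    return acc == root

leaves = [b"rec0", b"rec1", b"rec2"]
root = merkle_root(leaves)
print(verify_inclusion(b"rec2", inclusion_proof(leaves, 2), root))  # True
```

The proof size grows logarithmically with the log, which is why clients can verify a single record without downloading everything.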
So you can tell that A0 was put into A, right? It's been appended to package A's log. But one thing clients can't efficiently do here is tell what the most up-to-date, the latest, record is in each of these logs. They'd have to walk the entire verifiable log backwards, and in the worst case, that's walking everything that ever happened, if your package hasn't been updated in forever. That's just entirely intractable for clients. So we need something that helps us solve this problem. And that tool is another thing from certificate transparency, called a verifiable map. Where the other thing was a log, this thing's a map; it's sort of that straightforward, really. It's a key-value mapping. It's also described by a unique hash. We can also check that things are included in it. Those are the fundamental things we want out of it. And unlike the other one, where we had a sequence of records that built up a Merkle tree, here we actually have sort of a radix tree. If some of you are familiar with radix trees, you'll know that at each branch, you decide which child to go to based on part of the key, right? So if we want to insert key zero-zero-one into this tree, we go left, then left, then right, so that we've walked the tree to the position that corresponds to that key. Now in reality, this tree has a height that is directly equal to the number of bits in the hash, so it's quite large, but on screen we can fit three levels, so that's what we do. We've inserted X here, and what that means is that the hash representing this leaf is the hash of X; we'll get to that later. And if we insert more things into the tree, we'll see it fill out as more things become present, and that this tree is fundamentally sparse. It's not a full tree; it's a tree with only some values present. We'll finally insert Z here at one-one-one. You can sort of see how this is working.
And like the other one, we build the hash of the root incrementally from the bottom up, so that every key and value in the leaves contributes to the root value. So we hash all the leaves to get their values, with a unique prefix. Then we can hash branches, like the one-one branch over there, showing that each combines the two values of its children. But we also have these branches that only have one child, right? Only one subtree to the left or right. In that case, we prevent collision attacks by modifying the prefix, so there's a bit that indicates which subtree is not present. At the top, both subtrees are present, so it's the normal case. And together, in that way, we've hashed up the information about a key-value mapping. And just like with the log, we can do an inclusion proof. We can check that one-one-zero was mapped to Y in this root, and we can verify that in logarithmically many hashes by simply reconstructing the root using the same hashes that constructed the root originally. So how do clients know what the latest record is? We have a verifiable map that associates every log ID, which is the hash of its name, with the most recent record in that log. In that way, the registry can verifiably say what the latest thing it knew was at that point in time. And these two data structures get combined together into a single checkpoint, which includes the hash of both, plus a signature by the registry. That is the state that we're committing to for the registry. And you'll note that this uniquely identifies the state of every package at one point in time. So if you ask the question, what was the latest version of foo as of some checkpoint, that has a single immutable answer throughout all of the future. And you can actually granularly update to new checkpoints and verify that everything stays sane as you do so.
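A checkpoint can be sketched like this. The `"WARG-CHECKPOINT"` label is invented, and the HMAC here is only a standard-library stand-in so the example runs without dependencies; a real registry would use an asymmetric signature (e.g. Ed25519) so anyone holding the public key can verify.

```python
import hashlib
import hmac

# Stand-in for the registry's signing key (a real one would be asymmetric).
REGISTRY_KEY = b"hypothetical registry key"

def make_checkpoint(log_root: bytes, map_root: bytes):
    """Commit to the log root and map root at once, under one signature."""
    payload = b"WARG-CHECKPOINT" + log_root + map_root
    sig = hmac.new(REGISTRY_KEY, payload, hashlib.sha256).digest()
    return {"log_root": log_root, "map_root": map_root, "signature": sig}

def verify_checkpoint(cp):
    payload = b"WARG-CHECKPOINT" + cp["log_root"] + cp["map_root"]
    expected = hmac.new(REGISTRY_KEY, payload, hashlib.sha256).digest()
    return hmac.compare_digest(cp["signature"], expected)
```

Because both roots sit under one signature, a checkpoint pins the registry's claimed history and its claimed "latest record per package" simultaneously.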
So it's quite a nice property. Our third part: if you remember, we had three parts. We had publicly available registry state; we just finished making cryptographic commitments to it; now we need to audit it. And it gets audited by a few different parties, but the first is the clients. Clients are fundamentally resource constrained. Most clients don't have supercomputers running just because they want to download a package; that's sort of an insane view of it. Even small devices should be able to be clients. And they only really care about some things. If you're a client and you want this package and its dependencies, you don't want to verify the entire registry state. That'd be absurd. What you want is to verify only the relevant package state, and that that package state is committed to by the registry. So you only care about a slice. And you also want to know, of course, that the commitments are valid and correct. So at the beginning, a client knows nothing. Complete clean slate. What they'll do is download the logs for the packages they care about. In this case, that's all the packages in our fake registry, but you can imagine there are many, many packages they don't care about. Then they're going to verify the package logs. They'll sequentially process each one using the logic we talked about earlier: if I grant someone a key, then they can do something; if I don't, then they can't. And you can't release two things with the same version. There are rules that you sequentially validate, for all the packages you care about. And in this case, we're processing a log that actually fails, because for some reason this guy named Charlie is trying to publish a release, but he never got permission. So this log is going to fail validation. We can detect for sure that this was not supposed to happen.
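That sequential validation might look something like this sketch. The record shapes and the rule set are simplified inventions, not Warg's actual record types, but they show the replay idea: walk the log in order, tracking who is authorized, and reject the first record that breaks a rule.

```python
def validate_log(records):
    """Replay a package log in order; return an error string, or None if valid."""
    authorized = set()
    released = set()
    for i, r in enumerate(records):
        if r["type"] == "init":
            if i != 0:
                return f"record {i}: init must come first"
            authorized.add(r["owner"])
        elif r["type"] == "grant":
            if r["by"] not in authorized:
                return f"record {i}: {r['by']} cannot grant"
            authorized.add(r["to"])
        elif r["type"] == "release":
            if r["by"] not in authorized:
                return f"record {i}: {r['by']} was never authorized"
            if r["version"] in released:
                return f"record {i}: duplicate version {r['version']}"
            released.add(r["version"])
    return None  # log is valid

# Charlie tries to release without ever being granted permission:
bad = [
    {"type": "init", "owner": "alice"},
    {"type": "grant", "by": "alice", "to": "bob"},
    {"type": "release", "by": "charlie", "version": "2.0.0"},
]
print(validate_log(bad))  # record 2: charlie was never authorized
```

Any client can run this check independently, which is why an unauthorized release is detectable even if the registry accepted it.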
We'll continue in the success case, where the logs are fine, and show you what would happen next; but you could already have bailed at this point if you detected that. So the next thing you get is the checkpoints. In this case, the word "log" and the word "map" represent the root hash of each structure, for compactness. And you'll notice we don't have any of the other information about the log or map, none of the things that make them up. We don't need them yet. Until now, that is, because now what we're going to do is download just enough information for those data structures, just the Merkle audit paths, the little sibling and uncle nodes required to verify exactly what we need, very sparsely. With those, we can show that the registry committed to these records being the latest and that these records are part of its history. Clients will also validate the signature on the checkpoint, showing that it was actually committed to and claimed by the actual registry, not some random third party that made up a registry checkpoint. In addition, if a client already had a checkpoint, then for the log at least, it can ensure very quickly that the new one is a fast-forward, that all the log records it previously knew about are still part of the registry. The one thing clients can't really do is verify that the log and map agree with each other, right? They can't easily check that there isn't some record in the log that's not represented in the map, something newer than what they know about. So, who are clients going to call? Monitors, not mythbusters. Monitors are able to do this. They're long-running processes. They consume the registry's data as a stream, and so monitors are able to process all the checkpoints that ever were and make sure that they all agree with each other, internally and across time. And that's the part the clients can't do.
And because clients can talk to as many or as few monitors as they want, you can add security by simply standing up more monitors, and sort of create a web of trust. Right, so revisiting our three parts of package transparency: we needed publicly available registry state, which we got with a collection of package logs. We needed cryptographically verifiable commitments, which we got using signed log and map checkpoints. And we needed auditing of package authors and the registry, which we do using a client-and-monitor system, once again akin to certificate transparency. So, we've done all of this. How does it hold up against attacks? Well, let's imagine that the registry gives you package data, but it gives you a modified record in the past. How are you gonna know that they modified a record in the past? The answer is that two things are gonna immediately stop being true; they'll stop validating correctly. The hash linkage from A2 back to A1 will no longer work, because A2 didn't contain the hash of this modified record; it contained the hash of the actual A1, which is not what you have here. And additionally, the log inclusion proof won't succeed, because the checkpoint provably did not include this record. What if a registry tries to hide something from you? Right, what if it tries to say: no, this thing didn't exist. Don't worry, there's not a security patch. You're fine, that version wasn't yanked. If it does that, then the linkage from the map is gonna fail, right? The inclusion proof that says A1 is the newest thing as of my checkpoint will fail, because it wasn't, and the registry already committed to the fact that it wasn't.
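The first of those failures, the broken hash linkage from A2 back to A1, is easy to see concretely. The record shapes and encoding here are invented for illustration; the mechanism is just that each record embeds its predecessor's hash.

```python
import hashlib
import json

def record_hash(r):
    return hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()

# A tiny hash-linked package log.
a0 = {"type": "init", "owner": "alice", "prev": None}
a1 = {"type": "release", "version": "1.0.0", "prev": record_hash(a0)}
a2 = {"type": "release", "version": "1.1.0", "prev": record_hash(a1)}

# The registry hands back a modified A1; A2's embedded hash exposes it.
forged_a1 = dict(a1, version="6.6.6")
print(a2["prev"] == record_hash(a1))         # True: the genuine record links up
print(a2["prev"] == record_hash(forged_a1))  # False: tampering detected
```

No trust in the registry is needed for this check: the client recomputes the hashes itself from data it already has.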
Now, you'll note we already talked a little bit about the consistency of the log and map here. The idea is that if the registry had created a map and log that didn't agree, where the map said A1 was the latest thing but the log already knew about A2, then it's the monitors' job to catch that. At this point, those two structures have already been proven to be consistent. And that's how you know you're not getting indefinitely frozen, for example. So, to summarize the broad points of the talk: Wasm is a promising way to make portable and secure software. Package registries, or at least the kind that we need, are really indexes of content that map name and version to what's actually there. Package transparency is this combination of certificate transparency and package registry concepts, and it helps us provide some really interesting defenses against different kinds of attacks. With that all said, I'd like to give special thanks to a few people who helped review and advise this talk, like Lann Martin and Luke Wagner, and to thank some other contributors to the Warg project, like Bailey Hayes and Peter Huene. Finally, thanks to my company, SingleStore, for supporting Warg by enabling me to work on this project. And that's all. So, Warg itself is really about managing namespaces: saying that this name is owned by these people, and that this version authoritatively has the contents it was claimed to have.
If you want to claim something about who authored content, which is the thing you're probably interested in, and I think a lot of people want this and are currently doing it, and we're really interested as well, you want to verify that the author of that content is someone in particular. You'd actually do that inside the content, in our case, because WebAssembly has an emerging proposal to include signatures in the Wasm blob itself; a component can contain a section with signatures in it, and a registry, as a matter of policy, could even require that your component has that section. Because that's a separate question, once again: whether this was actually made by Microsoft, Google, whoever, is one question; whether it is actually the rightful version 1.0 of foobar is a separate question. But you can trace back, right? Because the signatures in the log are all contiguous, and they authorize subsequent signatures. So that's how you can know that the people you originally gave the name to gave it to people who gave it to whoever holds it now. And this is not that different from, in fact, this is much more verifiable than, NPM, for instance. If you wanna know whether the people you gave permission over left-pad to on NPM are actually the same people today, here you have way more ability to check that than you ever had before. And if you want author verification, once again, that's just an aspect of content signatures; it's a separate question that's handled separately. Then there's the question: how do I know what the key of this registry is? So, one thing in the current design, and this is something I didn't cover in the talk because it's sort of a detail: there are actually not just logs for every package. There's also an extra log that's the operator's own log, where they can track their own key rotation. And since it's part of this committed state, you can also track and verify changes to that key as well.
And it has a bit of a strange circularity, but it's one that works out, we think. Yes. So the way we currently envisage monitors working is that monitors are gonna subscribe to a stream of registry data from the registries that they monitor, and they're gonna process it. And then clients, the actual API clients, will pick the registry they're using, and also pick a list of monitors that they trust. When you do that, whenever you pull down data, you're actually gonna ask that monitor: hey, do you know about this checkpoint, and is it safe? So it's not that monitors tell clients things; clients ask monitors things. It goes in that direction. And we expect monitors to be able to process quickly enough that you can do this essentially synchronously with the install process. Monitors are also one of the defenses against indefinite freezes, because when you say, hey, I just installed this checkpoint, a monitor can say: whoa, that's like three months out of date, buddy, somebody is holding things back from you. But yeah, the interaction goes from client to monitor. All right, so because of the separation of namespace and content, we fully expect people will bring their own storage and, to some extent, their own registry systems. But when they want to be part of the namespacing that we're creating for WebAssembly packages, when you say, this is so-and-so's registry, so this package name and version is whatever, they will probably choose to run that on top of their existing systems. The use cases for registries are really varied. We expect to see some general-purpose public registries, at least a few, potentially one run by the Bytecode Alliance, but maybe not; we'll see. Registries are very expensive to run, is the thing.
Then there are registries run by different individual companies and projects: if you're, say, Envoy, maybe you wanna have your own registry of Envoy extensions and things, right? My company may run a registry of database extensions at some point. You also potentially have deployment registries inside your company that are the only place you go to download things, that mirror content from other places. A lot of companies might run their own monitor, because monitors, once again, we expect to be relatively cheap; as a result, you can have your own and then use another registry, but know that you're always monitoring it. So we expect to see a bunch of different registries run on top of whatever people wanna run them on top of. Yes. At the moment, all of the actions that are taken on a package are doable only by the people who've been granted those permissions directly in the log itself, in-log authorization. There is another tool as well that adds onto that, where registries can unilaterally reject anything, actually. This is just a fundamental property of all of these systems: you can always ignore an HTTP request you get, and more fundamentally here, you're able to just not include something in your log, so it didn't happen. And one thing we expect is that registries will use this in a good way, for policy. So one thing you can do is simply say: no, I'm not gonna allow you to release that thing, because it has known vulnerabilities. You as a registry operator can choose, as a matter of policy, that you only allow things that are signed, have no known vulnerabilities, et cetera. And that's a thing that different registries will choose different answers to.
One other thing that's sort of an interesting idea we're currently thinking about is that a registry can actually, as a matter of policy, reject new package creation that doesn't give the registry operator some permission in the log. You can say: I'm only gonna let you have the name foobar if you give me the permission to yank stuff. So it's all still in-log, but in combination with this idea of operator policy through rejection, you can achieve some of these higher-level policy ideas, like giving yourself the ability to yank things. We're not entirely sure; there are some trade-offs there, especially around the size of these logs and the size of these data structures. So we're not sure. At the very least, this will give you a strongly identifiable way of talking about packages, so that you can build tools like that on top of it or interoperate with existing tools, which is maybe the best way for this to work. Yes? Yeah, how do you recover from, like, compromise of these keys? That's sort of a big question. One of the answers, once again, goes back to this idea about initial package creation policy, where you could say: as a registry, in order to create a package in my registry, you have to give me the ability to reassign key permissions. And that's really scary sounding. The only thing that makes it sound a little less scary is that you can tell when they do it, right? The log stays fine, and if the registry ever uses this power that it required you to give it to reset the keys, then it's sort of an "oh crap" moment where you're actually gonna want external information. It's gonna halt the next poll or whatever.
So that's the best answer we have at the moment: it's some combination of operator policy, the direct ability to rotate keys, and the transparency of all of this, and together those are the ingredients for an answer to this question. There's also potentially the ability to define a special kind of record that operators can always issue which takes over a package. Obviously you wouldn't download anything past that point without resetting your trust in the package authors, and you'd want to know why they did it. But that has some interesting complexity, because answering the question of which operator keys were allowed to do that at some point in time requires you to correlate the operator log with the package log. So it's a little weird to have a special operator record type whose permissions come from outside the package log. At the moment it's all in-log: you'd have to have built in the ability to recover when you created the package, which is a little scary-sounding.

Existing registries may have a hard time adopting this. Or maybe they won't; I'm not entirely sure, because you have to go back and rewrite your history as one of these histories, and I don't know how tractable that is for most of them. But to some extent, what we're describing here isn't really very WASM-specific. The only thing you really have to do with Warg to serve some other type of content is to add a new content type and describe how you compute a content digest for that type. Once you've done that, everything else is a matter of policy. So yes, you could use Warg for something else. It's not very WebAssembly-specific, but we came to it from the direction of WebAssembly and solving the needs WebAssembly has for federated, verifiable namespaces. If everybody else needs that too, you're welcome to come take a look.
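The "add a new content type and describe how you digest it" idea can be sketched as follows. The names here are invented, and std's `DefaultHasher` stands in for a real cryptographic hash like SHA-256 purely to keep the example dependency-free; a real deployment would never use a non-cryptographic hash for content digests.

```rust
// Hypothetical sketch: the protocol is agnostic to what the content is;
// only the digesting rule is type-specific. DefaultHasher is a stand-in
// for a cryptographic hash and is NOT suitable for real content digests.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

trait ContentType {
    const NAME: &'static str;

    // Domain-separate the digest by content type name, so the same bytes
    // interpreted as different content types yield different digests.
    fn digest(&self, bytes: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        Self::NAME.hash(&mut h);
        bytes.hash(&mut h);
        h.finish()
    }
}

struct WasmComponent;
struct OciImage;

impl ContentType for WasmComponent {
    const NAME: &'static str = "wasm-component";
}

impl ContentType for OciImage {
    const NAME: &'static str = "oci-image";
}

fn main() {
    let payload = b"same bytes";
    // Same payload, different content types: the digests differ, so log
    // records are unambiguous about what kind of thing they refer to.
    assert_ne!(WasmComponent.digest(payload), OciImage.digest(payload));
    // Digesting is deterministic for a given type and payload.
    assert_eq!(WasmComponent.digest(payload), WasmComponent.digest(payload));
    println!("content-type digests are domain separated");
}
```

Everything else in the log (signatures, permissions, policy) works unchanged once the digest rule for a new content type is defined, which is what makes the protocol reusable beyond WebAssembly.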
We're happy to work with anybody. Yes? Not at the moment. Warg is currently written in Rust, but it's a protocol, so it can be implemented anywhere. It has HTTP APIs, it has protobuf definitions that describe the layout of the records in the log, and it has abstract semantics for how the hashes are created, so you could largely implement it in any language. We expect that at some point we'll actually compile our Rust implementation to WebAssembly, and there are some really fun bootstrapping things you get to do there. Another thing that has a lot of people excited, in trying to be really thoughtful about supply chain security in WebAssembly, is that you could have the ultimate version of a buildpack: a WASI component that takes you from source code to actual artifact in very verifiable ways. Deterministically, even, because WASM can be run in a deterministic mode. So there are some exciting opportunities there, which is how I got onto this tangent about compiling things to WASM. Additionally, if we can compile the core logic of the registry to WASM, and you can run WASM in a lot of places, it becomes a lot easier to interact with these things. There are some complex cryptographic constructions in use here. We didn't rewrite SHA by any means, but the security design, the correct use of those primitives, is still complicated to get right, so the more people can leverage one shared open source implementation, the better. And of course all of this is open source. Maybe I didn't say that, but it's all out there: Bytecode Alliance, github.com/bytecodealliance/registry, or warg.io if you want the short way to find our stuff.
And if we can share an implementation of these things that compiles to WebAssembly and runs quickly, why not use a single client, or at least a limited number of clients, that we've audited really well? I think it's about to be lunch, so we'll call it there. Thanks.