Hello, everyone. My name is Hayden Blauzvern. I'm a software engineer on Google's open source security team, and today we'll be talking about the intersection between privacy law and transparency logs.

At a high level, transparency logs are immutable and append-only data structures, meaning what goes into the log can't be mutated and can't be removed. We'll talk more about how these logs are constructed and their applications later on. One thing I've been thinking about recently is: what if I want to remove data from an immutable log? One answer might be that I can't. The contract of the log states that what goes into it is immutable and can't be removed. But when considering privacy law, I'm not certain that's an acceptable answer. If privacy law requires that data be deleted at the source where it was collected, I'd like to have some technical solutions for actually deleting from an immutable log. That's what we're primarily going to talk about today.

We'll start by discussing Merkle trees, which are the underlying data structures for transparency logs. We'll talk a little bit about logs and their applications. We'll have a brief aside on privacy law and the parts of it that are relevant to transparency logs. And then I'll conclude with some technical approaches for implementing a mechanism we'll call redaction: the ability to remove something from a log without breaking the cryptographic properties that the log guarantees.

So let's first talk about Merkle trees. Before we do that, I want to briefly discuss cryptographic hash functions, because their properties give us some intuition about why Merkle trees are able to provide certain guarantees. A cryptographic hash function is typically described as one-way: given some message, I pass it through the hash function and get a digest, and because it's one-way there's no "un-hash" function, no way to take a digest and recover the message that was used to create it. Typical hash functions include SHA-256, BLAKE2, or MD5.

There are three properties worth bringing up. The first is called pre-image resistance: given y, some digest, it's hard to find a value x, a message, such that hash(x) = y. This is effectively the formal definition of a one-way function: given some digest, I can't find the underlying message. And I'll note that when I say hard, I mean improbable or infeasible. For example, if a digest is 256 bits long, there are 2^256 possible digest values, so it's not going to be possible to enumerate messages until you find one that matches a given digest. Another property is second pre-image resistance. Whereas in the previous property we fixed the digest, in this case we fix the message: given m, some message, it's hard to find a different message m' such that their digests are equal, hash(m) = hash(m'). And closely related to this is collision resistance, which states that it's hard to find any two arbitrary messages, m1 and m2, such that their digests are equal. It's worth noting here that collision resistance implies second pre-image resistance.
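Just to make the one-way property concrete, here's a small illustration (my own example, not from the slides) of hashing two nearly identical messages with SHA-256: the digests are easy to compute forward, but share no useful structure, which is why working backwards from a digest to a message is infeasible.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	// Hashing is one-way: easy to compute forward, infeasible to invert.
	m1 := []byte("transparency logs are append-only")
	m2 := []byte("transparency logs are append-only!") // one extra character

	d1 := sha256.Sum256(m1)
	d2 := sha256.Sum256(m2)

	// The two digests look completely unrelated, even though the inputs
	// differ by a single byte. With 2^256 possible outputs, enumerating
	// inputs to match a given digest is not feasible.
	fmt.Printf("%x\n", d1)
	fmt.Printf("%x\n", d2)
}
```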
In other words, if I can't find any two arbitrary messages with the same digest, I certainly won't be able to find a second matching message once one of the values is fixed.

So let's talk about Merkle trees, and keep these cryptographic hash function properties in mind, because they give us some intuition about why Merkle trees work the way they do. A Merkle tree is typically represented as a binary tree whose leaf nodes are hashes of the inputs, in this case d1 through d4, and every non-leaf node is the cryptographic hash of its children. For example, looking at the left side of this tree, to build h(d1, d2) I concatenate its children, h(d1) and h(d2), and hash that. Going up one level to build the root hash, it's the same thing: I concatenate the two children and hash that value to get the root hash.

This structure allows us to calculate what are called inclusion proofs: a set of hashes that let us prove that some value is actually in the tree. Before we talk about what specifically goes into an inclusion proof, let's start with a brute-force solution. If I want to prove that some value is in a tree, first I'm going to need to effectively pin the root hash. Think of verifying a chain of certificates: you have some leaf certificate, and it chains up to a root certificate that you've been given or implicitly trust, maybe one that comes with your operating system. It's similar here. The goal of an inclusion proof is that we'll eventually be able to recalculate that root hash. And thinking about the properties of cryptographic hash functions, because they're one-way, I can't take a root hash and work out, in any feasible way, some other set of values that would calculate to that same root hash.

So one way to prove inclusion might be to give you every single leaf node in the tree, along with the root hash you were given some other way. Then you can recalculate the root hash by taking pairs of nodes, calculating the parent, and so on up to the root. If you can calculate that same root hash value, you know the element was in the tree. The problem with this solution is having to give you every single leaf node. In this example there are just four, so that's doable. But say there are a million, or two million, or a billion leaf nodes in the tree; it doesn't make sense to hand over every single leaf if you only care about one. One thing I'll note here is that in order to provide this root hash, typically the transparency log has a signing key and creates a signature over the root hash. So you don't actually have to fetch the root hash out of band: the log can give you its root hash, and as long as you have the verification key of the log, provided out of band, you know you can trust that root hash.

So let's take a look at calculating an inclusion proof with just a subset of nodes (the sketch below shows both the construction and this check in code). Like we said, the goal of the inclusion proof is to recompute the root hash, so I need its two children, which in this case are h(d1, d2) and h(d3, d4). And as part of that subset of nodes, I also need the hash of the element I actually want to check inclusion for.
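Here's a minimal sketch of both pieces, the tree construction and the sibling-based inclusion check we're about to finish walking through. It assumes RFC 6962-style domain separation (a 0x00 prefix for leaf hashes and 0x01 for interior nodes, the convention Certificate Transparency uses); the function and type names are my own, and it only handles the power-of-two shape of this four-leaf example rather than arbitrary tree sizes.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// leafHash and nodeHash use RFC 6962-style domain separation: a 0x00
// prefix for leaves and 0x01 for interior nodes, so a leaf can never
// be confused with an interior node.
func leafHash(data []byte) []byte {
	h := sha256.New()
	h.Write([]byte{0x00})
	h.Write(data)
	return h.Sum(nil)
}

func nodeHash(left, right []byte) []byte {
	h := sha256.New()
	h.Write([]byte{0x01})
	h.Write(left)
	h.Write(right)
	return h.Sum(nil)
}

// merkleRoot builds the root for a power-of-two number of leaves,
// pairing up nodes level by level, exactly as in the four-leaf example.
func merkleRoot(leaves [][]byte) []byte {
	level := make([][]byte, len(leaves))
	for i, l := range leaves {
		level[i] = leafHash(l)
	}
	for len(level) > 1 {
		var next [][]byte
		for i := 0; i < len(level); i += 2 {
			next = append(next, nodeHash(level[i], level[i+1]))
		}
		level = next
	}
	return level[0]
}

// proofStep is one sibling hash plus which side it sits on.
type proofStep struct {
	sibling []byte
	isLeft  bool
}

// verifyInclusion recomputes the root from a leaf and its siblings.
func verifyInclusion(leaf []byte, proof []proofStep, root []byte) bool {
	h := leafHash(leaf)
	for _, p := range proof {
		if p.isLeft {
			h = nodeHash(p.sibling, h)
		} else {
			h = nodeHash(h, p.sibling)
		}
	}
	return bytes.Equal(h, root)
}

func main() {
	d1, d2, d3, d4 := []byte("d1"), []byte("d2"), []byte("d3"), []byte("d4")
	root := merkleRoot([][]byte{d1, d2, d3, d4})

	// Inclusion proof for d3: its sibling h(d4), then the sibling of
	// their parent, h(d1, d2) -- only log2(n) hashes, not every leaf.
	proof := []proofStep{
		{sibling: leafHash(d4), isLeft: false},
		{sibling: nodeHash(leafHash(d1), leafHash(d2)), isLeft: true},
	}
	fmt.Println("d3 included:", verifyInclusion(d3, proof, root))
}
```

Notice the proof for d3 carries only two hashes, which matches the "one sibling per level" point in the walkthrough.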
Back to the walkthrough: the element here is d3, so I have h(d3). And what you'll see in this example is that I don't actually need every single node in the tree to verify the root hash. I only need the root hash's two children, and then, looking down the right side, I don't actually need h(d3, d4), because I can use h(d3) and h(d4) to calculate that parent node. What ends up happening is that you only need the sibling of a node at each level of the tree, so the proof contains only as many hashes as the tree has levels. Looking at this inclusion proof, all you need is h(d4) and h(d1, d2). And because you already have the root hash, provided out of band, and you have d3, meaning you know its hash, this is all you need for an inclusion proof. Once again, keeping in mind the cryptographic hash properties we talked about, there's effectively no way to forge a set of hashes connecting a given starting hash, d3's hash, to the root hash; it's infeasible to calculate some other chain.

So now let's talk about transparency logs. Like I mentioned early on, transparency logs are built on top of Merkle trees. You can see the structure looks very similar, but the primary difference is in the leaf nodes. Transparency logs are typically meant to be publicly auditable, meaning those who are putting entries into the log, or consuming them, will likely want to know the values that were recorded. So transparency logs typically aren't just going to have hashes in their leaf nodes; they'll also store the values used to calculate those hashes, so the log can be searched. Because transparency logs are built on top of Merkle trees, we can calculate inclusion proofs and prove that any element is in the tree. And there are actually some Merkle tree constructions that allow you to calculate non-inclusion proofs, proving that something is not in the tree; one example is what's called a sparse Merkle tree.

Now, Merkle trees are not inherently immutable or append-only. If you want to remove an element from a Merkle tree, that's fine, as long as you recalculate all of its parent nodes up to the root hash. So in order to get the property that transparency logs are immutable and append-only, we're going to need to calculate what's called a consistency proof. Let's take a look at an example. Effectively, a consistency proof shows that from some state where I've already verified consistency, or maybe some initial state, if I go look at a new state, entries have only been appended, and no entries have been mutated or removed. The way this is done is by verifying that there's some subset of hashes that lets us recalculate the root hash where we verified consistency previously, and that this same set of hashes also appears on the right side of the proof, where we're verifying the new root hash.

So in this example, let's say that on the left we have a root hash where we've already verified consistency, and on the right is a new state where we've added some elements to the tree and want to verify consistency from the old state. On the left, we need the old root's two children, which are h(d1, d2) and h(d3). And on the right, we need the two children of the new root hash, which are h(d1, d2) and h(d3, d4). But once again, we don't actually need every single node in the tree, just these siblings, and we need that same set of nodes from the left side
to also show up, as a subset, within the set of nodes we use on the right. And once again, keep in mind the properties of a cryptographic hash function for why this works: if anything in the tree had been changed, the root hash would have changed, so I wouldn't have been able to re-verify the root hash on the left, the one where I had previously verified consistency.

One thing I'll mention here is that transparency logs are not equivalent to blockchains. If you're familiar with blockchains, a blockchain is effectively built on top of Merkle trees and has these same properties of being immutable and append-only. But there's a difference in how consistency, the fact that the structure is immutable, is enforced. Blockchains rely on things like proof of work and proof of stake, where it's computationally expensive, or you need to put up some money, to verify transactions on the chain. Transparency logs rely on a different mechanism, called gossiping, and we'll talk about that in a moment.

Let's first build up this gossiping protocol for keeping logs consistent. To do that, we're going to need what's called a witness. Like I mentioned before, transparency logs typically sign their root hash, so the log has a known key, and to trust the root hash I just need the log's verification key. Well, as a client, if I want to verify consistency, I can do it myself, or I can rely on a third-party witness to check consistency for me. Much like a client might persist some root hash to verify consistency from, a witness can do the same, and it creates what's called a counter-signature over the root hash. So in this case we effectively have two signatures over the same value, where the second one, on the right here, is a witness saying: I have verified consistency up to this point.

What witnesses primarily prevent is what's called a split-view attack, where the log effectively forks and presents different views to different clients. Let's imagine we've inserted two values, d1 and d2, into the transparency log, and for whatever reason, maybe the log is malicious or maybe it's misconfigured, it chooses to manipulate the second value, d2, in certain cases, and this results in effectively a different tree. If I have multiple witnesses asking the log for its latest root hash and verifying consistency, but the log chooses to fork for certain clients, what you'll end up seeing is that most of the witnesses counter-sign the same root hash, but one witness ends up counter-signing a different root hash. And to that witness, everything may look consistent, because it's always getting this different view of the log.

So as a client, or as another witness, how do I know if a split-view attack is happening? One way is that as a client, I can ask all of the witnesses I trust for these counter-signatures. Assuming I have the verification keys for all the witnesses, I can verify the counter-signatures and then compare the root hashes. If I see any root hash that differs from a root hash I've persisted, say from an inclusion proof, then I know there's a split-view attack, and I can alert the ecosystem to the problem. And for the witnesses themselves, we can come up with a protocol that allows them to find out whether other witnesses are seeing a split view. That protocol is called gossiping, and I'll mention that gossiping is a very active area of research within the transparency log space.
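At its core, the check a client (or a gossiping witness) performs is just: collect the latest counter-signed root hashes and see whether everyone agrees. Here's a small sketch of that comparison; the checkpoint type and its fields are illustrative, not a real witness-protocol message, and in practice you would verify the log's signature and each witness's counter-signature before comparing anything.

```go
package main

import "fmt"

// checkpoint is a simplified stand-in for a witness-countersigned root:
// in practice it would carry the tree size, the root hash, the log's
// signature, and the witness's counter-signature, all verified by the
// client before comparison.
type checkpoint struct {
	witness  string
	treeSize uint64
	rootHash string // hex-encoded root hash, shortened for readability
}

// detectSplitView compares the root hashes reported by different
// witnesses for the same tree size. If they disagree, someone is being
// shown a forked view of the log.
func detectSplitView(cps []checkpoint) {
	seen := map[string][]string{} // root hash -> witnesses reporting it
	for _, cp := range cps {
		seen[cp.rootHash] = append(seen[cp.rootHash], cp.witness)
	}
	if len(seen) > 1 {
		fmt.Println("split view detected:")
		for root, witnesses := range seen {
			fmt.Printf("  root %s seen by %v\n", root, witnesses)
		}
		return
	}
	fmt.Println("all witnesses agree on the same root")
}

func main() {
	detectSplitView([]checkpoint{
		{witness: "witness-a", treeSize: 4, rootHash: "9f2c..."},
		{witness: "witness-b", treeSize: 4, rootHash: "9f2c..."},
		{witness: "witness-c", treeSize: 4, rootHash: "1d7e..."}, // forked view
	})
}
```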
I won't go too deep into gossiping, but roughly, you can imagine a protocol where witnesses share their counter-signatures with other witnesses. If they see differing root hashes, they know there's a split-view attack or a misconfiguration, and once again they can alert the ecosystem to the problem.

As a client, beyond monitoring the cryptographic integrity of the log, I also might care about what's in the log. I mentioned that logs are meant to be publicly auditable, and so as part of this, I might want to periodically query a log to find out whether there are entries in it that are relevant to me. This could include things like identities that go into logs, or certificates, or verification keys, or artifacts. We'll take a look at this in a moment as we go through the applications of transparency logs.

The first one I want to mention is certificate transparency. Before certificate transparency came about, certificate authorities issuing public TLS certificates could be misconfigured, or maybe even malicious, and issue certificates that the domain owner never actually requested, and there was no way to hold certificate authorities accountable. That's the goal of certificate transparency: making the actions of certificate authorities publicly auditable. So as a domain owner, let's say I want to fetch a certificate for example.com. I'll request a certificate from some certificate authority, or CA, and present some proof that I'm the domain owner, and then the certificate authority will write that certificate to a log. Those logs can be audited by monitors, both for cryptographic integrity, that nothing is being deleted from the log, and by the domain owner, who can check whether a certificate for example.com shows up in the log that they never requested. The log returns a proof. This doesn't necessarily have to be an inclusion proof; it can be a signed value from the log that says, I promise I will commit this to the log (in certificate transparency, this is the signed certificate timestamp, or SCT). As a client, as a browser, I will then check this proof, which comes alongside the certificate. This is effectively an optimization so that a client doesn't always have to contact the log and ask for a proof of inclusion, though that's also acceptable here. The idea is that if the proof checks out, I know the certificate is actually in a log, and therefore the domain owner is able to audit every certificate issued for their domain. Basically, it forces any sort of malicious behavior out into the open.

The next example I want to bring up, which is particularly relevant for supply chain security, is binary transparency. Instead of putting certificates in a log, I'm going to upload artifacts. Let's say I have some artifact that I maintain; I'm going to sign it and upload it to a log. The benefit of this system is that it ensures all consumers receive the same untampered artifact, because as a client, I request an inclusion proof every time I see a signed artifact I want to use and check that it's actually in the log. Nobody can tamper with my local copy undetected, because I check that what I received is actually in the log. And as a malicious actor, if I were to upload a malicious artifact to the log with my own signature, the client wouldn't trust it.
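Put together, the client-side check in a binary-transparency scheme looks roughly like this: verify the signature, then verify that the signed entry is in the public log. This is a sketch using ed25519 for the signature, with a stubbed-out inclusion check standing in for the Merkle proof verification from earlier; the helper names are mine, not any particular system's API.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// checkInclusion is stubbed here; in practice it would verify a Merkle
// inclusion proof for the signed entry against a signed root hash, as
// in the earlier sketch.
func checkInclusion(artifact, sig []byte) bool {
	return true // stub: replace with a real inclusion-proof check
}

// verifyArtifact trusts an artifact only if (1) the signature verifies
// and (2) the entry is provably included in the transparency log.
func verifyArtifact(pub ed25519.PublicKey, artifact, sig []byte) error {
	if !ed25519.Verify(pub, artifact, sig) {
		return fmt.Errorf("signature does not verify")
	}
	if !checkInclusion(artifact, sig) {
		return fmt.Errorf("entry not found in the transparency log")
	}
	return nil
}

func main() {
	// The artifact owner signs; normally this happens elsewhere.
	pub, priv, _ := ed25519.GenerateKey(rand.Reader)
	artifact := []byte("release-1.2.3.tar.gz contents")
	sig := ed25519.Sign(priv, artifact)

	if err := verifyArtifact(pub, artifact, sig); err != nil {
		fmt.Println("rejecting artifact:", err)
		return
	}
	fmt.Println("artifact accepted: signed and present in the log")
}
```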
We can even go one step further: let's say the signing key of the artifact owner was compromised, and someone was able to upload a malicious artifact to the log that was correctly signed. Well, once again, as an artifact owner I can monitor these public logs to see everywhere my signing key is used, and if I see any uses that were unexpected, I know there was some compromise. And once again, you can have public monitors verifying the cryptographic properties of the log.

One example of binary transparency is Sigstore, a project that I work on. Sigstore aims to simplify the creation of digital signatures for artifacts and containers. It does so by providing a client that simplifies signature generation, and also some services. One of these services is a transparency log, represented in the middle here as Rekor; it allows you to upload signatures and metadata for binary artifacts. Another goal of Sigstore is to simplify key management: instead of having to manage your own keys, we associate an ephemeral signing key with a long-lived identity. Effectively, what this changes is your verification policy: instead of verifying against a public key, I verify against an identity, and as a maintainer, I don't need to hold on to a signing key, I just need to make sure I'm in control of my identity. This association is done with a certificate authority, which is at the top here, Fulcio.

So the typical flow is: as an artifact owner, I request a certificate from the Fulcio CA. I'll note that this certificate also gets written to a certificate transparency log, so that as the owner of an identity, I can monitor that log to make sure the certificate authority, or maybe the identity provider, is not misconfigured. The artifact owner signs the artifact and then uploads the certificate and signature to Rekor, the transparency log, which can be publicly monitored. Then we staple all of that together, the artifact, the certificate, and the proof that comes back from the log, and that can be uploaded to wherever a client is going to pull the artifact from. In this example I show a package repository like Maven or RubyGems. As a client, I then verify that proof to make sure the artifact is actually in the log, so I know the artifact owner can be monitoring the public log to see where that artifact shows up. And we can also imagine, in the bottom right of this example, that the client might ask the monitors or the witnesses for their counter-signatures in order to mitigate split-view attacks.

The last example I want to show is key transparency. This is where, for keys that are managed by key owners, the owners can upload to a public log an association between their key and some identity. Effectively, what this enables is the discoverability of users' verification keys in a way that presents the same view to all verifiers. It's just like binary transparency, except we're publishing keys. One other benefit of this system is that it provides a permanent log of all key changes. Let's say I need to rotate my key: maybe I've lost it, maybe for some reason it's expired, and I want to rotate it. Well, as a verifier, I can look in the log and see this change happen, and then I can, for example, go to the key owner and confirm that the key change was expected. And as a key owner, I can verify every place where my identity shows up.
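A key owner's side of that monitoring can be as simple as scanning the log for entries that mention their identity and flagging any key they don't recognize. Here's an illustrative sketch with made-up types; a real monitor would page through the log incrementally and verify inclusion proofs for the entries it reads.

```go
package main

import "fmt"

// logEntry is an illustrative stand-in for a key-transparency entry
// mapping an identity to a verification key.
type logEntry struct {
	identity string
	keyID    string
}

// auditIdentity scans log entries for a given identity and flags any
// key that the owner does not recognize.
func auditIdentity(entries []logEntry, identity string, expectedKeys map[string]bool) {
	for _, e := range entries {
		if e.identity != identity {
			continue
		}
		if !expectedKeys[e.keyID] {
			fmt.Printf("ALERT: %s is mapped to unexpected key %s\n", e.identity, e.keyID)
		}
	}
}

func main() {
	entries := []logEntry{
		{identity: "owner@example.com", keyID: "key-2023"},
		{identity: "owner@example.com", keyID: "key-unknown"}, // possible compromise
		{identity: "someone-else@example.com", keyID: "key-x"},
	}
	auditIdentity(entries, "owner@example.com", map[string]bool{"key-2023": true})
}
```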
If I see places where my identity is mapped to a key I don't expect, I know there was some compromise. And vice versa: maybe I see my key showing up associated with identities that aren't mine. So the takeaway from all of these examples is that transparency logs force everything to be out in the open and provide a permanent record of the changes that occur.

Let's switch gears and talk a little bit about privacy law. There are a bunch of goals of privacy law. Typically, privacy law is about protecting consumers from data collectors and giving consumers certain rights. Some of these include transparency around how the collected data is shared and accessed by the data collector. They could also include access to the data and portability, so that I can take my data to another data collector if I want to switch services. They might also include the right to object to data collection; we've all seen things like cookie pop-ups, for example. And there are also security requirements around the collected data, for example encryption at rest and in transit. The one I want to focus on is the right to deletion: as a consumer, or really as a producer of data, I should be able to have my data deleted by a data collector.

Typically, privacy law is just about individuals, but in this example I want to mention corporations too, because I think it's worth calling out that there might be secrets a corporation wants removed for some reason. When we talk about the information that gets deleted for individuals, typically we're talking about PII, personally identifiable information. This could include things like your name, email address, physical address, phone number, or any other sensitive information. For corporations, this might include business secrets. Thinking about the example with certificates, this might be a domain associated with an unlaunched product: maybe the business didn't mean to issue a publicly trusted certificate, and now that they have, their domain is out in public.

Like I've mentioned many times, transparency logs are immutable and append-only, which means what goes into them inherently can't be deleted. I think it's also worth noting that logs can be mirrored, because they're public. Maybe they're mirrored for performance reasons, for example to put the log into a different region so it's faster to contact that log and get inclusion proofs. So there's a question here: if I want to delete data from one log, how do I also delete it from all the mirrors? Another interesting question is, do logs provide a benefit such that they should be able to opt out of a right to deletion? There is precedent for this already in some privacy law: for example, governments or healthcare can have exemptions, and there are options to opt out for statistical or historical collection purposes. I don't have an answer to this; I don't know whether logs should be able to opt out, but I think it's worth considering as privacy laws continue to grow and evolve.

For now, I want to assume that we'll need a technical solution. And so that's where we come to building a redaction mechanism: a way to remove data from a log when requested, while not breaking the cryptographic integrity of the log. These are the primary guiding principles for our solutions, and as we'll see in a minute, not all the solutions follow all of them.
So the first guiding principle is usability: the log must remain verifiable. We cannot break the cryptographic properties of immutability and append-only behavior. Another is that I don't want the log to be overly trusted. I want to make sure we can hold the log accountable for redactions: one, that it actually redacts the information, and two, that it can't redact too much information.

The first two examples I'm going to call destructive approaches, and we'll see that these don't work for keeping the log verifiable. The first might be: let's say I want to delete d4 from the log, so I simply remove that leaf node. In the example on the left, I just delete the leaf node; in the example on the right, I delete the leaf node and also recalculate all of its parent nodes up to the root hash. Well, the example on the right fundamentally breaks consistency proofs, given how we saw those are calculated. The example on the left breaks when somebody goes to audit the entire log: they request every single leaf node to verify that all the internal nodes are still consistent, and because there's no d4, they can't verify that parent node. So neither variant works. Another option is simply removing the log. Now, this is technically an option, you could, but it's going to be very disruptive to clients to remove an entire log over a single entry. One thing to note is that logs are typically sharded by year. This is a performance optimization to prevent the log from growing indefinitely and requiring a lot of storage. So maybe if we keep the sharding window very small, this might be a possible solution. But ideally I'd like to avoid this approach, because it's somewhat of a scorched-earth strategy and very disruptive to clients.

The next two approaches I call prevention mechanisms. They don't actually delete PII from the logs; they aim to keep it out. To give some examples, let's say we're uploading certificates to a log. The first thing might be: let's just not misuse certificate transparency, and recognize that what goes into a certificate is public. So one example might be adding some enforcement mechanism to make sure you're not publishing private domains or requesting certificates for private domains. Another example might be avoiding putting PII into custom X.509 extensions, and maybe asking certificate authorities to do some sort of cursory check to make sure PII doesn't wind up in extensions. For uploading artifacts, it's the same thing: let's make sure we're not publishing private artifacts. But obviously, these are just prevention mechanisms; things still slip through the cracks.

On the identity side of things, one option would be relying on pseudonymity: having some mapping between a pseudonymous identifier and a private identifier like an email address. So maybe when I'm uploading a mapping between an identity and a key to a log, I use my pseudonymous identifier, and as a verifier, I'm given the private mapping somehow. I'm being somewhat vague here because that verification policy is, to me, the primary challenge. If I still need the private identity to verify things, how do I create that mapping? How do I distribute it in such a way that my identity remains private? Also, this solution only works for identities; it doesn't work for all the examples.
And once again, this doesn't actually remove anything from the log; it's just a prevention mechanism. One thing I'll mention here is that there's some very interesting research around zero-knowledge proofs that might give us ways of presenting pseudonymous identifiers in logs, but that's very active research.

The next two examples I think are the most promising. The first is that we build a log front end in front of the transparency log and make it responsible for tracking what's been redacted, but the data doesn't actually get deleted from the log. So for example, if d4 has been redacted and I request it from the log, the front end returns an error. There's actually an example of this: the NFT marketplace OpenSea was the best example I found, for respecting copyright takedown requests. If some copyrighted artwork winds up in an NFT and a takedown is requested, the OpenSea front end removes access to that NFT. But obviously that NFT still exists on the Ethereum blockchain, so you can just go view it there. There's also a lot of centralization in this approach, in that we rely on the log front end to do the right thing. And as witnesses, we still need to verify the cryptographic integrity of the log, so we still need access to all the leaf nodes, meaning witnesses still have access to the PII.

So now we come to the last solution, which I think is the most promising. As I talked about previously with Merkle trees, the data itself typically isn't in the tree, only hashes of it. So what if we took that same approach here? We move the data out to some database, for example a content-addressable store where data can be referenced by its digest, and the log holds pointers to that data. Whenever somebody requests an entry from the log, the log resolves the pointer, looks in the database, and returns the data along with the proof. In the case where I want to redact some data, the log remains intact; it remains cryptographically verifiable, because all of the nodes are still in the log. But the underlying record in the database, where the data, the PII, actually lives, has been deleted. So if we try to resolve that pointer, there's no data there, and the log can return an error saying the entry has been redacted.

Now, like our previous solution, this does put additional trust in the log. It requires, for example, that the log can't redact everything: in this example the log could just choose to delete the data behind d1, d2, d3, and d4, and there's no mechanism to prevent that, because the data values are not part of the tree, so they don't have that same immutability property. So potentially, and maybe this is a bit of a turtles-all-the-way-down solution, we create more transparency logs: every time a redaction occurs, the log publishes that redaction to a separate redaction transparency log. Obviously that log also has to be monitored, but maybe it could be operated by a different party than the one operating the main transparency log, and monitors can verify that there are no entries redacted from the main log that don't wind up in the redaction log, and vice versa.

So wrapping up, it's probably clear that there's no straightforward answer here, and I think that's an interesting part of this problem. The last approach is probably the most promising, and if we needed to do something very soon, that's what I would recommend. But this is an area with an opportunity for more research, to see whether there's some other potential solution.
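To make that pointer-based approach concrete, here's a minimal sketch (the types and names are mine) of a content-addressable store sitting next to the tree: the tree commits only to digests, so deleting the bytes behind one digest redacts the entry without touching any node or proof.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

// The tree commits only to digests; the actual entry bytes live in a
// separate content-addressable store keyed by that digest. Redacting an
// entry deletes it from the store but leaves every node of the Merkle
// tree -- and therefore every proof -- intact.
type contentStore struct {
	blobs map[[32]byte][]byte
}

var errRedacted = errors.New("entry has been redacted")

func (s *contentStore) put(data []byte) [32]byte {
	d := sha256.Sum256(data)
	s.blobs[d] = data
	return d // this digest is what gets committed into the Merkle tree leaf
}

func (s *contentStore) get(d [32]byte) ([]byte, error) {
	data, ok := s.blobs[d]
	if !ok {
		return nil, errRedacted
	}
	return data, nil
}

func (s *contentStore) redact(d [32]byte) {
	delete(s.blobs, d)
	// A production log would also publish this redaction to a separate,
	// monitored redaction log so the log can be held accountable.
}

func main() {
	store := &contentStore{blobs: map[[32]byte][]byte{}}

	d4 := store.put([]byte("entry containing PII"))
	// ... d4 is committed into the Merkle tree as a leaf ...

	store.redact(d4)

	if _, err := store.get(d4); err != nil {
		// The proof chain up to the root still verifies; only the
		// underlying data is gone.
		fmt.Println("lookup failed:", err)
	}
}
```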
But like I said, if we needed to do something immediately, I'd probably recommend a solution in which we move the data out of the log and have some way of monitoring when redactions occur. Thank you very much. I think we have some time for questions.

Yeah, so the question is: what about being able to revert a transaction in the log? In this case, I'd imagine that because we're mutating some state in the log, we're effectively removing an entry, and so that's going to break the promise of being append-only. I think with that approach, the problem would be that the data still exists in the log and is still publicly auditable. So even though we can say "no longer trust this," and for what it's worth that would be a good mechanism for building revocation, for example, where we don't care that there's PII in the log, we just want to say "no longer trust this" for some reason, I think if we want a redaction mechanism in which something no longer shows up in the log, that wouldn't be sufficient, since the data would still actually be in the log.

The next question is: what are the monitors in these diagrams? So we'll go back to this example. A monitor here can play two roles. Either it's a witness, which is verifying the cryptographic integrity of the log, or it can be inspecting the entries themselves. So technically the artifact owner could also be one of these monitors, or maybe a monitor is verifying other properties; for example, in the case of certificates, maybe it's looking for weak keys in the log. So those are, I think, the two primary roles: either verifying cryptographic integrity, where you don't actually care what's in the entries, just that the log itself remains consistent, or doing introspection on the individual entries.

Yeah, so the question is: does Sigstore have automated monitors? That's actually a very timely and great question. Sigstore is actively working towards general availability, where we're going to offer stronger guarantees around SLOs, for example, and one of those things is also going to be making sure we have monitors. Something I've recently been working on is a monitor based on GitHub Actions, so it's very easy for anybody to set up and run to start verifying consistency. Then we'll need to figure out some mechanism for notifying people; we could use things like GitHub issues. There's also some other work: I'll mention that our transparency log is built on top of a framework called Trillian, which is effectively a scalable Merkle tree implementation, and there's a very interesting repository called trillian-examples that has an example of a witness called OmniWitness. It's an easy-to-run witness that performs these consistency checks and then uploads its counter-signatures to GitHub, so that as a client I could query GitHub, or maybe we could also extend it to upload the counter-signatures somewhere like GCS. That's a really great question. Awesome, well, thank you very much.