I think it's time for me to get started. Hello everyone, and thank you for joining me for my talk. Today I'd like to tell the story of a globally distributed data system that we built for one of our clients, and that we optimized using reinforcement learning to save a bunch of money. So hopefully it'll be an entertaining story.

First, though, I should talk about the motivating use case and give you a little bit of background on why we were building this thing. In 2016, the FATF, the Financial Action Task Force, wrote a rule that said cryptocurrencies and blockchain virtual assets like NFTs now have to be regulated like banks, so that when assets of a certain fiat value cross a border, they fall under the regulation of what's called the travel rule. What the travel rule means is that VASPs, virtual asset service providers, which are kind of like blockchain banks, have to perform sanctions checks and know who the originator and the beneficiary are of any transaction. The purpose of this regulation is anti-money laundering and anti-terrorism financing. And even though that ruling comes from a single regulatory body, almost every country in the world has a similar version of this regulation. So now, in 2023, whenever you have a cross-border blockchain transaction, whether it's Bitcoin or whatever, your VASPs, as long as you're not a self-hosted wallet, must comply with the travel rule.

Now, as you might imagine, VASPs are tech companies. They're interested in consensus and blockchain and cryptocurrencies. They're not banks, and they don't have big compliance departments, as you might expect from recent news about different types of VASPs. So they got together and formed a nonprofit open source working group called TRISA. TRISA stands for the Travel Rule Information Sharing Architecture, and the goal of TRISA was to enable a decentralized, peer-to-peer exchange of the PII of the account holders in these transactions in order to comply with the travel rule. As you might expect, it was very tech-centric: they took a certificate authority model, they used mTLS, and they wanted it to be decentralized, so it had to be a peer-to-peer sharing network. So that's the premise of this application.

So what does TRISA do? Well, in a peer-to-peer network, you have to figure out who your peers are, right? And you have to be able to trust your peers. So what TRISA was going to build was something called the Global Directory Service. The GDS was a service that would hand out certificates, host CRLs, certificate revocation lists, and conduct a process known as Know Your VASP, or KYV, where TRISA could assert that a business entity in Japan is actually a legal entity, a company in Japan, and that it has a legitimate business purpose for requiring this PII, and similarly that another VASP in Lithuania is going to protect that PII to the standards that the Japanese VASP requires.

So as you can see, by its very nature, the travel rule requires a globally distributed system, right? And so we put up six Kubernetes clusters: in Iowa, Sao Paulo, Frankfurt, Sydney, Singapore, and Tokyo. Now, as we were thinking about the data storage for this, we were on Google Cloud, and so our first thought was: okay, what's the globally distributed database? Let's do a cost estimate for, say, Spanner. And the number that we got from the Google pricing calculator was just astounding.
And you can imagine that when I talked to this nonprofit board and the chairman of the board, and I said, hey, we're going to have to spend over $100,000 a year in infrastructure costs, he just laughed right in my face. That was just not going to happen.

A little breakdown of these costs. We have this blue area here, which is the fixed cost of Kubernetes, right? Honestly, these are three-node clusters; they're not very big. We pay for the control plane, pay for three nodes, and then pay for some boot disk storage on there. Then the other cost is egress, right? That's a big one when you have a globally distributed system: the price per gigabyte that you're going to pay replicating data across these different replicas, or just internet traffic, or data moving between the different VPCs and regions and things like that. So that's actually a major cost, but it's not the biggest; that's that little sliver here. The majority of the cost came from the Spanner nodes. And I didn't do anything special. We're talking about a 512 gigabyte system, right? What we put into the calculator was just one Spanner node in each of these regions, 512 gigabytes: $8,500 a month. Just way too much. So we obviously got sent back to the drawing board.

That led us to this question: okay, what are our actual data storage requirements? This is an interesting question, and part of this talk is that I want to encourage you to ask it more often. In my technical career, what I've experienced is that most people choose their data storage systems based on familiarity, like, if all you have is a hammer, everything looks like a nail, right? They're familiar with it, they're just going to use it, and they're not going to use anything else. Or, and here at KubeCon we might experience this a little bit, you might be persuaded by the marketing and branding of the venture-backed companies behind some of the larger distributed databases. Or, frankly, you're just going to go with whatever your cloud hosting provider gives you. But I think that we as technologists should start thinking about what our actual data storage requirements really are.

So, knowing something about Spanner, we asked: do we really need strong consistency with a distributed SQL model? And the answer was no, not really, in fact, not at all. Our workload is extremely read-heavy. We get reads at a frequency of about 30 per second, and they happen 24 hours a day, right? Now, the world turns, so in daylight hours you get more frequent reads from some regions than from others, but the searches, lookups, and scans of the data store are just constantly happening. Writes, on the other hand, are periodic and bursty. What we see is about 250 writes to a handful of objects over a few-day period, and that happens about every 12 months, right? There are 800 VASPs in the system. They don't all write at the same time, but on the same collection of objects there's a burst of writes every 12 months, and then there are really no writes to those objects in between. We also noticed that writes aren't happening in all regions. They're only happening in two: the region where the administrators are, and the region where the VASP is. They're not coming from Frankfurt and Sao Paulo and Sydney. They're really just two-region writes.
So we thought to ourselves, hey, why don't we build our own replicated storage system? Now, that probably leads you all to the very obvious question: should I be building my own distributed database? That's not for this talk, but if you would like to speak to me after the conference, I'm happy to talk with you about your workload and brainstorm whether that's the thing to do or not. And if you look at last year's KubeCon, my colleague Rebecca Bilbro gave a talk on that topic, on whether or not you should. But I'm not trying to answer that here. We, rightly or wrongly, did do it.

So what kind of replication did we use? We used a type of replication called anti-entropy. When you have a system that is fully replicated, meaning every object has to live on every single replica, which in this case is a pod with its own data storage, its own disk space, then as you write to a replica, its state diverges from the state of the rest of the system. As you write to different replicas, the state of the system keeps diverging more and more, and it experiences entropy, right? Because you might ask two replicas the same question, what is the value of X, and get two different responses. That's entropy in this case. Anti-entropy is periodic synchronization. At some routine interval, a replica selects another replica in the system and they synchronize so that their versions all match. And it happens in a two-phase way; that's the bilateral part. The local replica selects a remote peer and pushes its version vector of all of its objects over to the remote peer. The remote peer compares its own state and sends back any versions that are later than the local, initiating replica's versions. It also sends a request for versions, and that's the pull part: I need these objects that you have that are later than my versions. At the end of this bilateral anti-entropy, this push-pull, both the local replica and the remote peer are synchronized. They have the same versions, until entropy diverges their state yet again.

This, of course, is an eventual consistency model, which I think is a term we're hearing a lot at this conference, but all that really means is that if you make a write, in the absence of any other writes, that write will eventually appear on all replicas in the system within some amount of time that I'm going to get into in a little bit. This system also uses distributed, conflict-free version numbers, Lamport scalars, in order to detect conflicts. So in case there's a write on two machines, one of those writes is always going to win, and that's called the latest writer wins policy.
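To make the push-pull concrete, here is a minimal Go sketch of one bilateral anti-entropy session with latest-writer-wins resolution. The Version, Replica, and Gossip names are illustrative, not the actual TRISA or Honu API:

```go
// A minimal sketch of bilateral (push-pull) anti-entropy with
// latest-writer-wins; illustrative only, not the Honu implementation.
package main

import "fmt"

// Version is a Lamport scalar paired with the writing process ID;
// ties on the scalar are broken by PID so exactly one writer wins.
type Version struct {
	Scalar uint64
	PID    uint32
}

func (v Version) Later(o Version) bool {
	return v.Scalar > o.Scalar || (v.Scalar == o.Scalar && v.PID > o.PID)
}

type Replica struct {
	Name    string
	Objects map[string]Version // object key -> latest known version
}

// Gossip performs one bilateral anti-entropy session: afterwards both
// sides hold the latest version of every key that either side knew about.
func Gossip(local, remote *Replica) {
	// Push phase: send the local version vector; the remote keeps any
	// versions that are later than (or missing from) its own state.
	for key, v := range local.Objects {
		if rv, ok := remote.Objects[key]; !ok || v.Later(rv) {
			remote.Objects[key] = v
		}
	}
	// Pull phase: fetch remote versions later than (or missing from) ours.
	for key, rv := range remote.Objects {
		if lv, ok := local.Objects[key]; !ok || rv.Later(lv) {
			local.Objects[key] = rv
		}
	}
}

func main() {
	a := &Replica{"iowa", map[string]Version{"vasp:42": {Scalar: 3, PID: 1}}}
	b := &Replica{"tokyo", map[string]Version{"vasp:42": {Scalar: 5, PID: 2}}}
	Gossip(a, b)
	fmt.Println(a.Objects["vasp:42"], b.Objects["vasp:42"]) // both {5 2}
}
```

Collapsing the per-object history down to a single Lamport scalar is exactly the simplification that makes the latest writer wins policy possible, and it is also why forked writes have to be detected rather than prevented.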
Now, this has important implications for consistency. So, just putting on my professor hat a little bit, I'm going to say that consistency is a spectrum. When we say things like eventual consistency, strong consistency, sequential consistency, causal consistency, linearizability, those are really just points or ranges on the consistency spectrum. More specifically, in this type of system, there are two kinds of inconsistency that we're worried about. The first is stale reads, which means that when I ask a node to give me the latest version, it doesn't actually have the latest version; it has an earlier one. In the TRISA system, that's okay, right? These reads are happening 30 times a second, and it's pretty common to retry these peer-to-peer exchanges in the case of a failure or a stale read. So we don't actually care about stale reads. What we do care about is forked writes, where there's a series of writes to one version of an object and to another version of the same object concurrently, and we don't know which version of the object we should really keep. That's bad, because it can introduce compliance and regulatory issues. So we absolutely need to minimize that in our system.

Now, there is a paper by Peter Bailis, if you're interested in the academic literature on consistency, called Probabilistically Bounded Staleness. It's a very easy read, it's an excellent, fun paper. What it says is that the consistency failures you might experience are relative to how frequent your accesses are, right? There's some probability based on your access rate, and also on your system size, I should say that too. What that means is that if you're only reading once a minute, but your system can fully synchronize in 30 seconds, it's very unlikely that you'll ever experience a consistency issue, if that makes sense. Because you have a full 30 seconds where the system is in a stable, consistent state for you to do those reads.

What that means for us is that we're concerned about visibility latency: how long it takes a write to fully propagate to the rest of the system. As part of my dissertation, I defined the visibility latency as

    λ_v = t · log₃(n) + ε

where t is the time between anti-entropy intervals, log₃(n) reflects the size of the system, and ε is this weird extra parameter. And now we start to get into the reinforcement learning part of things: epsilon is noise, it's variability.

So let's talk about the idea of peer selection for anti-entropy, right? When you do bilateral anti-entropy, you uniformly randomly select a peer, like picking someone in this audience. If we were all replicas and I wanted to perform anti-entropy with one of you, all of you would have an equal chance of being selected for synchronization. And this has a number of great properties. One, it will eventually converge, meaning that a write will appear on the entire system eventually. It minimizes broadcast: I can synchronize with you, you can synchronize with someone else, so I don't have to shout out to everyone, and that's great for minimizing the number of packets in the network and your egress costs. It's fault tolerant: if someone leaves the room and comes back, they will get updated. And it's dynamic: if people come in or leave, they can get updated, so there's no joint consensus, the system size is variable, and it can change over time.

But the noise is the problem, right? The noise is this idea that a peer might select another peer where there's no synchronization to do. That delays the time it takes a write to propagate across the system. And experimentally, what we found was that this green line is perfect synchronization, right? That is the ideal visibility latency relative to system size, given an anti-entropy period of 125 milliseconds. This is real life, right? And you can see that there's a big gap between the mean visibility latencies that we were seeing in actual systems and that perfect system.
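To make that formula concrete, here's a back-of-the-envelope calculation. The 18 replicas corresponds to our six regions of three replicas each, but the 300 millisecond noise term is a made-up number purely for illustration:

```go
// Back-of-the-envelope visibility latency from lambda_v = t*log3(n) + epsilon.
package main

import (
	"fmt"
	"math"
	"time"
)

// visibilityLatency estimates how long a write takes to reach all n
// replicas, given the anti-entropy interval t and a noise term epsilon
// (the penalty for selecting peers that yield no synchronization).
func visibilityLatency(t time.Duration, n int, epsilon time.Duration) time.Duration {
	hops := math.Log(float64(n)) / math.Log(3) // log base 3 of n
	return time.Duration(float64(t)*hops) + epsilon
}

func main() {
	// 18 replicas (6 regions x 3), a 125ms interval, and a hypothetical
	// 300ms of selection noise: roughly 630ms of visibility latency.
	fmt.Println(visibilityLatency(125*time.Millisecond, 18, 300*time.Millisecond))
}
```

With ε at zero this is the perfect green line; everything the reinforcement learning does is aimed at squeezing that ε term down.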
And if forked writes can cause regulatory problems, we have to do better than that consistency model. So that's where we brought in reinforcement learning.

A quick definition: what is reinforcement learning? Reinforcement learning is this idea where intelligent agents make choices that operate on some sort of environment, usually some sort of data-centric environment, and they observe the outcomes in order to maximize some cumulative reward function. Typically, reinforcement learning is used in gaming, for AI agents and NPCs, or in robotics, but it is widely used in database machine learning as well. So we're going to use a type of reinforcement learning called multi-armed bandits in this case. It is not the most sophisticated of reinforcement learning techniques; there's Q-learning and adaptive policy learning and neural-network-based learning, but this is a pretty easy one to explain from the perspective of a system.

The multi-armed bandit problem is this: consider that you're faced with a bunch of slot machines. Each of those slot machines has an arm you pull, right? That's the one-armed bandit. You have multi-armed bandits because you have multiple arms you can pull, and they all have different payouts and odds, but you don't know what they are ahead of time, right? How do you maximize your payout? Now, you can probably imagine that you're going to pull one arm and see what happens. Did you get a payout? If you don't get a payout, you pull a different arm. And if you do get a payout, you pull that arm again. And so mathematically, what reinforcement learning and multi-armed bandits do is figure out how you maximize that payout.

The main thing that we have to worry about with multi-armed bandits is this idea of exploration versus exploitation. So let's say that you find a slot machine where you pull the arm and it gives you three diamonds, right? Not as good as the triple sevens, but you're getting some small payouts from that machine. You have to decide: do I want to go look for another machine that might give me triple sevens, or do I want to stick with this machine and keep maximizing my reward for the triple diamonds? That's exploration versus exploitation. There are a number of mechanisms to deal with this, including Bayesian sampling and upper confidence bounds, but the one we employed is called epsilon-greedy, where with probability epsilon, and I'm sorry, this is a different epsilon than the epsilon from the earlier slide, just to make that clear, we choose any peer in the system, and with probability one minus epsilon, we choose the best peer in the system. And then there's a variant on this called annealing epsilon-greedy, where epsilon starts large, so you're doing a lot more exploration at the beginning, and then it anneals, or decreases, over time, so that you eventually favor exploitation over exploration.

So how does this apply to our system? Well, what we want to do is reduce that noise from uniform random selection. We want to learn the optimal topology of the system based on the accesses in the system, so that we prioritize peer selections that lead to synchronization.
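Here's a minimal sketch of annealing epsilon-greedy peer selection; the field names and the annealing schedule are my own illustration rather than our exact implementation:

```go
// A minimal annealing epsilon-greedy selector; illustrative names and
// schedule, not the Honu API.
package main

import (
	"fmt"
	"math/rand"
)

type Selector struct {
	Epsilon float64   // current exploration probability
	Decay   float64   // multiplied into Epsilon after every selection (annealing)
	Floor   float64   // never anneal below this, so every peer stays reachable
	Values  []float64 // running reward estimate per peer
}

// Select returns a peer index: with probability epsilon a uniformly random
// peer (exploration), otherwise the peer with the best reward estimate so
// far (exploitation). Epsilon anneals toward the floor on every call.
func (s *Selector) Select() int {
	defer func() {
		if s.Epsilon > s.Floor {
			s.Epsilon *= s.Decay
		}
	}()
	if rand.Float64() < s.Epsilon {
		return rand.Intn(len(s.Values))
	}
	best := 0
	for i, v := range s.Values {
		if v > s.Values[best] {
			best = i
		}
	}
	return best
}

func main() {
	// 17 candidate peers (18 replicas minus ourselves), starting at 50%
	// exploration and annealing down to a 5% floor.
	s := &Selector{Epsilon: 0.5, Decay: 0.999, Floor: 0.05, Values: make([]float64, 17)}
	fmt.Println("selected peer:", s.Select())
}
```

The floor matters: without it, a peer whose estimate drops to zero would never be tried again, even if the access patterns later change in its favor.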
And so what we did is we created a policy-based reward function, and this is the reward function table right here. It's bilateral, so we have a push side and a pull side, and basically you get points, right? If you synchronize at least one object on a pull, you get 0.25 points. If you synchronize on a push, you get 0.25 points. That's the major one; we really want to reward synchronizations. If you synchronize multiple objects, you get even more reward, and if your latency is lower, you get even more reward, right? We want to get to the fastest visibility latency; we want to remove all those latencies from our connections. So this was our reward function.

Generally speaking, when you're looking at reinforcement learning, you are looking at creating a system of policies, and these are actually very easy to set up with a different type of reinforcement learning called Q-learning, where you create Q-tables and you try to optimize policies. So you could think of this one as a consistency-based optimization, but if you want to add rate limiting policies, security policies, node scaling policies, you could add any sort of policies you wanted to this reward function and optimize your system for the use case you're working on. That's one of the very big benefits of reinforcement learning in a distributed systems context: it's up to your imagination, and these policies can be as complex or as simple as you would like them to be.

And what happens when you do this is you get a globally emergent decrease in visibility latency. So this is another big idea that I want to make sure I share with you: the idea of global emergence. Global patterns emerge from the interactions of simple local agents. In our case, each replica only knows about the peers that it is most rewarded for selecting. It doesn't know anything about the system as a whole, and it certainly doesn't know anything about visibility latency. The global pattern that we see emerge from all of these local replicas behaving independently is this global improvement in our overall consistency model.

And actually, when you visualize this, it becomes very apparent very quickly. The vertices in these graphs are our replicas. The size of a vertex is the number of accesses on that replica, and the thickness of an edge between two vertices is the number of synchronizations between those replicas. On the left is uniform random selection: just a giant hairball, a mess of connections going every which way. After we applied reinforcement learning, a distinct topology emerges. Oh, and I should say the color is the region, right? So you can see that in each region we have three replicas: a primary and two secondaries. The primary is where most of the local accesses are happening, and we see interesting things happen, like the secondaries become responsible for replicating across regions while the primary replicates locally, which means that our user accesses happen much faster because they're not hitting a bogged-down replica. We also see that there are no long-distance transmissions in this topology: Tokyo goes to Sydney, Sydney goes to Singapore, Singapore goes to Frankfurt. It doesn't go Frankfurt to Tokyo, right? So the replication happens in a way that minimizes latency overall.
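To make the reward function concrete, here's a sketch of scoring a single synchronization session. The 0.25 points for a successful push or pull come straight from the table above; the bonus weights for object count and latency are illustrative assumptions, since the exact shape of those bonuses is a policy choice:

```go
// A sketch of a policy-based reward for one anti-entropy session. The 0.25
// push/pull values come from the talk's table; the bonus terms are assumed.
package main

import (
	"fmt"
	"math"
	"time"
)

func reward(pushed, pulled int, latency time.Duration) float64 {
	r := 0.0
	if pulled > 0 {
		r += 0.25 // synchronized at least one object on the pull
	}
	if pushed > 0 {
		r += 0.25 // synchronized at least one object on the push
	}
	// Bonus for synchronizing more objects (assumed weight, capped at 0.25).
	r += math.Min(0.25, float64(pushed+pulled)*0.01)
	// Bonus for lower latency: full marks under 50ms, fading to zero at 500ms.
	if frac := 1 - (latency.Seconds()-0.05)/0.45; frac > 0 {
		r += 0.25 * math.Min(1, frac)
	}
	return r
}

func main() {
	// A session that pushed 3 objects and pulled 7 in 80ms scores ~0.83.
	fmt.Println(reward(3, 7, 80*time.Millisecond))
}
```

Swapping in rate limiting or scaling policies would just mean adding more weighted terms to this function, which is what makes the approach so flexible.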
And this was really just a first experiment with a very simple policy, right? We could easily have created more policies and gotten different globally emergent behavior, but the results on our first try were just so good that we stuck with it. Now, that's how we were dealing with consistency, but I also want to talk about how we optimized the price tag of running this distributed system, with a couple of other things that also involve probability.

Oops, and I forgot to mention the real-world results; these are actually the results here. On the left is a graph that shows the reward of the system. The purple line is the cumulative reward for uniform random selection; the orange, red, and green lines are annealing epsilon-greedy with different epsilon values. And what we saw is that the system very quickly starts beating uniform random selection. And it turns out this is actually adaptive: if the access patterns change, then the system will adapt itself to the new access patterns. Not in real time, it takes about 150 time steps, and remember, those time steps are 125 millisecond intervals, but it does take less than an hour, let's say, to adapt the system to new access patterns. In this chart, the green is the visibility latency under uniform selection, and the orange and blue are the visibility latencies in the reinforcement learning system. Visibility latency decreased on average by 490 milliseconds, that is, by about four synchronization intervals, using the reinforcement learning approach. And remember what I said about probabilistically bounded staleness? At a read rate of 30 hertz, or 30 reads a second, that 490 milliseconds really improves our consistency model. So it was a very important result.

We also added some management for forking, so that we could totally eliminate those regulatory or compliance issues, by using a machine learning function to identify forked writes and attempt to automatically merge the documents in the system. If we couldn't merge them automatically, then we presented them to administrators to deal with, to figure out how to put the documents together, and that completely eliminated the conflicts in our system. But obviously you don't want to be telling your administrators, hey, look at these documents, all the time, so it was nice that we had that increased consistency from the reinforcement learning before we even got to that point.

Back to the cost side: we added object sampling, right? Again, the big variable cost that we have is egress. Shipping all of these object vectors all over the place is not affordable; we don't want to be sending every version vector every 125 milliseconds, because we're going to get charged for that. So instead of sending them all, we use a probability decay function to select which objects are included, so that the more recently an object has been written to, the more likely it is to be included in a synchronization. As an object gets older and older, as we get toward that 12-month mark, it stops being synchronized, until there's a burst of writes again and then it starts being synchronized again.
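Here's a sketch of what that recency-based sampling can look like; the exponential decay with a 30-day half-life is an assumed shape, not our exact decay function:

```go
// A sketch of recency-based object sampling: recently written objects are
// almost always included in anti-entropy; year-old objects almost never are.
// The 30-day half-life is an assumption for illustration.
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

const halfLife = 30 * 24 * time.Hour

// include decides whether an object belongs in this synchronization,
// with probability 0.5^(age/halfLife) given its last write time.
func include(lastWrite time.Time) bool {
	age := time.Since(lastWrite)
	p := math.Pow(0.5, age.Seconds()/halfLife.Seconds())
	return rand.Float64() < p
}

func main() {
	fmt.Println(include(time.Now().Add(-24 * time.Hour)))       // usually true
	fmt.Println(include(time.Now().Add(-365 * 24 * time.Hour))) // almost never
}
```

The effect is that the version vectors we ship every 125 milliseconds stay small during the quiet eleven months of the year, and grow only during those annual write bursts when synchronization actually matters.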
So what was the final result? Using the Google calculator, we went from $8,500 a month down to $2,500 a month; we essentially replaced the Spanner cost with disk cost, SSDs that were 512 gigabytes each. Our egress costs also went from $276 a month down to about $38 a month, so we massively dropped our egress costs as well. And this was a number the TRISA board, the nonprofit working group, could swallow. Now, this is based on the Google Cloud cost estimator, in order to compare apples to apples, and if anyone wants to see those spreadsheets, I'm happy to provide them. In reality, we're actually spending about $1,700 to $1,800 a month in billing, and that's primarily because we're not actually storing 512 gigabytes of data; we have a little bit less in the system. We just wanted that ceiling to make sure we didn't run out of disk space.

So what are some of the takeaways from this talk? Well, the first is that the data systems we choose really need to be based on our workloads, not on our familiarity with them or on the marketing. We want to make sure we suit our data systems and our replicated storage systems exactly to what we're doing. And I realize that's harder than it sounds, because a lot of the time we don't know about the accesses in our systems until we've deployed them. But that's a really good case for reinforcement learning, right? To learn the accesses inside the system for us.

A second important takeaway is that machine learning techniques can and should be used to optimize distributed systems. Here at KubeCon, if there's one thing I can ask, it's that I never want to have to put requests or limits in a YAML file ever again. I don't understand why the controllers aren't just figuring that stuff out for me. I think reinforcement learning might be a good way to figure out what those requests and limits should be, rather than relying on the heuristic scheduler algorithms and the things that already exist. I shouldn't ever have to put those into a Helm chart, right? That should be done by some sort of machine learning system. And this is a place where we can explore a lot of different ways to optimize different types of globally emergent distributed systems.

This is also an interesting design pattern. I don't know how many of you develop distributed systems, but there are a couple of different models, including the actor model, for how we understand distributed systems. I want to suggest that we should think more about homogeneous emergent systems. When you're creating a service, think about its interactions as producing the global behavior, and remember that the global behavior is different from those interactions. You might be surprised by what you can actually do when you design systems with emergence in mind.

And finally, if you see any sort of probability in your systems research, you should think reinforcement learning, right? Reinforcement learning is a very good way to estimate and customize probability. It's certainly far better than using uniform randomness or some other flip-a-coin technique.

So that's all I had. I'm happy to take some questions, but first, a couple of asks. The system that I described, we are porting out of the TRISA codebase into its own open source repository called Honu. So if you are interested in using that system, replicating our experiments, or messing with the reward function, it's at github.com/rotationalio/honu. And if you could give us a star on that, we'd really appreciate it. Otherwise, I would invite you all to come up and talk to me here at the conference. If you see me in the hallway, I'm pretty easy to talk to.
And I know maybe it's hard to get to talk to me right after the talk, but anywhere you see me in the conference, just pull me aside and I'd be happy to chat about the problems you're facing and the kinds of workloads you're running. So thank you very much. If anyone has any questions, I'm happy to take them. There's a microphone right in the middle there. Very good, Shreya.

Q: We are trying to do a similar thing in a different area. Could you talk about the reward model in your peer selection phase? One challenge is that you make the decision based on historical metrics, right? But the decision you make is for the future. So how do you resolve that problem? Do you involve any prediction algorithm there?

A: Yeah, that's a great question. So that rewards table was actually not the probability table. What you do is you perform a synchronization, you observe the reward for that synchronization, and then you update the probability that you're going to select that peer based on the reward. So over time, the probability of selection does change based on accesses. And then that epsilon factor, if something goes to zero probability because you're just never synchronizing with it, that epsilon factor is what allows you to go back to that peer in case things are changing inside your environment. But the rewards table itself is just a policy table, it's not a probability. My rewards table happened to add up to one, so it kind of looked like a probability, but it itself is not a probability.

Q: Okay, I have a second question. One challenge of this model is computing complexity. When the problem scale becomes bigger and bigger, it can take hours to get the best results. So I want to ask, do you have a similar problem when facing large-scale systems, and any suggestions here?

A: That's a good question. So we actually prefer to use online models, which are models that can be updated in real time, and our systems work in real time. What we had to do is add some additional compute and memory to all of our pods in order for them to handle this real-time computation on every synchronization, but it turned out to be a fixed cost, and it does get spread over the entire cluster. If you're using something like Q-learning, or something that uses a neural model, then you're right, the cost could get a little higher, because you might have to attach a TPU or a GPU in order to do the backpropagation or the softmax computation. But looking at online models, especially online Bayesian models, is a good way to react to large systems and scale this up to much bigger ones. Yes, please.

Q: Hi, I wanted to know what method you used to determine the window of historical data that you use for updating the model?

A: That's an interesting question. So the answer is we didn't; we just kept it all.

Q: It goes all the way back to the very first data point you captured at, I don't know, epoch?

A: Yeah, exactly. It's been running for, I want to say, two years. So we haven't had any sort of constraints on that yet. There is something called contextual bandits, which uses a reinforcement learning process within the reinforcement learning process to decide how to decay probabilities over time. But as it turns out, the state isn't actually that large, because the state is just the reward floating point value plus the cumulative probability function.
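To make that concrete, here's a sketch of the kind of incremental update being described: each observed reward folds into a running per-peer estimate, so the retained state is just a couple of numbers per peer. The incremental mean is a standard bandit technique, not necessarily the exact Honu implementation:

```go
// A sketch of the reward-observation update step: fold each observed
// reward into a running mean without storing any history. This is why
// the state stays tiny even after years of operation.
package main

import "fmt"

type PeerStats struct {
	Count int     // synchronizations attempted with this peer
	Value float64 // running mean of observed rewards
}

// Observe applies the incremental-mean update V += (r - V) / n, which is
// equivalent to averaging all rewards ever seen for this peer.
func (p *PeerStats) Observe(reward float64) {
	p.Count++
	p.Value += (reward - p.Value) / float64(p.Count)
}

func main() {
	p := &PeerStats{}
	for _, r := range []float64{0.5, 0.75, 0.25} {
		p.Observe(r)
	}
	fmt.Printf("n=%d value=%.3f\n", p.Count, p.Value) // n=3 value=0.500
}
```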
So we haven't actually experienced any state problems with our computation, at least running the system for two years. Potentially if we ran it for 10 or 20 years we might have some concerns, but at the moment we don't have that problem. And what we do is inject local resets: we reset all the probabilities on local nodes on a periodic basis, just to make sure that the system stabilizes and isn't trapped in a bad configuration. So that's another way that we handle the historical problem.

Q: Is the training on the global context, or is it done within each local context?

A: Yeah, so with reinforcement learning, these are agents. Each replica is its own independent agent, and it's happening in real time. So they're updating their probabilities every 125 milliseconds, and there's no sidecar training process or sidecar inference process. This all happens in real time, in an online fashion.

Q: Thank you.

A: Cheers. Last question.

Q: On your rewards function, how did you select the setup parameters so that they actually influenced the policy correctly?

A: Great question. My advisor Pete and I went out to a bar, ordered some whiskeys, and we just chatted it out. And you know, I joke, but that is actually what happened for this particular system. Reinforcement learning is a weird mix of heuristic and algorithmic, right? The policies that you create are more like training parameters than they are like heuristics, because if one of the factors that you choose for a policy doesn't do anything for your system, then it will essentially just get washed out of the system. So you can choose wrong with a policy, and if it doesn't make any impact, neither a negative nor a positive one, then it will simply fall out and you won't notice it. So the policies that you choose for your reward function are what you hope to see the behavior of the system move toward, right? You're thinking in terms of rewards: what am I trying to encourage? What can I give more weight? What is important, but not as important as these other things? So my recommendation in practice is to write a list of, say, 30 policies, order them, take out the last 20, and then assign weights; that is usually how you can come up with a good policy for experimentation. And if you're still not sure, you can run these things in simulation. Discrete event simulators do a great job of simulating this type of system, and you can run those simulations very quickly for different reward functions and policy models in order to figure out the one that you want to actually deploy in production. All right, thank you very much.