A word about me first. I'm one of the co-founders of Aiven, which is a database-as-a-service company. We operate in pretty much all the public clouds, and previously I used to work on large-scale databases and distributed systems. I'm also the maintainer of a bunch of open source projects, mostly around Postgres. These days I'm not that active anymore, but I still do something every now and then when time permits.

Then a word about Aiven, just so you know where we're coming from. We're a company that operates databases globally in six different cloud providers, in 89 different regions around the world. There are eight different open source data engines or messaging systems that we provide, and we started off in early 2016 by providing a managed Postgres service. While we're not a hyperscaler like Google or AWS, we still operate at a pretty fierce scale.

Some definitions before we get into the meat of the matter. Stateful systems typically hold important things, namely the state of your system. That's also what makes them slightly different from stateless systems, which are really easy to restart and really easy to move around. This, by the way, is where systems like Kubernetes have typically been at their strongest: stateless systems you can easily stop, restart, or start running in a pod somewhere else. But in the case of stateful systems there's plenty of data, and the more data you have, the harder it gets to actually manage them. There are also lots of considerations around durability and accidental changes to the data, and those are the kinds of things you really don't want to see happening by accident.

As for immutable infrastructure, it's basically a paradigm where servers, once they're running, are never changed afterwards. The idea behind this is to make your deployments more consistent and reliable, because you're always doing them the same way. Historically, if people deployed a service and then went in manually to change something in the configuration, eventually, if you had, say, 100,000 machines, none of them looked the same as the others. And while there have been lots of deployment automation tools for this sort of thing, the machines typically still start to differ over time. I'm going to go through some of the tooling we have, but you pretty much need quite a bit of automation around this, because operating it at scale is otherwise kind of hard.

Anyway, let me take us back to early 2015, when we started writing our platform. A lot of our team had a background in using Debian at the different companies where we had worked previously. Debian at that time had its issues with slow release cycles, and because of that a lot of people were using backported packages; we had been backporting tons and tons of things. So one of the things we knew out of the gate was that we really didn't want to backport stuff, at least not system components, which would basically have meant rebuilding the whole thing. The other thing we wanted, as a corollary, was something that tracked the upstream projects really closely, so that when you made a bug report, the fix would actually reach you once it went in upstream. And back then there was still the hullabaloo around systemd integration into Debian, and systemd was something we already wanted to use at that point.
Debian has a reputation for being "stable". I've added the scare quotes there, but I'll get to that later. It has many positives: the open source, free software ethos is really strong in that community, and there's no single controlling company behind Debian. It's basically lots and lots of volunteers; they may be doing it on company time, but they're still working there as individuals, and there's no single overarching company behind the distribution, unlike, for example, Fedora, where Red Hat is the prime contributor. There are also tons and tons of packages available in Debian. Based on debian.org's front page they have something like 59,000 packages, so they basically cover pretty much all the free software out there — not really, but close enough that it doesn't make a difference. There are also lots of Debian derivatives, Ubuntu being the most famous one, which means a lot of people know how Debian and Debian derivatives actually work.

But it turns out Debian was not quite the perfect fit for us. With stable, especially back then when Debian releases had been few and far between, you either needed to start backporting stuff or you had to live with the old packages, which we really didn't want to do for various reasons. And once you got far enough behind the curve that you needed to start backporting system components, it really wasn't fun anymore, because you ended up doing lots and lots of work that you'd rather have somebody else do — which was the whole reason for the thing in the first place. Also, once you go down that path, you're not really running Debian itself anymore, you're running a custom distro. That's fine in itself, but then why did you want to go with the stable system in the first place? Didn't you want something that other people had proven and battle-tested over time? Basically, we had really bad experiences with having to do backporting for ages, and we really didn't want to do that anymore.

Then the other thing was the systemd hullabaloo back in the day, when Debian was choosing which init system should be the default. Even though systemd had been available as a package for quite a while before then, we still felt the integration wasn't quite there. A lot of packages still shipped SysV init scripts and lacked systemd unit files for some time, and on the whole, at the time, it really wasn't that well integrated with systemd, which was definitely something we wanted to use already back then.

So then we started looking at the alternatives. We eventually ended up with Fedora; there were a couple of others, but Fedora was pretty early on the main contender. Fedora has a six-month release cycle — well, it slips every now and then, but still a six-month-ish cycle — which sounded a bit scary to us at first, because it means you need to be continuously updating. Fedora supports the current distro release and the one before that, and the older release only gets a couple of months of overlapping support. So you need to be thinking about upgrading at least once a year, and if you don't do that, you're out of luck as far as security patches and whatnot go.
Anyway, the upside of this is, of course, that everything is fairly fresh. You don't need to backport a lot of stuff, and in particular you don't need to backport system libraries or system components, which is great, because you really don't want to do that. On the other hand, as a database-as-a-service company we still need to build plenty of packages ourselves in order to fix customer-found bugs, pick up new minor releases, and so on. There are still tons of things we need to package, but instead of thousands of packages we're now left with something like 150, which is great. So it sounded like something we wanted to go with.

The other thing I mentioned was systemd. systemd has been in Fedora for quite a while, for obvious reasons, but it's also well integrated, it's been the default for a long time, and it works fairly well. Another anecdotal thing, by the way: my PulseAudio setup on Debian never worked, but it worked out of the box on Fedora. That's not quite related to us choosing Fedora — well, sort of.

Also, RPM spec files, the way you build RPM packages, are much, much nicer to work with than Debian packaging. That's my personal opinion, so don't take it to the bank as objective truth, but I really do recommend writing RPM specs rather than doing it the Debian way. Of course there isn't just one way — there's the official way of doing it in Debian, and then there are about ten other ways of creating Debian packages too.

What you get out of the box with Fedora is an up-to-date kernel and systemd. The kernel, by the way, also gets updated over time, so you're not stuck running whatever version was current at release time; you're actually running something fairly recent, as in a kernel released in the last month or so. systemd support is there out of the box and it really does work well; there haven't been any issues around that apart from a couple of systemd bugs. Then you get SELinux, which is also integrated fairly well into the system and works — we haven't really had any issues with it over the years — and the firewall tooling is there too. And in general — this is, by the way, a difference in packaging philosophy compared to Debian — when you install a Debian package, say the Postgres packages, by default it actually starts the server, binds it to whatever the default port is, usually on a public interface, and starts serving. At that point you don't really have any useful configuration for it or anything else. We much prefer that when we install a package, it does absolutely nothing until we tell systemd to actually start the thing. That's one nice thing that's usually not mentioned anywhere. Fedora also has the latest Python these days — well, it has had the latest Python forever, the version number was just different back then. We're a heavily Python-using house: we have some code in Go, Java, and C, but the vast majority, ninety-something percent, is Python for us.
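To illustrate that packaging-philosophy point in code, here's a minimal sketch of the install-then-explicitly-start flow — this is not our actual provisioning tooling, and the package and unit names are just examples:

import subprocess

def provision_service(package, unit):
    # Installing the RPM just puts the files on disk; nothing starts yet.
    subprocess.run(["dnf", "install", "-y", package], check=True)
    # ...this is where you would write out the configuration you actually want...
    # The service only comes up once we explicitly ask systemd for it.
    subprocess.run(["systemctl", "enable", "--now", unit], check=True)

# Example names only; a real Postgres install on Fedora would also need an
# initdb step (postgresql-setup --initdb) before the first start.
provision_service("postgresql-server", "postgresql.service")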
Anyway, then a word about the topic itself. Generally speaking, our philosophy on nodes — which for the purposes of this talk are either a virtual machine or a bare metal machine, we don't really distinguish between those — is that they're disposable. We don't care if they go away; we expect them to go away, en masse, around the planet, all the time. The other thing is that we really don't put any manual effort into any single node; we do everything by automation, and that's been the case for quite a while. We also operate in six different public clouds, and they all have different ways of doing things, so we try not to rely on their functionality too much. Even for things like disk encryption we use LUKS instead of the cloud provider's built-in functionality. In general our integration towards the cloud providers is fairly minimalistic.

There's also another side of that coin. Because of the way we do persistence, which I'll go into in a bit, we can use things like local SSDs. With many of our competitors, what you're typically using is network SSDs — EBS volumes, Google's persistent disks, or Azure's premium SSDs — and basically because of the speed of light they have severe limitations on how well they perform compared to local SSDs, which are PCI Express NVMe devices connected directly to the machine. That gives us some performance benefits. The idea behind durability is that we always have the data somewhere other than the actual node, so if the node dies for whatever reason — which they frequently do — it's still not a biggie. Since we're worried about persistence and durability, we try not to rely on EBS volumes, persistent disks, or premium SSDs for persistence. You can't easily move them between clouds, and one of our value propositions is that you can move your services between clouds with a couple of clicks. So instead of having a migration project, you can just say that you want your service, which is currently in some AWS region, moved to us-east-1 or wherever you want to move it. I'll go further into the details of how we do that, but since you can't easily move these network-attached disks, we solve the issue the other way around.

The nice thing is that since we handle persistence in a slightly different way and can use local SSDs, the numbers look quite different. Here are some example figures — let's just go with reads, it's simpler, since reads and writes have different characteristics. An EBS volume can do roughly 250 MB/s and around 10K IOPS, from what I remember. On the other hand, the same cloud vendor's i3 instances, which have local SSDs, can do north of 2 GB/s, actually closer to 3 GB/s, and on read IOPS you can get into the millions. So it's in a completely different ballpark when it comes to hardware characteristics.

Then here's an example of how we do persistence for Postgres. We have a thing called pghoard, which was originally written by yours truly. It's basically a Postgres backup daemon; it's on GitHub, and based on GitHub stars it's the second or third most popular one for Postgres. What it does is take our write-ahead log, compress the data, encrypt it, and then send it to an object store.
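To make that flow concrete, here's a minimal sketch of the compress-encrypt-upload idea. This is not pghoard's actual code — the object store client, the cipher, the key handling, and the bucket layout are all just assumptions for illustration:

import lzma
import boto3  # assumed S3-compatible client; the real tool supports several stores
from cryptography.fernet import Fernet  # stand-in cipher, not the actual scheme

def ship_wal_segment(path, key, bucket):
    # Read one finished WAL segment handed over by Postgres.
    with open(path, "rb") as f:
        raw = f.read()
    compressed = lzma.compress(raw)              # shrink it before upload
    encrypted = Fernet(key).encrypt(compressed)  # client-side encryption
    # Store it under a predictable prefix so a restore can walk the timeline.
    name = path.rsplit("/", 1)[-1]
    boto3.client("s3").put_object(Bucket=bucket, Key="wal/" + name, Body=encrypted)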
This basically gives us a bounded data-loss window, which is okay but still isn't great, so for all our HA services the customer gets to choose whether the data is synchronously or asynchronously replicated — basically they get to choose how much performance loss they're willing to accept. With that, you can get arbitrarily low data-loss windows for a single node loss. Also, since we provision in all the different cloud vendors, each of which has multiple availability zones, we automatically spread the nodes of a given service. So if you have a Kafka cluster, a Postgres service, or an Elasticsearch cluster, we always split those across multiple AZs, and we also make sure, in the case of clustered systems like Kafka, that if you have N copies of a partition, they are always split among the different availability zones automatically.

Our approach to upgrades is that we do rolling-forward upgrades. When you have, say, a three-node Kafka cluster or a three-node Elasticsearch cluster, we create three new virtual machines side by side, replicate the data over, and do a controlled failover without any downtime for the customer. That's pretty much the way we do all our software and hardware upgrades. The same thing applies when you change plans: say you have three machines that each have eight gigs of memory, a couple of CPUs, and X amount of disk, and you upgrade to a larger plan — again we create a bunch of new nodes, replicate the data there, and then do the failovers. And it happens the same way when we change between cloud providers; if you're moving from AWS us-east-1 to Google's South Carolina data center, we just use the exact same methodology again and again. So once the actual nodes are up and running, we never touch them again. We do this at a huge scale, and it's been fairly useful to have just a single way of doing it. We've had our share of issues with this, but happily there's only one way for us to do any upgrade, so we really rehearse it a lot.

Then a word about systemd-nspawn versus Docker. Docker comes with its own set of baggage. There are some philosophical things we disagree with, mostly around having a single process per container — you can get around it, but that's still the general tendency. Also, systemd-nspawn is part of the systemd you're already using: it's already there, built in. It's much more minimalistic and doesn't come with all that much stuff, but it works fairly well. And with the way we build images, our container images are basically just directory trees, more or less — they may come in a tarball or compressed format or whatnot, but they're essentially just directory trees. The other thing is that nspawn integrates really well with systemd, which is pretty neat for us because we can control stuff from outside the container with systemd itself. We use unit files for a lot of things, with a lot of different directives, and we use journald and its structured logging quite a bit. Both have been really good for us.
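As a small illustration of the structured-logging side, here's a minimal sketch using the systemd Python bindings. The event and the field names are made up for this example, not our actual log schema:

from systemd import journal  # from the systemd-python bindings

def log_failover(service_id, old_master, new_master):
    # Extra uppercase keyword arguments become structured journal fields, so
    # on the host you can later filter with e.g. `journalctl SERVICE_ID=...`.
    journal.send(
        "controlled failover completed",
        SERVICE_ID=service_id,     # hypothetical field names
        OLD_MASTER=old_master,
        NEW_MASTER=new_master,
    )

log_failover("pg-1a2b3c", "10.0.0.12", "10.0.0.34")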
Then a word about the host machine. The host machine where we run our customer services is also running Fedora, and there's a single container on that VM — or well, node, so either a bare metal machine or a VM — plus a bunch of things running alongside it. Once we provision a VM, we install our management agent there, and the first thing it does is refresh the packages it has. We do this so that we have immediate control over any new nodes even before we build new images, which we do frequently. So we build our images a lot, but if we want to say, for example, that you cannot have version X of the Postgres package because we've decided it's broken, rolling that out to 89 different cloud regions around the world takes a while; depending on the cloud vendor it may take quite a bit of time. So we created this mechanism so that just five minutes later, when we create the next node, we can control exactly which packages it gets. Typically there's nothing to do at this step — the images that are out there are usually already good enough — but it's basically an emergency handbrake. After this point, once the agent has installed the packages it needs for the customer service, the machines are immutable: they really don't change again for the lifetime of that node. We do have the ability to go in manually and install stuff, but it's basically never done — only for debugging purposes if we need to do something weird.

Then the management agent starts up. We call it prune, which is apparently some sort of plum tree or something; we came up with the name somewhere, I forget where. What it does is set up the machine to operate the customer service. It sets up the disk layout with RAID and encryption; all the cluster nodes talk to each other over IPv6 over IPsec, so it sets that up; and then it restores the data, either from backups, which are often in object stores, or from the other nodes in the cluster, depending on the type of cluster we're serving. After that it keeps monitoring and reporting the health of the system — there are other ways we do that too, but it keeps a general sense of whether the thing is still healthy and makes sure there's a heartbeat coming out of it. It also reacts to configuration changes: when we add a new node we need to create IPsec tunnels to it and change the cluster configuration in the different database services, and customers are also allowed to change some configuration parameters in the different services — we allow users to configure some Postgres settings, for example. So while the packages themselves are immutable, the data files on disk are obviously changing as the database runs, and the configuration may change in those Postgres config files, but otherwise it's completely immutable. The management agent also sets up a bunch of auxiliary agents, all of which run on the host side. We collect metrics out of the system — tons and tons of metrics, so there are lots of data points coming out of those — and we ship the logs from the journal, also in structured format.
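Conceptually the heartbeat side of that is very simple. Here's a minimal sketch — not our actual agent or protocol; the topic name, payload fields, broker address, and the kafka-python client are all assumptions (the transport being Kafka comes up again a bit later):

import json
import socket
import time
from kafka import KafkaProducer  # kafka-python client, assumed here

producer = KafkaProducer(
    bootstrap_servers=["kafka.internal.example:9092"],  # hypothetical address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def send_heartbeat(service_healthy):
    # One small message per interval; the control plane can flag the node
    # for replacement if these stop arriving for long enough.
    producer.send("node-heartbeats", {   # hypothetical topic name
        "node": socket.gethostname(),
        "healthy": service_healthy,
        "ts": time.time(),
    })
    producer.flush()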
So we're able to search through those, and we retain the structure of the logs that we put in there; we use structured logging for this as well. Then there are also the backup and HA daemons like pghoard, which I mentioned, and we've largely open sourced those over time. The one we haven't is the Cassandra one, and we're hoping to do that too, but when we implemented it, it started relying on some of our internal things, and now we need to rewrite it in a slightly different manner. Besides selling Apache Kafka as a service, we use it internally a lot, so we have tons and tons of Apache Kafka clusters, and all of these nodes and all of these daemons are talking to Kafka. All the configuration changes coming from users are sent over Kafka to these nodes.

Then there's the container on those machines, or nodes. It's run through a fairly locked-down systemd-nspawn, and it contains only the customer's services — things like Postgres or Apache Kafka or Elasticsearch or what have you. None of these allow code execution; we haven't taken a single service type into use that would allow arbitrary user code execution, basically because we're lacking a really good sandbox for that. There are a couple of interesting ones, like gVisor and Firecracker, that we've been looking at, but currently we don't allow any code execution. It's not that it would be that bad if somebody managed to get into their own machine — there's nothing that secret there — it's just that we'd rather people not shoot themselves in the foot with a footgun. After installation the container, again, is totally immutable except for the config files, which may change — the Postgres config may get the user's options, and we don't really want to rebuild the whole service just to get a new configuration with a different number in it — and the data files the database itself is writing to.

Then a word about image building. We support six different cloud providers, and they all have different ways for you to register new images. DigitalOcean, for example, only has prebuilt images that you can take snapshots of after modifying them, so you can't upload your own images at all. So we have to make sure our tooling works with all of them. Some public clouds are also fairly slow to operate with when you're creating base images, especially when you're transferring them to all the regions of that cloud provider, and the ways we do this are cloud dependent — there's nothing really shared between the clouds, they just have very different implementations. Anyway, we now support, I think, 89 cloud regions among the six public cloud vendors. The pre-installed packages we put on the images, some of them are fairly large, so they do take some disk space. But the idea behind pre-installing them and making the images already contain the stuff is that when we spin up a new node, which we do a lot, it's ready to serve the customer and their needs much faster.
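Since the container images are essentially just directory trees, one way to picture the whole thing is the sketch below: install the packages into a plain directory with dnf, then boot that tree with systemd-nspawn. This is only an illustration of the idea — the paths, package list, release version, and flags are assumptions, not our actual build pipeline or hardened nspawn invocation:

import subprocess

IMAGE_ROOT = "/var/lib/machines/pg-image"  # hypothetical path

def build_image_tree(packages, releasever="30"):
    # Install a minimal Fedora plus the service packages into a plain
    # directory tree; that tree, more or less, is the container image.
    subprocess.run(
        ["dnf", "install", "-y",
         "--installroot=" + IMAGE_ROOT,
         "--releasever=" + releasever,
         "systemd"] + packages,
        check=True,
    )

def boot_container(machine="pg-1"):
    # Boot the tree's own systemd; the real invocation adds far more
    # lockdown (networking, bind mounts for the data disks, limits).
    subprocess.run(
        ["systemd-nspawn", "--directory=" + IMAGE_ROOT,
         "--machine=" + machine, "--boot"],
        check=True,
    )

build_image_tree(["postgresql-server"])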
And then, depending on the cloud provider, boot times vary: Google is fairly fast, Azure is fairly slow, and AWS is somewhere in the middle. It usually takes somewhere between two and ten minutes from the time we call the cloud provider's API and say, please give me a VM with these specs.

Then, testing. Because we do a lot of these updates — we basically follow Fedora's release cycle, mostly — we have tons and tons of testing: unit tests, system tests, chaos tests, whatever kind of tests. But the thing is, when we've actually hit problems, it usually hasn't been a test targeting a particular version of something. It's just some generic test that starts failing, and then we go investigate and find that, okay, it's because something changed somewhere. If you want to follow a fast-changing distribution, you really need a fairly wide-coverage test suite — that's my opinion at least — or otherwise you're basically sailing blind. I'm not saying our test suite couldn't be better; it could definitely be much, much better, but it has still found lots of different issues. The other thing is that with our approach you're basically enduring a bit of pain all the time. The other way around, if you do this every three or four or five years or whatever, there's an immense amount of pain when you eventually do have to move to the next version of the distro. We'd rather have a bit of pain all the time instead of an immense amount of pain every X years.

You really should also be reading the release notes of everything with a magnifying glass. It hasn't always been smooth sailing. Recently, glibc changed its Unicode collations and we got hit by this: while we were aware that it was changing, we weren't aware that Fedora had backported the change to the previous glibc version. That came as a bit of a surprise for us, but it's something we should have read more carefully in the release notes, because it was definitely mentioned there. Then, IPsec, either in the kernel or in the tooling, keeps breaking all the time, which in 2019 I'd rather just worked, but we still have issues. Of course, we're also using it on the public internet, where the networks are not that great, which means we're probably exposing a lot of these components to an environment they weren't really designed for. Originally things like Apache Kafka were used in in-house data centers, where the networks were stable and everything was good; public cloud networks aren't great, and you keep having issues all the time with losing nodes or having network splits or whatnot. Then one of the other annoying things is that DNF is a bit on the slow side. Happily there are going to be some improvements there, but it's also not really resilient against temporary network errors, so we actually use a wrapper around it. But in general we're very happy with Fedora, and it's allowed us to focus on what we're doing instead of backporting stuff again. The way we're doing this has also forced us to de-emphasize the meaning of any single node: if a node gets lost, that's fine. We just need to take care of persistence another way, because we don't really care about the particular nodes anymore.
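On that DNF wrapper point: conceptually it's nothing fancier than retries around the dnf invocation. A minimal sketch — not our actual tool, and the retry count and backoff are arbitrary:

import subprocess
import time

def dnf(*args, attempts=5, delay=10.0):
    # Retry the whole dnf run on failure; most failures we care about are
    # transient mirror or network errors rather than real package problems.
    for attempt in range(1, attempts + 1):
        if subprocess.run(["dnf", "-y"] + list(args)).returncode == 0:
            return
        if attempt < attempts:
            time.sleep(delay * attempt)  # crude linear backoff
    raise RuntimeError("dnf %s failed after %d attempts" % (" ".join(args), attempts))

dnf("install", "postgresql-server")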
And if you want to do things this way, you really, really want to automate pretty much everything. Any questions? The first few questions get socks — by the way, everybody who has a question, come to me later and I'll find you the socks. Can we have some here? Okay, there are people in the audience with socks.

The first question is about runc and OCI, and OCI image formats. We could have used something like runc, but back in the day it didn't exist — how long ago was this? 2015 — so it didn't exist. Besides, I think systemd-nspawn is actually trying to get to the point of being able to run OCI images directly. But on the other hand, what we're looking for next is really better sandboxing rather than—

The second question is what language prune is written in, and whether there are any interesting mechanics in the orchestration piece. It's written in Python. It does a lot of interesting orchestration, but it's also part of our proprietary secret sauce. We're happy to open source a lot of the tools we work on, but that one is also some of the stuff that we actually do ourselves.

What database does prune use? It doesn't use one internally — it runs on every node and doesn't have a database of its own — but what we use for our own management systems is Postgres. I'm a long-time Postgres fan, and all the founders had written some small piece of code for Postgres back in the day, so we're heavily into Postgres.

So we picked Fedora for its release cycle and because it was up to date — why not go even further and run Rawhide? There's only so much pain we can take. Rawhide back in the day used to be even wilder; these days it has actually settled down, in my opinion, but you still have to draw the line somewhere.

Any other questions? How do we react to a host going down — because with disk encryption it can't, or I assume it can't, come back up on its own; do we just spin up a new host? We typically spin up a new host once a node has been unresponsive for a certain amount of time. They all send heartbeats, over Kafka actually, so we have a fairly good idea when a node goes down, and we also actively monitor for things like ACPI events. So if a cloud provider wants to tell us that a node is going away, we get notice beforehand. Usually they don't — they just vanish — but some of them are nicer in this regard than others.

Do we use gVisor, which I mentioned? We've been looking at it. We don't currently use it, but it's something we've been looking at actively. Firecracker is another similar thing; the problem with Firecracker in our case is that you can't run it on instances that aren't bare metal. In the case of AWS that sucks; GCP and Azure actually allow you to run things like KVM on KVM, but you can't do that on AWS.

The talk was mostly about state management, and we solve the problem by having the applications do some sort of clustering themselves — are there daemons that don't support clustering on their own, where we have tooling to get the state out and sync it somewhere else? Yes, we do — it's a fairly broad question, but yes. Come talk to me after this if you want more details.
You said you use LUKS — how do we deal with key management without using KMS services? For us the keys are transient, in the sense that they only have the lifecycle of the node. Once the node is gone, that's it; we don't go back to nodes that are gone. And does that include reboots? Yes — even reboots, from our point of view, mean the node is gone and we replace it.

Any other questions? I think I'm running over time. Following up on that: if you restart a node somewhere, it also needs data, so how do we provision the initial data? That was covered here on a slide somewhere, but the answer is that it depends on the service type. In the case of Postgres we typically restore everything from the object store up to the point of the very latest data, and that last bit we replicate from the other hosts in the cluster. But if you provision from an object store, you need secrets or authentication and that kind of stuff? Yes — all of this is encrypted; we do client-side encryption, we obviously hold the keys for the encrypted data that's in the object stores, and we just restore from there. We use things like pghoard or myhoard or others like that, and those are all open source on GitHub.

Anyway, I think that's it. If you have any other questions, please find me after this. Thank you very much.