OK, so welcome. This is the last session of the day, so thank you for making it through to the bitter end. This is a talk about immutable FreeBSD, and I think all of us have dealt with enough operating systems to know that this is a lie. It's a goal. It's aspirational. It's not achievable, but we made some progress. Immutable means not subject to or susceptible to change. This really comes from two different perspectives. One is the perspective of a service sitting on the internet: the less attack surface there is, the fewer problems we have. The other is operational: if it's read-only, I probably won't have to fix it, and I like that. I like systems that don't wake me up at night and don't give me work to do in the morning.

So, I'm DCH, professional yak herder and automator of things. If this sounds like something your company wants or needs, ask me questions after the talk. The link to the slides is down at the bottom; this one's the PDF version. I don't think the blue is very readable either, sorry.

OK, so: the enemy of the state. The enemy is the state. If we have a service that has state (a database, a web server with cookies), those are the things that make our world hard to manage. It's much easier to deal with if you have control over the entire stack, from the operating system through middleware and databases to the applications, and that's where we want to get to: we want to eliminate the state. I have a long-standing phrase for this: idempotent, repeatable, composable, and loosely coupled. That's, I guess, twenty or thirty years of learnings. Idempotent: I can run the job again and again and again, and I should get the same result. Repeatable: I shouldn't have to run the job all the time, but if I do, it should work the same way. Composable: it's important to have clear separations between our components, so that we can replace a web server or a proxy server, change databases, even change hosting providers, without things breaking. Through this talk there are a couple of points where I give my ten cents on where I think it's a good place to break these services apart. And we definitely want to minimise operational tooling and effort.

Personally, I've worked in really two types of organisations: very small ones, where the total team is less than five, and larger ones, where there are maybe five continents involved. There seems to be nothing in between. In both of those cases it's really important that the team doing operational or application work can do their stuff without getting in the way of everyone else. So that's what we're focusing on: how can we minimise the long-term effort and do the work at the beginning, when the application is being built? Any time there's a red slide, you're welcome to ask questions. If it's not a red slide, wait a little bit and there'll be a red slide coming up real soon, I promise.

There are two sponsors down the bottom, Klara Systems and Juniper Networks. Some of the things we used in this talk were built either at Klara or with Juniper Networks. I'll mention those in more detail at the end, but thanks very much for giving back to open source; that's why we're here.

So, the first bit is the plumbing. I start from the plumbing, the networking, because it's the bit I'm least familiar with, and so it's the bit I wanted to offload as early as possible to somebody else.
OK, the context for this structure is that we have global services, and we want customers to be able to reach the closest server. That should happen automatically, without our intervention. We should be able to take a continent out, or a cluster, or a server, and have the network do the needful to make everything just work. The first couple of times I don't think we got it right, and I think we still haven't got it right, but it's a lot better.

The very first version of this used CARP failover on the servers, which is really an Ethernet and IP layer failover. The main feedback I got from the operators at the time was: we don't understand this stuff, it's too low-level, it's too confusing. What would happen is that we would reboot servers for patching, CARP would come up, services would fail over immediately, but the application hadn't booted yet. So the main change we made in this area was to stop using CARP and switch to BGP. Not BGP externally, but BGP within our own networks. The main advantage for us was that BGP then became just another service that the operators could start, stop and automate, just like an application or a container. We put that in around 2017, and that's the last I heard of it: no one knows it's there, it just works, and that's the way technology should be.

So people come in from the outside, and we have anycast, which directs them into the nearest point of presence. We've tried using geo-DNS, and in some cases that's really nice. We can have health checks built into our DNS queries, and it has really nice, fast convergence, but I haven't changed over to that, and maybe we never will. So: three global regions. Traffic comes in through anycast, hits one of these regions, and hits our ISP's router first. The ISP router has a list of announced BGP routes from inside the network, and it's receiving these little messages from each of our application servers saying: hey, I'm alive, you can send me the things, I'm good at processing the things. On a given server, though, there may or may not be applications running. So the very first thing that starts on our systems is not the BGP daemon; it's actually our proxy server, HAProxy, and its job is to take the traffic it receives from the upstream ISP router and decide where to send it. It can send it locally, which is the best option for us (faster for the customer, cheaper for us), but it can also decide to send it to another server, or even another continent, if stuff has gone horribly wrong. Proxy servers are great for that. There is one limitation, and I'll touch on that later.

So, recapping from the outside world: anycast, DNS, health-check failovers, down to a region's ISP infrastructure running BGP, and now we're down to a single server running HAProxy, shipping stuff out to various jails. I mentioned separation points, places where it's a good idea to draw a line. The very first versions of this used, let me think now, autossh tunnels, which I guess was the thing we knew at the time, but SSH is not really designed for long-term stable connections; something like IPsec would probably have been better. What we've done since is move everything over to a mesh VPN called ZeroTier. These days people would probably pick WireGuard, but the general concept is the same: the mesh magically knows how to find the closest servers, and if one of the links is down, it will magically route around it.
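Just to make the mesh concrete, here's roughly what bringing a FreeBSD node into a ZeroTier network looks like. This is a sketch, with an illustrative network ID rather than our real one, and assuming the net/zerotier port's rc script name:

    pkg install -y zerotier
    sysrc zerotier_enable=YES
    service zerotier start
    zerotier-cli join 8056c2e21c000001    # network ID is illustrative
    zerotier-cli listnetworks             # check the mesh interface came up

Once the node is authorised on the network controller, every other member can reach it over whichever physical link happens to be up.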
And this really does work. At the time we got extensive testing courtesy of DigitalOcean: they were in the middle of consolidating their BGP routes globally, and every few days our mesh would break. Once we switched over to ZeroTier, that just disappeared completely. We've noticed patches where it slows down, but traffic never really got lost, and we've been doing that for six years now. That's pretty cool. I like networking when it doesn't get in my way. A big thanks to Equinix, the group that got me onto using BGP in the first place, and also to Peter Hessler, whose training course convinced me it must be really easy, so I should have a go.

OK. I don't want to show whole config files, because it's too much, so this isn't all of our BGP config, but it is all we actually need. Clearly we're not a networking company, and we're not running large enterprise networks. All we need to do is announce one local IP as reachable, and we send that to, from our perspective, one upstream ISP router. On their side that's probably some sort of magical cluster, but this is all we need. Pretty straightforward.

Next up, in a bit more detail: the load balancer. This is HAProxy, mainly because at the time the choice was between NGINX Plus, which had the health-check functionality built in at a price, and HAProxy, which had it built in, open source and free. So that's the way we went; sometimes the decisions we make have long-term consequences. For HAProxy, what we're really interested in is down the bottom. There are a few stanzas here, and I hope the font's big enough to read. This is a clustered database: there are three nodes in the cluster, and traffic is distributed across them. I've removed from this template the logic that makes sure you go to the physically closest server, because it just gets too confusing. The order here varies between servers: if you're in the US you'll get sent to the co2 server rather than co1, if you're in Europe you get co1 rather than co2, and if you're in Asia you get the co3 server first. If that server isn't responding, HAProxy will wait for a few seconds and then send the traffic somewhere else. What's important is that it doesn't just wait: it holds the customer connections open. Data is coming in from the customer, and we're waiting and waiting for HAProxy to decide "oh crap, it's down, I'm just going to go over here". That time is actually significant. We don't have these failures very often, and if we're doing maintenance we can plan around it: we take the server out of the load-balancer configuration, traffic is drained, and then it goes to the other servers seamlessly. So this delay only matters when we have a real operational failure.

One thing I forgot to mention about HAProxy: it has tremendous Lua integration. It's not super well documented, but you can do all sorts of things. You can call external services to get status, for example querying one of our other systems to know which jails are up and which applications are running well or not, and you can change the state of HAProxy from these Lua scripts. It's really awesome.
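As a sketch of the shape (not our real config; the node names, addresses and the CouchDB-style /_up health-check path are my assumptions here), the backend stanza looks something like this:

    backend couch_be
        option httpchk GET /_up
        default-server check inter 2s fall 3 rise 2
        server co1 10.10.0.1:5984            # nearest node, preferred
        server co2 10.10.0.2:5984 backup     # used only if co1 is down
        server co3 10.10.0.3:5984 backup     # last resort, other continent

The backup keyword is what gives you the ordered failover: HAProxy only moves down the list when the health checks on the preferred server fail.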
So, load balancers have two sides, and this is the other half of it: the front end. We looked at the back end, how the magic plumbing gets established and traffic gets sent over our mesh VPN; at the top here we're now looking at the front end, and it's very, very simple. This is an HTTP-based application. We bind to an IP and a port, and we add a couple of headers along the way before sending traffic to the back end. What's important about the front end is that this is the separation of concerns. Our applications don't know about the magical mesh VPN. They don't know about the clusters, and they don't know about the nodes that may or may not be adjacent. All they talk to is this local port, the couch_fe front end closest to them. So it's very, very simple for the application to manage, and from a testing perspective it means we can do our application testing and our infrastructure testing separately. It's another separation point.

So, red slide: any questions? No? Cool. Onwards. Now we're talking about the jails. We've come in from the outside, through the networking, quickly through the load balancer, and out the other side, and now we're talking to some sort of application in a jail. And this is the first place we run into a little problem: how do I find the jails? How do I know which jails are up and running? When I talked about the load balancer, I mentioned this convergence time. It takes a few seconds for the load balancer to decide that traffic isn't being responded to fast enough; it has to wait in case it does get a response, but after a certain number of seconds it goes: you're definitely dead, I'm not going to get a response. And that can be up to six seconds. That's quite a long time. If we could get the state of the jails directly, then we would know, like five milliseconds later, that a jail is not there and there's no point sending traffic to it, and we could trim maybe five seconds off our convergence time. That's pretty significant. And to do that, all we need is about twenty lines of shell.

A recurring theme in this talk is: what would happen if I didn't do the obvious thing and found a dirty solution instead? And this is pretty dirty. Those of you who know FreeBSD jails well will recognise that what's on the right is the output of jls: tell me the jails that are running. I don't do any massaging of it or any updates; I just pipe it out of netcat to the socket, and we run this in a loop. You can do about a thousand requests a second against this. In my book, that's a pretty good API for not much work. The only catch is that there's no metadata, and it's a read-only API, not a writable one. We can't use it to create jails, change the state of jails, or add metadata to them, but at least we know they're there. And with a little bit of Lua plumbing in HAProxy, that's all we need to knock five seconds off our convergence time when jails unexpectedly disappear. The other advantage of this little hack is that it's available over the network. Normally with FreeBSD jails you can only find out the state of the jails on your local machine, but with this change, our HAProxy node can get the state of jails on any other server in the network. OK? That's the little hack.
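In shell, the whole hack is roughly this; the port number is just an example:

    #!/bin/sh
    # serve the local jail list over TCP: jls output, straight down the socket
    while true; do
        jls | nc -N -l 8999    # -N closes the connection once the output is sent
    done

One connection at a time, no parsing, no daemon, and anything that can open a TCP socket can now ask this host which jails it's running.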
So we're out of the load-balancer world now, and we're looking at applications. This is the first place where we really start to look at immutability and ask: what can we get rid of? How can we simplify this? How can we take away the things that break? Over the course of the last few years we've tested two web servers (because web servers are not very exciting), eight databases, and many customer applications. And I'm just going to see if I can see my speaker notes so I can tell you what those were. Cool, now we have speaker notes.

First up, web servers. H2O, which at the time was the leading web server for HTTP/2. It has since stopped doing regular stable releases and become a rolling release; it's what Fastly uses inside their CDN. We have it in FreeBSD ports, and it's pretty nice. And nginx, which everybody around the world knows and loves.

So what happens here is: we prepare a jail. We use ZFS and we make it read-only. We write a small number of config files under /usr/local, and we set the entire dataset to read-only. For the web servers, we nullfs-mount the static data into the jail. We leave TLS termination to the external HAProxy, and the reason for that is, again, that there are then no secrets in the jail. We use Unix domain sockets, so the jail has no secrets and no networking, and it's read-only apart from a small number of directories, and those directories we mount nosuid and noexec as well. Again, reducing the attack surface.

The one thing remaining is logging, and logging in general is pretty straightforward. We have two choices: we can mount a Unix domain socket inside the jail, or, better, we can have the application send its data over syslog to syslog-ng running somewhere outside the jail. For nginx that's straightforward; for H2O it's not, and the hack we found for H2O is to pipe the logs to logger, which is a FreeBSD tool, and logger sends that stuff over the network. That's one less thing an attacker can do: there's now no way for them to tamper with the logs from the moment they gain access to the server. Everything is logged, at least until they realise they're being logged and kill off the logging daemon or something. So that's the first piece. For web servers it's pretty straightforward, and if an application can't speak syslog, we pipe its stdout and stderr to a tool that can and forward it off.

There's no networking, so there's no lateral movement, and that's really important from a security perspective. If one application is compromised, you don't want the attacker hopping to something even more vulnerable: into your database, into your private stash of MP3s, anything like that. We don't want that to happen. We've got no secrets in the jail, and they're unable to tamper with backups, because we've nullfs-mounted our web server data into the jail read-only and we do the backups outside. From a daemon perspective, we've got no cron, no syslog, no NTP, no other daemons, and no processes running as root. That's another thing we can monitor: if root processes appear in this jail, things are going very wrong and we should sound the alarm bells.
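Put together, the web-server jail ends up looking something like this sketch, with hypothetical paths and names:

    # jail.conf fragment: no IP stack, static content nullfs-mounted read-only
    www {
        path = "/jails/instances/www";
        ip4 = disable;
        ip6 = disable;
        mount += "/data/www $path/usr/local/www nullfs ro 0 0";
    }

    # h2o.conf fragment: logs leave the jail immediately via logger(1)
    access-log: "| logger -t h2o"
    error-log: "| logger -t h2o"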
OK. Databases: much trickier. We use all the same tricks as before (Unix sockets, syslog, soft links), but databases typically have different storage requirements. You'll have your typical database tables: they need to be writable, they need to be backed up, they need to be on fast storage. You'll also have indexes and materialised views: fast, but we don't want to back them up. And you may have a write-ahead log, which may even need to be on physically different storage, something like a ZIL or SLOG. Databases are also pretty opinionated: things have to be like this, they have to go in these directories, and no, you can't change them.

So how do we deal with this? Similarly to the web servers. We modify the config files as much as possible. We set up separate ZFS datasets, and we can control the mountpoint for each of them, so when the server insists on a particular layout for its directories, we use the ZFS mountpoints to put things in the right places. Each dataset has configurable properties, which means we can set record size, metadata caching, throughput-versus-latency behaviour, all of those things, individually per dataset for the applications themselves. Once we've done that, the overall picture is a read-only dataset as usual, with /var/db/<thing> writable and its properties tuned for performance. Then we run the database, and inevitably the first time it crashes. So we look for logs, we look for errors, and we figure out: is this something we can move (sometimes it's in /var/run or /var/log) and put somewhere else, or do we need to create yet another dataset? And finally, the ZFS snapshots we take while the database is running can be accessed from outside the jail, so we can do our backups from these point-in-time snapshots. When a backup completes, we can retire the snapshot, and we never need to interrupt the application. That's a really, really nice feature.

Apologies for the small font, but we do need it, just this once. Can anyone read that at all? OK, that's all right; it's a win. I wasn't sure what room we were going to get, so I didn't know what was going to happen. This is not a ZFS talk, but this is the key piece we need to take away. There's a column here called "mounted", and I think everyone here knows what that means. There's a column here called "canmount", and that means: if you ask for it to be mounted, you're allowed to do so. It's not the same thing as whether it is mounted or not. It can also be set to noauto, and we'll see that on a subsequent slide. And then the final column: "jailed".

What we have here is our zpool, called zroot. We have a container, a parent dataset, called zroot/jailed, and every dataset under it holds mutable, writable state for our databases. This is where the goodies are: the important stuff, the data we care about. You can see I've got (I don't even know how to pronounce that one) HedgeDoc, Postgres, Soft Serve, sync... what's that, five jails. Each of these has its own storage, and on the far right you can see the mountpoint column. These are mountpoints inside the jail, not outside. They're not visible in the root filesystem of the host under /var/db/<thing>; they're only visible inside the jail. OK, so this is our storage being mounted inside the jail.
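Creating one of those looks roughly like this. The dataset and jail names are hypothetical, and this sketch uses the jailed property so the host never mounts the dataset itself:

    # persistent per-service data, tuned per dataset, visible only in the jail
    zfs create -o jailed=on -o mountpoint=/var/db/postgres \
        -o recordsize=16K zroot/jailed/postgres
    # at jail startup, delegate the dataset to the jail, then mount it inside
    # (the jail needs allow.mount.zfs and enforce_statfs=1 for this to work)
    zfs jail postgres zroot/jailed/postgres
    jexec postgres zfs mount zroot/jailed/postgres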
Then, back to the name column: about halfway down you'll see another sort of parent dataset, called jails. This is our immutable store. The way we've done this is that we create a downloads dataset, and it's literally the tarballs. They can be the stock FreeBSD tarballs if we're just using those, or, if we're building our own versions of FreeBSD with packages embedded in them, they can be our own custom tarballs.

And then right down the bottom you'll see templates. We've only got one template here. That's a ZFS dataset with all the files unpacked from our tarball, just like a normal directory tree, and we make any tweaks we want in there. If you're using tools like Ansible or Chef, that might be the place where you add custom configuration. That template is going to be reused across all of our jails. And that's what we've done here: we've cloned one for HedgeDoc, one for each of the other jails you just saw, and one for www. Each of these clones shares the same parent dataset, which means we're only caching one copy of it in memory. So it's a huge win there. And all of them are read-only.

So that's it: the jails/instances datasets are completely replaceable. They're all read-only, and the way we want to manage this is cattle, not pets. We can remove all of those jails, and we keep our jailed dataset. OK? Does that make sense? That's really important. jailed: the stuff we want to keep. jails: the stuff we can trash at any time and replace automatically. We can erase them, replace them with a newly unpacked tarball, and just restart the jail, and it will automatically reattach its database, its writable directories. If we've set that up correctly, it just runs happily on the new version.

Now a slightly different example, with a bit more ZFS magic. The problem we now have is a complicated application. We're going to put it all in one jail, we need to separate out the different types of storage into datasets, and we want them all auto-mounted by the operating system for us. What we have here is /var/db (it's going to be mounted at /var/db in the jail): zroot/jailed/graylog, and under it, three child datasets. Each one of these has different settings, because they want different block sizes. The OpenSearch one is this enormous beast of a database, so it has extra storage and we turn caching off for it. The other two, Graylog and MongoDB, are configuration databases: they're nice and fast and we want all of them cached. So they have different settings, and in the jail configuration it's actually a single line. We say to ZFS: just mount zroot/jailed/graylog. It mounts that, notices that it's got three child datasets, and those are auto-mounted too. And that's how we get different storage classes in different locations for our databases in the same jail.

Down the bottom, our application has been running, there are lots of interesting things in our logs, and we have these snapshots happening automatically. Now, as a general rule, we try not to do the backups inside the jails, because the jail could be under an attacker's control, and then we would have no data and no backups, and ransomware would get us again. We don't want that. So we do our backups from outside, on the parent system, and in this case, with ZFS, you do your normal thing of creating the snapshot and sending it out as a stream to whatever you want: tarsnap, restic, NFS, something fancy. And that's it for backups, pretty straightforward.
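From the host, that's just the standard ZFS flow; names and destinations are hypothetical:

    # point-in-time backup without interrupting the running database
    zfs snapshot zroot/jailed/postgres@2023-09-16
    zfs send zroot/jailed/postgres@2023-09-16 | \
        ssh backup@offsite zfs receive -u tank/backups/postgres
    # or pipe the stream into tarsnap/restic instead; retire the snapshot after
    zfs destroy zroot/jailed/postgres@2023-09-16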
Now, how do we figure all of this out? The end result is relatively easy, but the process of figuring it out is painful: install the database, make everything read-only, start it up, see what crashes, and fix it. We bracket it with snapshots: we take a before snapshot, then we run the database, then we take an after snapshot, and then we can ask ZFS with a command called zfs diff. It's conceptually a bit like a normal diff, in that it shows you what's changed, but in ZFS's case it only tells you which files have changed, not what's inside them. Still, that's enough: you can use zfs diff to find the places the database is writing to that you didn't know about. The very final piece is really important. After we've changed the config files and put tables and logs into writable locations, we need to install some test data, destroy everything except the mutable data directories, redeploy the container, start the application, and dump the tables again: do a SHA-256 checksum, do a row count, and make certain we haven't lost anything. It's really important to do that.
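The bracketing is nothing fancier than this (jail and dataset names hypothetical):

    zfs snapshot zroot/jails/instances/pg@before
    service jail start pg     # run the database against a still-writable dataset
    zfs snapshot zroot/jails/instances/pg@after
    # prints the paths that changed between snapshots: the mutable locations
    zfs diff zroot/jails/instances/pg@before zroot/jails/instances/pg@after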
OK. So, for making things immutable, for applications, for jails, and also for the operating system, it's really the same six or seven steps every time. Change all the config files to put the writable stuff in writable places. Use Unix sockets. Use soft links for things like /tmp, /var/run and anything else applications think they want to write to. Move syslog to a network service if it's a jail. Use nested ZFS datasets to tune performance, and while tuning, use zfs diff to find the mutable locations. Once that's complete, the main dataset goes ZFS read-only, and we can use nosuid, noexec and similar flags on the mounts for the writable locations, to make things harder for attackers as well.

So, Michael's question is: we've got three regions, how does the data replicate? And the answer is magic. Magic, yeah. We have two main things. Some of the more modern applications are written with things called CRDTs, conflict-free replicated data types, and that's just a little bit of mathematics that says: if I have a certain number of items in a set and I add another item to the set, can I do that in such a way that I can take any other copy of the set, from a different point in time, and have it order them in the correct way? That's what CRDTs do. There's a whole bunch of flavours of them, so they're tunable: you can pick a different CRDT for a different situation. CRDTs were invented around 2010, I think, so maybe they're new for people, but if you're writing your own application you can choose that sort of algorithm, and then things catch up more or less instantly, as soon as they receive the data. The key thing about CRDTs is that they are sets with special properties, such that at a later point in time you can combine the data you have and it will always produce the same ordering, independent of time and independent of how you merge them. A really simple example: if you all agree to give me a sum of money at the end of the talk, it doesn't matter which order I receive the money in, I'll still get the same amount. Which will be zero, right? You know it's going to be zero, but the order doesn't matter. It doesn't matter if I get five bucks and then two bucks and then seven bucks; it always comes out to the same sum. That doesn't work for subtraction, it doesn't work for division, and it doesn't work for a number of other operations, but it does work for addition and it does work for multiplication. OK, so: conflict-free replicated data types. That's one.

What do we do about the applications that don't use that? The short answer is: I don't use those sorts of applications. If you're writing your own apps, you get that choice. It's much easier to do this with databases that support it natively, and much harder with Postgres and things like that. You always end up in the situation where you have a multi-master system, you're writing to secondary nodes, and you have to have some algorithm that tries to handle failover. A little digression: I used to work at HP, first in storage, and the amount of money people will spend on hardware to make an application that was never designed to be resilient survive failures is unbelievable. A hundred times what they paid to develop the application in the first place. It's much, much cheaper to make the application work correctly in the case of failure than to try to work around it with hardware. And you will always lose data in the hardware scenario. Always, because the application doesn't know there's a lag between the production node and the secondary, and it doesn't know how to handle the case where the primary has begun the write, the secondary nodes have received it and written it to storage, but they haven't acknowledged it to the primary, which has then crashed. The application doesn't know how to resolve that, and that's why we use CRDTs: CRDTs fix that problem. The three nodes start up again, they go "I've got something you don't have", and they know how to resolve it. So hopefully that's a useful digression. Magic. Magical maths.

So: deploying containers, onwards. What is a webhook? A webhook is basically this: between two servers, I post a little JSON blob to an endpoint that says "do the thing, make it so". That's what a webhook is, and it's how GitHub, GitLab, all of these software-repository tools do their continuous integration. You commit source code, and when it's written to disk they send a webhook to another part of their system, or maybe they use a message queue, and it works the same way across all of these services.

For version one, back in 2015, 2016-ish, our deployment was shell scripts, scp, reboots, and hopefully customers won't notice. It was 2015; who cared back then? It wasn't important. For version two, we changed our internal applications and made them pkg-compliant, so we could build them with pkg create (we'll look at that in a moment). That meant installing a new version of our application was just pkg install. That's actually pretty quick and pretty easy, and it's atomic, and then we just restart the application or the jail. At this stage it's not immutable; we're deploying new packages into containers that are clearly not immutable, but it was much simpler, much tidier. There were no half-finished deploys, no partially completed shell scripts: we deployed the whole application or we didn't. pkg is really good for that. For version three, we linked in the Git commit for the application and kicked off our CI build, which is another fancy word for a shell script. It runs the build, runs the tests, and if that's successful it runs pkg create, and we have this little tarball which we can then deploy to our systems. Still not immutable, but deployment is pretty much automatic at this point, and it's about 30 or 40 lines of shell script.
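Condensed, that version-three pipeline is about this much. The names, paths and URLs here are all hypothetical:

    #!/bin/sh -e
    # triggered by the commit webhook: build, test, package, request a deploy
    git clone -b main https://git.example.com/myapp.git /ci/build/myapp
    cd /ci/build/myapp
    make && make test
    # produce a pkg-install-compatible tarball from a manifest + staged files
    pkg create -M ./+MANIFEST -r ./stage -o /ci/artifacts
    # poke the deploy webhook; Ansible takes over with higher privileges
    curl -X POST https://deploy.internal/hooks/redeploy-myapp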
So we're starting to get up towards Kubernetes, but we've got another 20 million lines of code to go. A little bit of room to move yet. I do think the shell scripts had their place. They are not friendly to junior sysadmins, and they're not friendly to people who are only making occasional changes, but they're a lot cheaper than doing a full-on Kubernetes-type deployment. If you're a large company, Kubernetes is amazing; there's really nothing like it. But you really need a dedicated team of people to look after it, and that's usually not the world I live in.

OK. So we've written some source code and we've pushed it. The source-hosting tool generates an HMAC-signed HTTP request and sends it off. It hits HAProxy, which does some basic validation (TLS, maybe mutual TLS: did the webhook come from who it says it came from?) and some routing to decide where in the infrastructure to send it. And then we use a tool which is literally called webhook; that's at the top there. It's in FreeBSD ports, and it's a lovely little Go program. webhook checks that the HMAC is valid, and it runs as a low-privileged user, so there's no way for a privilege escalation to happen from the webhook user. It runs in a CI jail and builds the new package for us. When it's finished, it requests a package-based deploy, which is another way of saying it triggers another webhook, which runs Ansible with higher privileges. And all that Ansible does is redeploy the jails: remove, recreate. When a jail is recreated, it picks up the new version of the package. So we're now one step closer to immutability.

This deploy is automatic at this point, and it relies on the health check and on the application being robust. When the application starts up, it should be able to respond in some way that says: my database is working and talking to me, I can log things to the right places, I've got enough CPU and memory, I don't know what else. If it replies at that point that the health check is OK, Ansible re-adds this jail back into the load-balancing pool and waits until it settles. Then it removes the next jail from the pool, waits for that to drain, adds it back in, and repeats. So that process isn't quite idempotent: it runs the first one, the second one, the third one, until they're all done.

Version four is to experiment with tarfs. tarfs is a new feature which is in the FreeBSD 14.0 release, or will be, jointly done between Juniper and Klara. And tarfs is exactly what you think it is: a tarball that you can mount. You don't need to unpack it, you don't need to unzip it, it's clearly immutable, and it means we don't need ZFS, if that's something that matters to you. It's pretty neat. My hope here is that I can put the entire application jail into a tarball inside a package, so I can go pkg install containers/foo version 1, pkg install containers/foo version 2, and everything is contained in that. I've done a little bit of private experimentation, and I've got something that works but is not pretty. It's not suitable for committing yet, but the general idea looks pretty promising. We'll look at that later.

Next up: this is how the webhooks look from a configuration perspective. Which one is this here? Oh, yeah, this is the GitHub webhook. It's pretty straightforward.
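A cut-down hook definition for the webhook tool looks something like this; the id, paths and secret are placeholders, not our real ones:

    [
      {
        "id": "build-myapp",
        "execute-command": "/usr/local/ci/build.sh",
        "command-working-directory": "/usr/local/ci",
        "include-command-output-in-response": false,
        "pass-environment-to-command": [
          { "source": "payload", "name": "head_commit.id", "envname": "GIT_COMMIT" }
        ],
        "trigger-rule": {
          "and": [
            { "match": { "type": "payload-hmac-sha256", "secret": "placeholder",
                         "parameter": { "source": "header", "name": "X-Hub-Signature-256" } } },
            { "match": { "type": "value", "value": "refs/heads/main",
                         "parameter": { "source": "payload", "name": "ref" } } }
          ]
        }
      }
    ]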
It's the sort of thing you can see yourself setting up in ten or fifteen minutes. It knows what command it needs to run, so there's an execute-command there. It changes into a working directory, and then it extracts a bunch of things from the JSON blob it received from the upstream service and puts them into environment variables for the command. Now, the only thing we have to be mindful of here is that these variables are under the control of the remote service, which might be hacked. A classic way to attack an organisation today is to get malware onto a developer's laptop, hack the CI, get into production, win everything. So we have to bear in mind that those values are untrusted. But we don't deal with that here; that's not webhook's problem. And on the right we've got a bunch of trigger rules. The HMAC has to match, it has to have a signature from GitHub, and on the bottom right it says match type value, value refs/heads/main. That's a fancy way of saying that the Git branch this commit landed on was main. We only run this webhook if you did a commit and it landed on main, so you can do private builds on dev branches and they won't go through this process.

Version one did all of this with webhook and shell scripts, with no external dependencies, and it's very, very debuggable if you need it to be. You jump into a jail, run the command in the foreground, change include-command-output-in-response to true, and you can just watch it happening. For version two, I switched some of this to a hosted tool called Buildkite. We get very, very nice graphs, but I don't think it's less work or necessarily easier to understand. Maybe in a year's time I'll have a more refined opinion, but it's definitely less tooling on our side.

A bonus: arbitrary playbooks, via webhook, to Ansible. I just thought, hey, what can we use this stuff for? And the answer is everything. So now, if you know the secret HMAC signing key, you can post to our infrastructure from anywhere in the world and say: run this playbook. The main reason for doing that was that we had people travelling, like myself, or devs in other locations without good network connectivity. If the playbook takes fifteen minutes to run and you're in Hyderabad, it's really frustrating when the power goes out and you don't know what state things are in. This way, you post the webhook, you can tail the logs, and it runs on our own infrastructure. So that's pretty neat.

So, pkg create. The problem is this: we're FreeBSD people, we like to put things in the ports tree, we like Makefiles, and that's the way we do it. We build packages without network access, because that's the right thing to do, and we like that the result of that build is the same every single time. Unfortunately, not all the software we write, and the tools we use, particularly the JavaScript and Node.js tools, really work very well like this; they can be very difficult to use that way. So instead we use a feature of the pkg tool called pkg create, which generates a pkg-install-compatible tarball. It's identical to what the ports tree produces, but we run it ourselves, as a script on the command line, not as part of the Poudriere build.
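A trimmed sketch of the whole flow, with example field values; the signing commands at the end are the two or three I'm about to gloss over:

    # --- +MANIFEST (UCL), heavily trimmed ---
    name = "myapp"
    version = "1.2.3"
    origin = "local/myapp"
    comment = "our in-house application"
    desc = "built outside the ports tree, deployed with pkg"
    maintainer = "ops@example.com"
    www = "https://example.com"
    prefix = "/usr/local"

    # --- build, then sign the repository catalogue ---
    pkg create -M ./+MANIFEST -r ./stage -o /ci/packages
    openssl genrsa -out repo.key 2048              # one-time: repo signing key
    openssl rsa -in repo.key -pubout -out repo.pub
    pkg repo /ci/packages repo.key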
So it takes a manifest, which is the thing you can see on the left here, and apart from my sort of humorous comments all the way through, you can see that this is pretty straightforward. It's very much the same fields that we have in the Makefile in the ports tree; they're just in a UCL manifest. It also takes a directory of files, bundles them up, allows us to modify the permissions, and dumps the result in a directory. And that's really nice. We've experimented with using this for deploying TLS certs and signing keys, things that are private: instead of deploying those via Ansible, we can now put them in a private package repo and deploy them that way. SSH public keys; anything that doesn't really need versioning, but that we don't want in a server's golden image in case the image is leaked.

So, pkg create and signing. For brevity, I've omitted creating the package repo itself, but it's another two or three commands, where you make an RSA key, a public key from that, and a certificate from that, and use these for signing. The first pkg create takes the manifest and our prepared staging directory, the same thing you'd use in the ports tree, and makes an artifact, which is just a .txz file. We copy it into our CI directory and sign it, and a bit later, on one of the remote servers, a webhook runs that says: just upgrade the package in the jail. And that's it. I really like this; it's really tidy. It's fiddly to get the applications working the right way in the first place, but once that's done, we're good.

I'm just going to do a time check here. We are out of time, yeah, so I'll skip through the rest very, very quickly. App summary: immutable containerised apps; read-only ZFS clones; nullfs mounts; nested datasets; syslog to move things out of the jail; no root processes; and immutable deploys via webhook and the pkg tools, with the load balancer and the network making all of that invisible.

Immutable servers: this is the next layer down the stack. There's too much state smeared across key locations to make the whole system read-only. For people who are running appliances, building things like routers, that sort of world, I think it's much more manageable, because you have a great deal of control. But in my world, we run too many applications that need things in certain places, and we can't easily compile them and move them around. You start to make things so bespoke that it becomes more effort to manage, when another person starts, than having a more traditional server layout.

So, ZFS boot environments. I'm going to skip over this almost entirely in the interest of time. The key thing here is that we have a ZFS dataset for the entire root filesystem. We can version it, we can mount it, we can snapshot it, and we can even send it from server to server as a golden image. And when we want to, we can reboot. If the boot fails, we reboot again, the boot environment is rolled back to the original version, and we're back at the beginning.
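For the curious, the shape of that flow with stock bectl is roughly this; the URL and boot environment names are hypothetical:

    bectl list                        # what boot environments exist right now
    # stream a CI-built golden image in as a brand-new boot environment
    fetch -o - https://builds.example.com/base-13.2.zfs | bectl import 13.2-RELEASE
    bectl activate -t 13.2-RELEASE    # -t: next boot only; a failed boot falls back
    shutdown -r now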
So, Poudriere is the build tool we use to build all the packages in FreeBSD, and we can also use it to build custom FreeBSD images. We can build with the FreeBSD source we want, we can include the ports we want, and we can even include an overlay directory, which is the custom files that get dropped in on top. The output can be a memstick image, an ISO, a ZFS dataset, or even a tarball. I mentioned tarfs earlier; you can see here how we can reuse that again for our jails. And for input, we covered that already: a Git source tree and overlay directories. My experience here is that the less you put in the overlay, the better. It's much better to put those things into packages and have them pulled in that way. So we have some custom loader.conf settings; an fstab to make sure we have a tmpfs on boot; sshd is good, we want it on; a custom resolver; a tool called syncbe, which we're going to talk about next; our custom package repo, if we need it; and a list of packages we need to build.

So we run poudriere jail. It builds our jail from scratch, from source. This one's using releng/13.2; I guess in a week or two we'll change that to 14.0. It's building from Git over HTTPS, and this could be your custom Git repository or the FreeBSD one, and we want a GENERIC kernel. That produces a Poudriere jail, which is effectively a vanilla FreeBSD system. After that, we use Poudriere to build a bunch of packages, and that's just a file with a list of category/portname entries. So at the end of this we have the packages we want, we have the operating system we want, and we can build an image that contains those two things plus the overlay directory at the bottom. There are a couple of parameters I'm interested in here. In the middle: -s for swap, and a hostname of none. We want -o for output: where are we going to put this? That's our web server directory, so we can fetch the image later on. And right at the top you'll see the image type, -t zfs+send+be. There are about ten options for the type parameter, but the one we want says: make me a ZFS dataset, as a boot environment, that I can send somewhere.

So, going back to ZFS: we're going to take this image off our web server, stream it in, unpack it, do a little bit of tweaking, and reboot into our new server. This is the dream. We now have an immutable image that we can build through CI and deploy to any server, with just a small bit of customisation data to carry across. No questions? Good.

So we're almost at the end here. This is the culmination of the build process, and it's just two slides, pretty straightforward. The bectl command lists the boot environments. In this case, the system has just two: the default one, and, I guess that was the preceding version for patching, 13.1-RELEASE. Yep. We then fetch the image we made, the ZFS data stream, and we pipe it in, without doing any checksum verification (for the sake of simplicity on the slide), into the syncbe tool from Klara, from Rob Wing. We give it the name of the new boot environment we want (we want a 13.2-RELEASE boot environment), and we pass it a config file, which we'll look at shortly. You can see here that it receives the ZFS stream and unpacks it, and it took about 35 seconds, which is pretty fast for a full server deploy. It's pretty awesome, really. This script isn't finished, though; the next slide is what we want. syncbe: here we go, here's the last piece of it. The main catch with syncbe is that, as you can see, the ZFS dataset that's been streamed in is the same for every single server. Hostname, SSH host keys, certificates: all of these things should be unique per server, but they aren't in our template image. syncbe is just a little script, with some error checking, that says: copy these files over from the currently running server. So it copies them in, and at the end, we reboot.

My main takeaway is that we tried to go the appliance route, where we changed the configuration files in FreeBSD itself, which is easy to do, but it made life confusing for the people who had to maintain it. syncbe with ZFS datasets is much more natural, and it's really easy for people to understand: oh, this script, we need to maintain the list of things that need to be copied in. It's not ideal, but it's good enough.

So, last but not least, tarfs, and then we're done. tarfs, as I said, from Klara and Juniper, allows us to mount a tarball as a read-only filesystem. We can apply jails, nullfs mounts, all of those things we're used to, on top of this as well. It's coming in the 14.0 release. It may not be as fast as other filesystems yet, but I'm sure that will come. It only supports plain tar (not gzip, not bzip2, not xz), but it does support Zstandard compression. And it's so simple, this is really all there is to it: we take our release tarball, decompress it, optionally recompress it with zstd, mount it with tarfs, mount a devfs and a tmpfs so we have somewhere writable, and create the jail. On the next slide we should see some stuff happening in there.
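Spelled out, those steps are roughly the following. The paths are hypothetical, and the first line is only there because the release tarball ships as xz, which tarfs doesn't read:

    xz -d -k base.txz && zstd base.tar        # xz -> plain tar -> zstd tar
    mkdir -p /jails/rotest
    mount -t tarfs base.tar.zst /jails/rotest
    mount -t devfs devfs /jails/rotest/dev    # the usual writable bits on top
    mount -t tmpfs tmpfs /jails/rotest/tmp
    jail -c name=rotest path=/jails/rotest command=/bin/sh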
And there we are. We're not allowed to make a directory (the error that comes back is a little weird), but we can see that our tmpfs and our devfs are there, and it's a lovely read-only thing. So, in closing, this is where I'd like to be next year: our applications deployed as tarballs via pkg, so that redeploying a jail is just restarting it. I think I'd be pretty happy with that. OK, thank you. Thanks for hanging on a little bit longer; we lost a bit of time with setup at the beginning. So, may I look at questions? Maybe that's an array of just two. Any questions before we wrap up for the day? Go on.

So the question was about not using compression with tarfs. In this specific example I don't, but if I had used Zstandard compression, I could. Why Zstandard rather than some other format isn't really a question I can answer, but I guess it's about being able to address content and blocks inside a compressed filesystem and find them easily. I don't know if that answers the question enough. Anything else? Great, OK. Well, we'll free the pointer to Dave, and I advise going to the social. Please excuse my bad puns. Thank you.