Hi everyone, my name is Hongyi. I work at what's currently called IDA, though we'll become GovTech soon, in a month or so; I'm not exactly sure when. I work in what we call the data science group, a group within government that tries to do engineering, product and data science — not outside of government, but within it, so that there's some of that expertise available when making public policy decisions. data.gov.sg is one of our projects, and I'd call it one of our biggest. So, very quickly, what is data.gov.sg? For those of you who don't know — quick show of hands, how many of you have used data.gov.sg before? A couple here and there, cool. data.gov.sg is Singapore's open data portal. The government gathers a lot of information just from day-to-day operations, and we try to share as much of it as we can. Our mission is to help people understand and use public data. To achieve this we do a few things: we create dashboards, we write articles about various analyses — this one is about 4G connection speeds across Singapore — we publish the datasets themselves, and we also provide those datasets as APIs where possible. The whole thing is built on top of an open source library called CKAN, which is an open source data repository that handles user accounts, permissions, uploading, all that stuff. But that's not what I'm going to talk about today. Today I'm going to talk about this: a simplified version of data.gov.sg's production infrastructure.

Just to get a quick sense — most of you are students, I guess — how many of you have deployed an end-to-end application that's live and getting traffic from the wild? Anyone? Cool, a couple, a handful here and there. I'm assuming most of you have written web apps for school projects and things like that. How many of you have written a web app before, even just running on your local computer? All right, a good number. Cool. The difference between running something on your local machine and running it in production is a whole bunch of issues involving scaling, deployment, working with other team members, and dealing with malicious attackers from outside. I'm going to walk through how we go from a single server to this full setup, and why we take each of these steps.

So this is a basic setup for a web server. This is data.gov.sg in particular, but it applies to most web services. You have a single machine, and on that machine you have your database, which is Postgres; you have your actual core app logic, which is CKAN with our data.gov.sg extension built into it; you have Solr, which is what we use as our search server, so search and indexing; and finally you have the local file system for storing uploads — images people add to the site, and in our case, datasets. This is probably what you would do if you wanted to get the server up and running really quickly: you create a server on AWS, you get an IP address, you install and set up all your stuff on it, and after you've spent all this time on development, you point your URL at it and theoretically you're live.
People on the public internet can now see your web server. But there are a bunch of problems with this. The biggest one is that you do not want to be developing on your web server — you never, ever want to be developing on what's called production. So you normally want at least two servers: one web server running in production, which you don't touch, and another server that you can commit to, run tests on, share things with your teammates, and so on. This is the setup we use within the team. There's the production branch, which is what's been confirmed and tested and is currently live. There's an instance of the server that runs off the current master branch — the tip of tree that everyone has committed to and that has been code reviewed. And then we also have a demo branch; the point of the demo branch is just how we operate: if you want to show something to a different agency, or share something within the team or with other people to take a look at, that's a separate instance that you know won't get swapped out from under you.

So first and foremost, you go from one server to three, because you need different environments for different purposes — and identical ones. But as you can imagine, maintaining three separate servers with different libraries and different code bases, and keeping them all in sync, is a real pain. What you would traditionally have to do is log into one server, run some scripts, pip install all your CKAN stuff, install Postgres, install Solr, start it up, install Apache, configure Apache — and then log into the next server and do it all over again, and the next one, and do it all over again, and hope that you did all of them in exactly the same way, so that when something goes wrong you can test it on a different server. Because God forbid you type, for example, the database key from your production server into your demo server and screw up your entire app.

So how do you do this? Well, there's this thing called Docker. How many of you know what Docker is? OK, a good number. For those of you who don't: you can think of Docker containers as really lightweight virtual machines. They don't like being called lightweight virtual machines, but you basically use them that way. Instead of logging into a server and doing all that setup manually, you create what's called a Docker image, then you set up a Docker host and say: run this image. What that does is take the image and deploy a sort of mini, closed Linux environment, running whatever software and containing whatever files you put into that image. For our purposes, this means we can build one data.gov.sg image and, instead of setting everything up by hand in each place, deploy that same image to the different branches. That's how we keep the three environments in sync: we know all of them are running the same version of Postgres and the same version of Solr, they have the same file paths configured, and they have the same parameters and flags that we need.
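To make that concrete, here's a minimal sketch of the idea — the image name is made up, not our actual setup — showing that the same build-and-run commands work identically on any Docker host:

```sh
# Build the image from the Dockerfile in the current directory
docker build -t datagov/ckan:latest .

# Run it on any Docker host -- a production box, a staging box, or your laptop --
# and you get the same Postgres/Solr/CKAN versions, paths and flags every time
docker run -d --name ckan -p 80:80 datagov/ckan:latest
```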
But the thing is, you don't want them to be exactly the same — because then what's the point? If they're exactly the same, you can't do any development. What you want is for this one to be running the production branch of your code, this one to be running your master branch, and this one to be running your demo branch. You can do this manually. What we were doing initially was: deploy the image to get the base stuff running — Postgres, Solr, CKAN and all that — and then go in and install the separate branch on each one by hand. That kind of works; it's not too bad, but it's annoying.

What we did instead was move to this. How many of you know what a continuous integration server or service is? OK, a handful. A CI service is kind of like a butler that does stuff when you commit code. Whenever you commit code, there's a whole bunch of things that need to happen in a production environment: you have to test it, you have to build your new images, you have to spin down the servers, put the new code on them, and spin them back up again. That's what a CI server is supposed to handle for you. So we went from manually logging into each server and adding our extension, to having a CI. What happens is we have a Git repo with two things in it: our actual data.gov.sg code, and a Dockerfile. A Dockerfile is just a set of instructions for creating a Docker image — that's the way to think about it. Whenever we push code to our Git server, it notifies our CI server, which takes the Dockerfile and the branch of code that was pushed, combines them to build the appropriate image, and then pushes that image out to the appropriate server. So for example, if I push something to the demo branch — you use source control, you merge, commit, and push — it goes to the CI server, which takes the Dockerfile, builds the image, sees that it's the demo branch, and given that it's the demo branch, creates a demo image and puts it on the demo server, roughly like the sketch below. All of this happens automatically without us having to do anything each time, and it saves a lot of effort wrangling deployment.

And the really cool thing — because you're using Docker containers, and all Docker hosts are supposed to behave the same (I know I keep going on about it, but it really has saved us a lot of time) — is that there's no reason you can't run the same image locally on your laptop. This is a really, really big deal, because replicating the production environment locally lets you test things far more easily. I'm sure you've developed web apps before: there are so many things you have to install — Postgres, Mongo, whatever tools you happen to be using — and if you do all that on your local machine, develop against it, and then deploy to the server, it turns out the server is running a slightly different version of Node or whatever, and everything breaks. The way we've done it, you build this one image.
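To make the CI step concrete, here's a rough sketch of the kind of job such a server might run on each push; the registry, hostnames and branch-to-server mapping are all illustrative rather than our actual configuration:

```sh
#!/bin/sh
# Rough sketch of the kind of job a CI server runs on each push.
# Registry, hostnames and the branch-to-server mapping are illustrative.
set -e

BRANCH="$1"                                   # e.g. production, master or demo
IMAGE="registry.example.com/datagov:$BRANCH"

docker build -t "$IMAGE" .                    # Dockerfile + this branch's code
docker push "$IMAGE"

# Swap the running container on the server that matches the branch
ssh "deploy@$BRANCH.internal.example.com" "
  docker pull $IMAGE &&
  (docker rm -f ckan || true) &&
  docker run -d --name ckan -p 80:80 $IMAGE
"
```

The point of the sketch is just the shape of the automation: branch in, image out, deployed to the matching environment, with no human logging into anything.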
Once that image is built, you can put it on the server, or you can just pull it and run it locally on your laptop in a local Docker instance, and then do all your testing and development work without having to worry about whether you're in sync with production. Just to give you an example, this is the message we see — we use Slack for all of our chatbots — whenever we push a change to the server: it builds it, pushes it out, and sends this little message saying, hey look, it's built, it's live, you can go check it out now.

So that's getting a proper working production environment going. You start from one server and go up to three because you need different environments for different tasks. You use containers because containers let those environments stay consistent across all of these places. And then, instead of manually going into each container and modifying it, you set up a CI service to automate that task for you. So instead of doing all the grunt work of "okay, I've written my code and I know it works, but now I need to go and deploy it", you essentially write a bot to do that for you. That's the basics of deployment.

But deployment is just one of the many problems you come across when you start running things in production. The next one, a very notable one, is scaling. Scaling involves a few things. One, it involves being able to handle a lot more traffic coming to your servers. And two — another big one — is dealing with DDoS attacks. You might think that if you just run a small little service, no one will care, but there are a surprising number of very bored and very malicious people out there who will hammer you no matter who you are, even if you're trying to do a good thing. So you need to figure out how to defend against that. A bare server, even replicated across Docker and all that, is basically exposed to general load problems and DDoS attacks. Every request to data.gov.sg — whether you type it on your laptop or a bot hits the URL — goes straight to our server. And as you can imagine, it's very simple: too many requests and the server goes down, because it can't handle them all.

So what do you do? Well, the first thing is that you should set up a CDN. We use Cloudflare; there are a bunch of others — Akamai and so on — and they're all good and bad in various ways, but they all work on fundamentally the same principle. Instead of having one server — this is the thing, please don't talk to it too much, otherwise it gets tired — these CDNs have a network of servers all over the place. It's kind of an "if everyone's on board, you can't take down the entire internet at once" philosophy. Cloudflare, for example, has servers all over the world, including multiple in Singapore. So when requests go to the data.gov.sg URL, they first get split across all these different Cloudflare servers, so right at the front you have a much broader face to take the load. But if they just took all that load and forwarded it to your server, that wouldn't actually help you very much. There are two big things they do.
One is static assets — CDN stands for content delivery network. You see this on Facebook: the reason it can serve so many pictures everywhere so quickly is that for a static asset, which doesn't require any logic or computation — if it's just an image — it can be replicated across the CDN and the request never has to reach my server. The request says "I want profilepicture.jpg", and instead of asking your server every time, it just gets cached at the CDN and served from there. That already cuts out a good chunk of traffic. The next thing they do, apart from caching, is actually a very intelligent thing: detecting when you're under a DDoS attack, and detecting bots. I'm sure on websites here and there you've seen that thing that says "please check this box to make sure you're not a robot", even though you're on an actual laptop. That exists because detecting botnets, DDoS attacks, scrapers and all those things is not a simple task. It's a very, very complicated, heuristic thing where you're always half guessing: okay, we've got a bunch of requests coming from here, at this time, from this place, and they all seem to be looking for the same thing, so it's most likely a botnet — but I'm not sure. That's exactly what Cloudflare and the other CDNs do. Instead of you having to figure out how to do this anti-bot and anti-spam detection yourself, they just handle it — you can tweak the parameters if you want — and hopefully filter most of it out, so that the people who actually reach your site, the people you actually spend server time and cycles building web pages for, are actual human beings and not bots.

But at some point, even with the best intentions in the world, even with a great CDN that blocks all these things and caches all your images and your HTML, your traffic is going to grow enough that your server itself needs to handle more: actual genuine searches, actual genuine requests, actual genuine logic that changes — things sufficiently different that you can't just cache them. You need a server to run that computation and execute your program; you can't just serve static pages. So what do you do? Well, traditionally — and by traditionally I mean ten years ago or so, though it still happens nowadays — you would just buy a bigger server. You'd get, on AWS or wherever, a 16-core, 32-core, 64-core monstrosity, one of these multi-ten-thousand-dollar things that IBM only sells you if you have ten million dollars to spend with them, and you'd keep buying bigger and bigger machines and hope that Moore's law outpaces your traffic. But that turns out not to work — especially at the scale of Google, Facebook or Amazon. Maybe if you run the NUS website for people booking their CCAs or something, sure.
But if you're dealing with billions of requests a day, there is no single machine — or even a server-room-sized machine — that can handle all those requests. So what do you do? Well, the first thing is to split your services out from the core server, because it's very, very hard to scale one big complicated thing with a lot of internal dependencies. You take the basic data.gov.sg server and go from that to this: instead of one server with everything installed on it, you have a server that handles the logic, a server that handles Postgres, the database, a server that handles the file uploads — you can do this with NFS or one of these other remote file system things — and a search server. And once you've split it into four separate servers all talking to each other over RPCs, the next thing you should do is switch to hosted solutions where they're available.

Just covering this roughly: instead of running Postgres ourselves, we use Amazon RDS, the Relational Database Service. The advantage is that, just as you have problems with a single server, it's really, really hard to build a distributed, reliable database. It's a genuinely hard problem — you could spend an entire career building one — and if you did, you would probably end up with something like Amazon RDS, so why not just use that? It handles all the scaling for you. Instead of worrying about needing more database capacity and replication and all of that, you just use RDS; it gives you a clean interface, and it works. Similarly, instead of plain Docker, we are switching to Amazon ECS, the EC2 Container Service — this one is actually still in progress, we haven't moved everything over yet. As I covered earlier, with Docker you run the Docker daemon on a machine and then put your images onto that machine, so the machine becomes sort of hot-swappable: instead of setting it up every time, you just tell it to run this image, then cancel that image and run a different one, or run multiple images. But why bother setting up that initial server at all? What a container service like ECS gives you is a software service, just like RDS: you talk to an interface via an API and say, here's my image, please execute it. By doing that, you can get as many or as few instances as you want, and you never have to worry about running out of capacity and having to spin up more EC2 instances, or how big they should be, or whether the new EC2 instances run the same version of Ubuntu as the previous ones and whether there are incompatibilities. Just use ECS. And finally, the last thing we use is NFS — the Amazon version of this is the Elastic File System — which was a bit tricky to figure out, because CKAN by default doesn't allow for it. With databases and search servers and with Docker, by default they're all designed to work with remote services.
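Pointing the application at those remote services is mostly just configuration. A minimal sketch, assuming a CKAN image whose entrypoint maps environment variables into the config file — all endpoints and credentials below are made up, not our actual setup:

```sh
# Sketch: the application container just needs to be told where its services
# live. Hostnames and credentials here are made up, and the exact environment
# variable names depend on how your image's entrypoint writes them into
# CKAN's .ini file (the underlying config keys are sqlalchemy.url and solr_url).
docker run -d --name ckan -p 80:80 \
  -e CKAN_SQLALCHEMY_URL="postgresql://ckan:secret@mydb.example.ap-southeast-1.rds.amazonaws.com:5432/ckan" \
  -e CKAN_SOLR_URL="http://solr.internal.example.com:8983/solr/ckan" \
  datagov/ckan:latest
```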
But CKAN, the library, by default assumes you're writing things to local disk, and the problem with that is: I can replicate my database, I can replicate my search servers, I can run the logic on a container service, but if each instance of CKAN has its own separate uploads, then it's weird — I upload a picture to one instance, someone else logs in, happens to hit a different one, and they don't see the same results, or in our case the same dataset. So the Elastic File System is Amazon's implementation of NFS, the network file system. Do you guys know what that is? Any familiarity? Yes, no? Okay — basically it's a system where you can mount a remote file system and, for most intents and purposes, it pretends to be part of the local file system. It's a bit slower and it has some other issues here and there, but it works more or less for our purposes.

And now that you've split everything out, you can do this: you can replicate your core logic. Because remember, RDS, Solr, the Elastic File System — none of that is what we're actually interested in; that's all commodity software and hardware. What we've built is data.gov.sg, and that's what we want people to see. So rather than spending time figuring out how to scale those other things, splitting them out means you can scale your core servers: you just make a bunch of them and put them behind a load balancer, so that when Cloudflare sends legitimate traffic your way, you have a bunch of different servers all able to handle the logic, and if you need more, you just increase the number of servers. I can go from four to ten to twenty, twenty to a hundred if I wanted to, and because they're all talking to the same file system, the same database and the same search server, it doesn't matter which of the front-ends you hit. So you're separating your data layer from your logic layer, and that's probably one of the big things about figuring out how to scale your app: keeping the data consistent across all the front-ends, which in turn lets the front-ends themselves be replaceable and scalable in a very straightforward way. And if I need more database capacity or faster access to the file system, with most hosted services it's very easy — you just say "I would like a bigger database, please" and it gives you a bigger database. It takes maybe two minutes and ten clicks in a browser. It saves you a lot of trouble.

That's the basics of how you scale, but the last thing I want to cover about scaling is this: now that we've split everything out — remember how we had to have three different servers, for production, for master and for demo? Now you have to do that for all of this, and that's a pain, because each of these pieces has to be multiplied by three, and each needs different IP addresses, and it all turns into a big mess. So what we did was create separate VPCs. A VPC is what they call a virtual private cloud, and basically it's a network — that's the way to think about it. They call it a virtual private cloud because you can create them programmatically rather than physically.
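"Programmatically" here really is as lightweight as it sounds; a sketch with the AWS CLI, with arbitrary CIDR blocks and placeholder IDs:

```sh
# "Programmatically" just means an API call or two. CIDR blocks and IDs below
# are arbitrary; you would run this once per environment (production, master,
# demo), giving each VPC the same internal layout.
aws ec2 create-vpc --cidr-block 10.0.0.0/16

# Carve out identical subnets inside each VPC so that every server sees the
# same network layout regardless of which environment it lives in
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 10.0.1.0/24
```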
And what this means is that, just as we had three separate servers with identical environments inside the server, within each VPC every server sees an identical network environment. So, for example, the CKAN instance in the production VPC knows that if it connects to this IP address or this URL it will find RDS, and if it connects to that IP address or that URL it will find the Elastic File System — and it's exactly the same in master and the same in demo. None of these servers has to know which VPC it's in. They don't have to know they're in production, they don't have to know they're in master; from the perspective of a server inside the VPC, they all see an identical environment, more or less. And again, the advantage of keeping these environments identical, instead of fiddling with IP addresses and things like that, is that you remove another point of failure — because most of the time when you get weird failures in production, it's because you misconfigured something, or you assumed something in production that wasn't true on your development server, for example. By maintaining identical environments, you maintain consistency and ease of development. And similarly, just as your CI server handles deployments, you can use Docker Compose, which deploys a whole bunch of containers at once. So instead of saying "I want to deploy the CKAN image to container one", you write a compose file — a docker-compose.yml — which says: for this set of Docker containers, I want four CKAN instances, one Solr instance, one datapusher instance, one data processor instance and one validator instance. And then with a single command — docker-compose up — it spins all of that up on one server. That way you can operate multiple separate things, all loosely coupled, without the overhead of having to manage all of that yourself; you can just treat them as different parts within one machine. Yeah.

So finally, the last bit I want to cover is backups. We've talked about how you set up your development environments, and we've talked about how you scale the production environment so it can handle the traffic; now you need to figure out what to do when your server goes down. Going back to the simple single-server case: the server crashes. It could be many things — it could be that Amazon dies as a company, it could be that your hard drive failed, it could be — and I'm not even kidding about this — that cosmic rays hit a bit of memory in RAM and caused some fault. It has happened. For your laptop, sure, the chance is negligible; but when you're Google or Facebook and you have tens of thousands, hundreds of thousands of servers in massive server farms, a cosmic ray hitting one of them isn't some weird edge case, it's an inevitability. It happens maybe once a year out of ten thousand servers, but it will happen, and machines will just die randomly. And if you only had that one server, and that was your entire web app, your service is dead. So what do you do?
Well, there are a lot of answers to that question, but for us at least, this is roughly the answer. Because we've split our stack into separate services, we have a very clean and simple way of doing backups. The code itself on the server is unimportant, because remember, we've set things up so that containers are throwaway things that can be replaced immediately, and we push everything to our Git server, which automatically builds and deploys our instances. So if any of those instances die, no big deal — spin up another one. It doesn't matter if they get corrupted, it doesn't matter if the hard drive fails, it doesn't matter if Amazon dies; you can just spin up a fresh one, even on a different container service. For the database, there's a whole body of work around handling database backups, especially if you want every single committed transaction to be recoverable. For us, it's public data — we can get it back again relatively easily; we just don't want to lose too much of our work — so we just do daily snapshots and dump the files locally, so we have them on hand. And finally, for the file uploads, we have a daily cron job that dumps everything to Amazon S3, and — in a very convenient feature they provide — there's an archiving rule that automatically migrates things from Amazon S3 to Amazon Glacier. If you're not familiar with Glacier: S3 is the Simple Storage Service, and its semantics are pretty simple — put file here, get file back. Glacier is the same — put file here, get file back — except it's really, really slow, but really, really cheap. Instead of putting a file and getting it back within milliseconds, or sometimes seconds, as with S3, when something is migrated to Glacier the cost drops to something like a tenth, but getting that file back takes maybe 30 minutes while they go and read it off tape or whatever — absurdly long times, but good enough for archival purposes. This way we have a recent backup whenever we want it, plus long-term storage we can pull from in case we realise that for the past year we've been writing rubbish onto our website or something like that, and we need to recover from way back.

So this is our backup system, and it makes the recovery procedure very easy. Let's say a meteor hits and destroys our data centre. What do we do? First, you set up a new VPC. You spin up a new database — it doesn't have to be RDS; it's just Postgres, so you can spin up whatever database you want — and you take your database dump, which you've hopefully stored off-site in a safe location, and load it back in. You do the same for your blob stores — all the big file uploads that aren't in the database — and just pull everything back down from S3. Then we rebuild the images and deploy them using our CI service, rebuild the search index, and finally point the URL at the recovered system. So that's the failure scenario: when everything goes wrong, what can you do?
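To make the backup half of that concrete, the daily jobs don't need to be anything fancier than this kind of sketch — paths, bucket name and schedule here are illustrative, not our actual setup:

```sh
#!/bin/sh
# Sketch of the kind of daily backup job described above (run from cron,
# e.g. "0 3 * * * /usr/local/bin/backup.sh"); paths and bucket are illustrative.
set -e
STAMP=$(date +%F)

# Database: nightly Postgres dump, kept on hand locally
pg_dump -Fc ckan > "/backups/ckan-$STAMP.dump"

# File uploads: push the blob store to S3; an S3 lifecycle rule can then
# age these objects out to Glacier automatically for cheap long-term storage
aws s3 sync /var/lib/ckan/uploads "s3://example-datagov-backups/uploads/$STAMP/"

# (The code itself needs no backup job: it lives in Git, and the CI server
#  can rebuild and redeploy the images from scratch.)
```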
That's something you need to have an answer to when you're writing production applications. So, overview and conclusion. What we've covered today is basically three main things. The first is how you handle basic deployment: being able to maintain separate environments and push code to them — maintaining separate instances, and using a CI server to take your source control, build a new image, push it out, and do the deployment and the testing. That's deployment. The second thing we talked about is scaling. Scaling in this case involves, one, putting a CDN in front of your server so that it absorbs all the crud from the internet and minimises the traffic that actually needs to reach your server; and two, splitting your server out into its disparate services, using hosted services whenever possible, so that you can scale the parts you really need to scale and replicate the parts that really need to be replicated. And finally, we talked about backups — durability — what to do so that all your hard work building your next hot internet service doesn't evaporate overnight because, literally, a cat got into the power station and fried the entire server room, because that has happened before. There are plenty of ways of handling backups; for us, our code is backed up on our Git server, which is replicated across a whole bunch of places and everyone's computers, so we're not too worried about that; the database is dumped daily, and the files are dumped daily as well. Recovering is then just a matter of taking these three pieces, putting them back together, and pointing your URL at it.

So, conclusions. Containers simplified deployment, and containers simplified development — I cannot emphasise enough how much of a godsend it has been to develop locally with Docker, with the same thing that's running on the production server, without having to spend all that time installing all this crap just to run things on your local machine. A CI service is a bit of work to set up, because you have to put in some logic about which branch goes where and so on, but it is really, really worth it if you're going to be working on something for even six months; it's worth setting up a basic CI service just to do the pushes so you don't have to go through deployment manually every time. Use a CDN — it solves so many problems that are faced by so many people, and you do not have to solve those problems again; you should trust someone who is smarter and has spent more time on this than you. Similarly, for scaling, split your server up into its various services and use hosted services wherever possible — same argument as with the CDN. Don't try to rebuild everything from scratch because you think you're smarter or it's not quite your use case, because chances are you'll start building it and realise that the people running AWS or Akamai or any of these places have spent quite a bit of time figuring out all these edge cases, and you really want to tap on that as well.
Unless you are someone — well, even someone like Netflix uses AWS to run and manage their infrastructure for them, and you are much, much smaller than Netflix and not as smart. So seriously, try not to rebuild these things, because you will regret it if you do. And finally, backups are something a lot of people forget, but it really doesn't have to be that hard. Handling proper tip-of-tree, every-change-is-recoverable backups is difficult, but a basic backup system — where if someone literally stole your server hardware away, you could recover relatively easily — is not that hard to set up, and you really should do it. Cool, so that's the end of my talk. I'll take any questions.

How do you cut down your AWS — say that again? During your lean time, when you don't have much traffic, how do you cut down your AWS bill? Right, how do you cut down your AWS costs. So it depends — AWS has so many different services that it's hard to generalise. There's a free tier for almost everything. When we were doing development we lived on the free tier for the first six months, because we weren't live yet and there was no reason not to. And even after we went live, a couple of things were still on the free tier because we didn't have very much traffic. So live there as long as possible. The honest truth is that if you're in a situation where you're really, really taxing your AWS resources, there's probably not much I can do for you. But the biggest source of wasted money on AWS is buying big things in the hope that you'll need them later — buying the XL-size EC2 instance and the triple-XL database thing that costs you $5,000 a month or something like that, hoping you'll grow into it. The thing about the cloud is that, yes, if you're buying physical hardware and putting it on a rack, sure, you need to buy ahead, because buying new hardware all the time is annoying. But spinning up a new instance takes like two minutes: you log in, you click "I want a bigger thing", and they give you a bigger thing. Yes, you might experience some app downtime while it transitions to the new system, but you don't have to buy things in anticipation of what you'll need later. Just scale it up as you need it and you'll be fine. Yeah.

Do engineers on your team have shell access to production servers, and if so, how do you manage key access? Engineers on my team — you mean the three of us, then? Yes. So, okay — there are a bunch of teams, and we have about ten engineers in total. We are small enough that we don't worry too much about who has access to what; we're really just literally all sitting in a circle. For data.gov specifically — yeah, key management is a hard problem. There are more passwords on post-it notes than I would like to admit. But we are trying to move towards various systems of password management — there's LastPass, there's 1Password, there's TeamPass, there's a bunch of these systems.
None of them work perfectly, and in the end the problem is between keyboard and chair: you have to educate and train people to use these things as much as possible. So you have to pick something that gives you security but isn't a pain in the ass to use, because if you make it a pain in the ass to use, you will end up with the SAF problem of every laptop having a post-it note stuck to it.

What do we use for orchestration — do we use [inaudible]? We don't use anything really sophisticated right now. Basically we just write a bunch of shell scripts; that's our current situation. We are trying to move towards more hosted solutions — for CI, for example, we're trying to move towards Travis. Our current CI is really just something we threw together to get up and running, because we were sick of logging into the servers. But I would say, having tried out a few while we were transitioning, a lot of them do a lot more than you need. So I wouldn't fret too much about it. You're right that there are a lot of edge cases, and if you're at that scale, sure, then you care. But if you're just trying to do something simple — build an image, push code — almost any of them will do fine. Yeah. Anyone else?

Have we ever had a big incident in production? Yeah, sure. So — did we have all of this running before we went to production? We really didn't. A simple example: we were running the site, we'd launched it — you know, data.gov.sg, check it out, it's nice — and some of you visited it, which was awesome, but it was still only getting X amount of traffic. Then, because we have all these charts, Channel NewsAsia picked up one of them — I think it was something about drug use or criminal cases — and published an article about it. And the thing about our charts is that we'd tried to be clever: instead of serving images, we wanted dynamic, interactive, live, full-data charts. So this got embedded on the CNA website, and the amount of traffic went from X to, as you can imagine, much more than X. And embarrassingly, this was our time to shine — people are finally coming to your site — and it went down. Luckily, we had put Cloudflare in front of it, and Cloudflare, being the very smart people that they are, detect when your site is down and serve up a cached copy. So you couldn't do some of the more interesting interactive stuff, but you could at least see that the site was there, see the rough content, the image and the text describing it. So it wasn't too bad — but yes, that was wonderful. Another one, funnily enough, was when we gave a talk at the AWS conference in Singapore — and it turns out that giving a talk at an AWS conference about your website suddenly brings a lot of people from around the world onto your website. So yes, that was another case of the site falling over. Things we learned from it: well, we figured out we could cache a lot more, because a lot of these CDNs by default will cache images and a few other things, but not everything. Facebook can't cache everything, because every person has a different Facebook feed.
But on data.gov.sg, the only people who log in are agencies trying to edit data. For the average member of the public, the HTML is more or less the same for everyone. So we set it to cache the HTML as well, and that saved us a lot of server load, because you don't have to run your templating engine and all of that on every request — it stays the same. Most datasets don't change every two minutes. So that was one lesson: cache the HTML. The second thing we learned was how to test our server load. Before this, we just looked at it and went, okay, yeah, that seems fine, I think we're doing things more or less correctly, we followed the guides, it should handle the load. But when you're running things in production, you have no idea what kind of traffic you're actually going to get. So the correct way — well, I don't know whether it's the correct way, but the way we do it — is to take a sample of incoming requests from some massive-load day, say the couple of hours when your site went down, and when you spin up your demo server to test whether you've fixed the problem, you just replay them. You basically create a bot that does nothing but yell those requests at your test server and see whether you've fixed the problem. It's similar to how you debug code — this is how you debug production: you expose it to the same antigen and see whether it falls over, and whether it survives the second time. Yeah. Cool, any more questions? Yes?

Netflix has this crazy thing called Chaos Monkey — a tool that runs in production and randomly goes around killing random services, just to make sure things still work when random services die. Are you guys exploring something like that? I don't think we are Chaos Monkey-proof at the moment. But you're right — Chaos Monkey is a perfect example of how you actually test that your stuff works: you continuously expose it to the things you think will go wrong. I don't think we're at the stage yet where everything would keep working, because it isn't justifiable at the moment for us to have replicas of everything — as you said, we don't need that much. Even some of our smaller services are already bigger than they need to be, so replicating them three or four times is just a waste of money, generally speaking, and the likelihood of those failure scenarios is too low. The question is often not so much whether your site is bulletproof — because every single site, literally every single site, even Google.com and Gmail, as you well know, goes down sometimes. The question is which failure scenarios you have designed around, and which ones you're willing to design around. I used to work on an infrastructure team, and you would look at things like: okay, what happens if a cell fails? For Google, at least, they have this system called Borg, where instead of running on physical machines, you give the Borg master a task and it spins up a little job somewhere in the data centre — you never even know which machine it is.
And so physical machines dying is not a problem anymore. But then you have bigger problems, like entire data centres going out. Redundancy, in most people's minds, is: okay, we have two servers, one server goes down and the other takes over. But there was one failure where literally a cat got into the power station and fried itself, and the entire data centre along with it. So now you need to design not just for a service failing, but for entire data centres failing. And then you have even bigger outages — I think there was a case where a shark bit through an undersea cable, or something like that — where you get massive network partitions splitting the internet. I'm sure over the past few years there have been cases where Singapore can't reach American websites, or they're really slow, because one of the undersea cables died. So for us, I think we've designed around most of the common failures — basic DDoS attacks and basic server failures — and we've got backups in the office. But theoretically, if you pulled out the right plugs in our system, it would still go down. We're hoping that doesn't matter too much, because the data is all agency data anyway, so we can recover all of it. So it's a question of how much effort you want to spend designing around how absurd a scenario. Cool. Any more questions? If not, all right.