My name is Neil Armitage. I'm a deployment engineer for Continuent; Continuent provides a software platform for building highly available database clusters. I've been a DBA for about 25, 26 years, which is a scary number nowadays, and I do deployments for Continuent. We tend to do two or three deployments a week, and these are all done remotely, so I actually live locally here but work generally around the world. I'm also a developer for their cloud operations team, so I sort of wear two different hats. This talk came about because we were doing an awful lot of deployments for customers, and they suddenly wanted to go into the cloud, and they hadn't really thought about the differences. They'd been doing on-premise deployments for years, essentially, and they thought they knew how to do things, and then they started to move into the cloud using the same mentality. If you do an on-premise or a co-location deployment, you'd generally raise a project, then you'd need to get that signed off at various levels of management, then you'd have to order the hardware, which could sometimes take a month to arrive. Then you'd have to get someone to rack and cable it; in a large company, that would be a totally different team. Then it would need to be connected to the network, which, again, could involve paperwork to get IPs assigned. Then you'd normally have to get someone else to install the operating system. Then it would quite often need to be connected to the SAN. And all of that was different teams, and it could potentially take weeks or months. I've actually seen stuff like that take years. And you'd end up with a nice rack of hardware, neatly cabled. Some guy, normally me, would spend an afternoon making the cables all neat and tidy and labelled. And then we started moving into the cloud, and all you needed to get your data centre up and running in the cloud was someone's credit card. You'd call an API in OpenStack and that would give you three servers.
And it would do the same in Amazon. Bang, you've got your servers up and running. Quite often people would go and deploy in production like that without really thinking about what they were doing. There are various cloud providers out there. Amazon is still the king in this market; 90%, actually 99%, of our cloud deployments are in Amazon still. We deal a little bit with Rackspace. I haven't come across HP Cloud yet. We do an awful lot of on-premise clouds in VMware; that's quite common in big companies. I haven't seen OpenStack in the wild yet, but we use it internally for testing and development. So in Amazon, if you spin up a server, where does it actually exist? Amazon have seven or eight locations all around the world. If I'm working with a European company, we're normally restricted to using the Irish data centre, because in the EU you want your data within the EU, essentially. So that cuts down the number of data centres you can use. Each Amazon location is a region, and within each region you have availability zones. Those availability zones normally refer to a single physical data centre. So in Northern Virginia, they have five data centres which make up US East, and they're normally labelled A, B, C, D and E. What I have found is that on my account, US East B can be different to someone else's US East B. They seem to switch the letters around so we don't all get the same data centres if we go for B. So if you provision nodes in the cloud, you say, I want three nodes, and Amazon will pick a bit of hardware and stick those nodes on it. You don't know whether, as in this case where I've provisioned three nodes, two of them have ended up on the same physical hardware. You've generally got no control with Amazon over where those nodes go; they will just pick a bit of hardware. So in this case, if server A disappears, you've suddenly lost 66% of your servers in one fell swoop.
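The placement idea above can be sketched in a few lines. This is a minimal illustration, not Continuent's actual tooling: it just spreads nodes round-robin across a list of availability zones so that no single zone (or co-located bit of hardware) holds most of the cluster. The function and zone names are invented for the example.

```python
# Illustrative sketch: assign each node to an availability zone round-robin,
# so losing one zone never takes out most of the cluster.
def spread_nodes(node_names, zones):
    """Map node name -> zone, cycling through the zone list."""
    if not zones:
        raise ValueError("need at least one availability zone")
    return {name: zones[i % len(zones)] for i, name in enumerate(node_names)}

# Three nodes across three zones: each lands in a different zone.
placement = spread_nodes(["db1", "db2", "db3"],
                         ["us-east-1a", "us-east-1b", "us-east-1c"])
```

With only two zones and three nodes, two nodes would share a zone, which is exactly the trade-off to think about before provisioning.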
It's the same with the data. Amazon store data in what they call EBS; Rackspace has a similar thing. Your data can end up on any disk and it can be spread around various disks. You don't know where it is. Amazon claim they have two copies of your data at all times; I assume it's true. It's the same with the networking. We generally find networking in Amazon is very unreliable. You get an awful lot of dropouts. Servers can be disconnected without you knowing about it, and they'll come back milliseconds later, but we're normally doing high-volume data transfers, and that millisecond outage will disconnect our processes. So we've had to re-architect our software to work in Amazon specifically, and I'm assuming it's going to be the same in all of the other clouds. You don't actually know what the networking is; they just provide you with a connection, and it can be whatever they fancy. There's obviously a security consideration about moving data around cloud environments: you don't know where your data is going. When we used to do high availability (and it was only probably a year or so ago I was doing this on-premise; it's considered old hat nowadays), you would make sure you had two servers in different racks. So if you had a master and a slave, you'd make sure the guys racked them in two different racks. All the machines used to have redundant power on them, so if one power supply failed, the other one would kick in. They would normally have two NICs as well, so you'd have a redundant network. And you'd have redundant disks, so it'd be RAIDed or backed up onto a SAN. If you were going to deploy a backup server, you'd have it in a separate location, and you'd normally have two network connections to your backup data centre. When you get into the cloud, it's very different, and we're finding cloud reliability, certainly in Amazon, is quite poor. Amazon has quite a lot of outages at the moment.
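The re-architecting mentioned above boils down to tolerating transient dropouts instead of treating them as fatal. Here's a minimal sketch of that idea (not Continuent's actual code): retry a flaky operation with exponential backoff, so a millisecond blip doesn't kill a long-running transfer. All names are invented for the example.

```python
import time

def with_reconnect(operation, retries=5, base_delay=0.01):
    """Run `operation`, retrying on ConnectionError with exponential backoff.
    Only gives up after `retries` consecutive failures."""
    for attempt in range(retries):
        try:
            return operation()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # genuinely down, not just a blip
            time.sleep(base_delay * (2 ** attempt))

# A stub connection that drops twice, then recovers, like a network blip.
class FlakyConnection:
    def __init__(self):
        self.calls = 0
    def fetch(self):
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("transient dropout")
        return "row data"
```

A plain call to `fetch()` would fail on the first dropout; wrapped in `with_reconnect`, the transfer survives.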
I think we've had five or six in the last year, and they've taken out companies like Netflix, so you need to engineer your database layer to handle these failures. Failures used to be rare in your on-premise locations; I can't personally remember losing a server in probably ten years. It was a rare event. Now our support desk is seeing people lose Amazon instances every day, so you need to think differently. Netflix has had quite a few outages in Amazon recently. You need to rethink how you're spreading your database servers. If you're only going to deploy two, which we try and talk our customers out of nowadays, you want to make sure they're in different availability zones, so if Amazon loses an availability zone, you're still going to have one machine. What we try and get our customers to do now is at least have a backup in another Amazon region, because we have seen the whole of US East drop offline, and it's quite painful for them. What we prefer, and this is how we run our own internal systems as well, is to replicate out into Rackspace too, so we're not dependent on Amazon at all. If Amazon goes completely, we've still got our databases up and running. Basically, we've seen enough problems with Amazon that we've now got a bit paranoid that one day Amazon may go down. It's going to be a fun day, but hopefully it doesn't come soon. So you need to consider the locations where you're going to deploy these servers, and you need more nodes than you used to. In the past, I would quite happily recommend a master and a slave for databases, and I'd be fairly certain we would only ever lose one of those machines at once. I got away with it, generally. In Amazon, anyone deploying with two nodes gets a strong warning from us that it's probably not a good idea. Nodes in Amazon can just disappear. We've seen this recently; this has actually come out of one of our customers' support tickets.
The master database just disappeared. The whole server just went. We never found it. Amazon never responded to their support tickets, which is quite normal for Amazon. We recovered the EBS volumes eventually, but the node never came back. That's fairly normal. Basically, you need to assume you're going to lose some nodes, and don't assume you will be able to reprovision nodes quickly either. When Amazon have an outage, quite often their APIs get saturated by everyone trying to spin up new nodes, and people like Netflix will have priority over us in getting those new nodes up and running; the big customers will have priority. I've been in a position where there's been an outage and I can't spin up new nodes; I'm just in a queue, waiting. Make sure you back up everything many times. S3 backup is cheap. It's nothing. The cost of backing up 10 times into S3 is irrelevant for most big businesses. You can also deploy and find you've got a bad node. They come up and they run really slowly, and there's no reason for it. They can just have bad performance. You could be on a machine where another customer has got a virtual machine which is doing something a bit weird, and all of a sudden your node just gets really slow. Our customers will go onto those nodes and start trying to debug what's wrong with them, and we'll get a support ticket saying "we've run some OS tools and we're getting bad performance". They've just wasted half a day gathering that information. You might as well just throw the node away. It's pointless. If you've got a bad node, throw it away and get a new one. I think the phrase is "pets versus cattle". In the old days you used to treat your servers like pets: you'd look after them, you'd care for them. With cloud environments, treat them like cattle. If they're useless to you, just shoot them, get rid of them. You're wasting your time trying to fix a bad node in Amazon.
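The "cattle, not pets" policy can be made mechanical: instead of debugging a slow node, check a couple of health metrics against thresholds and, if any is breached, mark the node for replacement. This is a hypothetical sketch; the metric names and threshold values are invented, not anything Continuent ships.

```python
def should_replace(metrics, max_io_latency_ms=50, max_lag_seconds=300):
    """Return True if a node looks 'bad' and should be thrown away and
    reprovisioned rather than debugged. Thresholds are illustrative only."""
    return (metrics.get("io_latency_ms", 0) > max_io_latency_ms
            or metrics.get("replication_lag_s", 0) > max_lag_seconds)

# A node with pathological disk latency gets replaced, not investigated.
slow_node = {"io_latency_ms": 400, "replication_lag_s": 12}
healthy_node = {"io_latency_ms": 4, "replication_lag_s": 8}
```

The point of encoding it as a rule is that nobody is tempted to spend half a day running diagnostics on a bad instance.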
You'll probably never work out what's wrong with it, and Amazon very rarely respond to support tickets. Normally, because we're running a master and several slaves, a bad slave will suddenly get really far behind; you'll get a huge lag on that slave. You can even log on and see it's slow: just create a one-gigabyte file on that machine and on another, and one will be an awful lot slower. You're just wasting your time trying to fix it when you can deploy another node in an hour and copy the data across. In the cloud there are various ways of running databases. Amazon have their RDS service. It runs MySQL, Oracle and SQL Server. It's very popular. You can have read-only slaves in it, and it spans multiple availability zones within one region, but you can't replicate your data with RDS across to another region. It's simple to set up and use, but you're locked into RDS, so you're committed to using Amazon's tools. It's AWS only. You can now replicate data out (they've changed that recently), but only to migrate to another platform. You can't have your main database running in Amazon RDS replicating to something in Rackspace. There are some options for replicating data in. Our open source replicator is, I think, one of the few tools that will replicate from MySQL into RDS, which actually got one of our engineers in trouble, because we're a competitor of RDS and we provided a tool to replicate data in. Because it's open source, we can't pull it back. So there's no multi-region support. If one of the nodes goes, their failover can take up to 10 minutes; I've certainly seen it take that long to fail over between nodes. But in some cases that's quite acceptable; it's not a problem for some people. Rackspace have a product which is MySQL only. As far as I can tell, it's single node; there's no replication built into it. The documentation is a bit better on what they provide.
There doesn't actually appear to be any way of doing backups from it either. Google Cloud have a solution; as far as I know, it's Google only, and there's no way of replicating in or out of it. HP Cloud are running Trove, I believe, which is part of OpenStack, but at the moment Trove doesn't support any form of replication; it's single node only. OpenStack itself has the Trove product, which is not in Havana and is probably going to be available in Icehouse. We're slightly involved with Trove, hopefully to do the replication side of it; we're off to Hong Kong next week to discuss it with them. This is how we deploy stuff in the cloud. Our customers generally split into various camps. We've got a lot of big customers that want it on-premise. We deal with people like the Financial Times, Marketo, and a few other customers I'm not allowed to talk about, and they've got it in a private data centre. They're quite happy with that. We've got a lot of customers that are in Amazon only; I think our biggest customer has about 100 clusters in Amazon, so quite large deployments in Amazon only. They're used to working in Amazon, and they know they're going to lose nodes occasionally. Quite commonly, we're doing a lot of setting up of backup and DR sites in Amazon, so if customers lose their main data centre, they can fail over to Amazon. Or we've got people that are running in Amazon and backing up into Rackspace. Generally, they're looking for flexibility, and they want to put their data where it's most cost-effective for them. They use us because we can move their data into another cloud if they want to, or another on-premise location. We don't care where we work. Most of our on-premise clouds are VMware; it's still hugely popular. Basically, they want to spread the risk to their data. If their data goes, most of these companies don't exist.
If Constant Contact loses all its email addresses, Constant Contact doesn't exist. That's another one of our bigger customers, so they're very paranoid about their data. When we deploy in the cloud, we automate everything. We don't rely on me remembering to do a certain set of steps on each machine. We use Puppet to work across clouds. We started off going down the route of having build images for each cloud environment with all our prerequisites built into them, but we found that was too difficult to maintain, so we pulled all of that out into standard Puppet scripts. We now just use standard OS images. Generally, we pull a CentOS 6 image from the cloud provider we're going to use and customise it with Puppet. We run in what we call masterless Puppet mode. We don't have a Puppet master, because we found that would introduce a single point of failure: if you've lost your database server, your Puppet master may have gone as well, so before you can get your database server back, you've got to re-provision your Puppet master first. Essentially, we just pull the Puppet modules onto the local machine and run them there. Our install tooling will work across multiple nodes in parallel now, so we're deploying large clusters very quickly. And we've got a prototype GUI, which we use generally for sales demonstrations, because most real sysadmins don't want to use a GUI, but it's very powerful for demonstrating how we deploy in the cloud. Underneath there, there's an API that we actually use from the command line for deploying clusters. Generally, we've found most cloud providers don't provide a secure way of communicating between their sites. So if you've got something running in Amazon East and something running in Amazon West, all that traffic is going across the public internet, and you then need to provision your own secure way of communicating between the data centres.
We used to deploy OpenVPN on each site, and then we would need to put in a second backup OpenVPN server in case the primary VPN server disappeared, so we were running multiple OpenVPN servers. It got to the point where this was getting annoying, so we've now beaten up the developers and built encryption into our own software. We're encrypting everything. All our network traffic is encrypted within the software, which means we don't need to worry about setting up VPN tunnels all over the place. That made my life an awful lot easier when they deployed it. I think it took the Java guys a couple of days to do, and they found it was really easy, which was quite nice. As best practice, we RAID EBS volumes. EBS volumes are essentially Amazon's version of a SAN. You allocate a 10 GB EBS volume, and we've found they can disappear as well. So we actually take three or four EBS volumes and RAID them with software RAID on the servers. You get about a 10% loss of throughput, but we would rather the data is a bit more secure. We take backups of every machine every few hours, quite often. We use XtraBackup, which is an open source MySQL backup tool, and we back up into S3. We also use EBS snapshotting on all of our RAID volumes, so we've got snapshot copies of everything as well. And we monitor everything. If a customer hasn't got any form of monitoring switched on, we don't let them go into production. We use Nagios, Zabbix, email alerts, New Relic: just about every form of monitoring we support, and we would recommend having two forms of monitoring wherever possible as well. I've generally found Amazon's own monitoring is not particularly reliable, so we try and use a third-party monitoring tool. As for the future: we're finding we're dealing with more and more clouds regularly (this is not a great slide; it's not easy to view), and each cloud has its own API.
So we were repeating things over and over again, and we've built an API layer which sits in front of all the clouds that we deal with. I can now say, give me a node in Amazon, and the API will return me those nodes, or give me some nodes in our OpenStack environment, and I get the same thing: an array back with the details of those nodes. I don't care where they are, generally. And it means we can deploy quickly. As an example, I deployed for a customer the other day 15 nodes across five data centres, three in Amazon, two in OpenStack, and it took us 32 minutes to deploy the nodes, configure them with Puppet and install our software. Probably 31 of those minutes was us tailing the logs, watching it happen. So we're automating as much as possible. Now a demo, which will normally go wrong; my experience with live demos is not great. So what I have here is a node I've just deployed. It's in one of our OpenStack environments. It's got nothing at all running on it; it's just a clean CentOS installation. I don't think there's any Java on it. There's no MySQL. All I've done is install the Puppet modules which we use to deploy nodes; the only thing deployed on here is those modules. So all we do is set up a Puppet command to install the node. We're just going to call our Puppet module, which is called Continuent Install. It's possibly not the most imaginative name, but it worked on GitHub for us. And then we specify a load of parameters. We say, I want this to be called East DV1. The IP address will come from Puppet itself; that means take the IP address of the machine and use it. In this cluster, I'm also going to install another node called East DV2, so I give it the information to go in the hosts files. And then I just say: this cluster is going to be East DV1, install some SSH keys, download MySQL Connector/J for me, and install the cluster. And it's using a test repo.
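The multi-cloud API layer described above can be sketched as a simple provider abstraction: each backend knows how to provision nodes in its own cloud, and callers always get back the same shape of node records. This is purely illustrative; the class and field names are invented and the providers here are stubs, not real cloud clients.

```python
# Hypothetical sketch of a provider-agnostic provisioning layer.
class Provider:
    def provision(self, count):
        raise NotImplementedError

class AmazonStub(Provider):
    def provision(self, count):
        # A real backend would call the EC2 API here.
        return [{"provider": "aws", "host": f"aws-node-{i}"} for i in range(count)]

class OpenStackStub(Provider):
    def provision(self, count):
        # A real backend would call the OpenStack compute API here.
        return [{"provider": "openstack", "host": f"os-node-{i}"} for i in range(count)]

def provision_cluster(spec):
    """spec is a list of (provider, node_count) pairs; returns one flat
    array of node records, whichever clouds they came from."""
    nodes = []
    for provider, count in spec:
        nodes.extend(provider.provision(count))
    return nodes
```

A deployment like the 15-node, multi-cloud example then becomes one call with a mixed spec, and everything downstream treats the node list uniformly.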
So I have to give it the IP address of that repo, because it's not in DNS. Then I tell Puppet to apply that Puppet module. First of all, it throws up some warnings, because I actually haven't given it a password to use, so it's using the default one. And then it goes through. It's created a group called mysql. It's created the tungsten Unix user. It's started creating directories. It's downloaded the MySQL Connector/J file for me. It's installed NTP and the NTP config file. It's installed wget. It's installed Java now. It's creating the profile files for the tungsten user and configuring sudo for me. It's installed that repository I asked for. It's configured Tungsten. It's configured the hosts file and set the hostname. And when it set that hostname, it actually triggered a refresh of MySQL, because it knows that if I change the hostname in a file, I need to restart MySQL; MySQL doesn't like it when you change the hostname and don't restart it. Then it's probably installing our software now, which takes about 30 seconds, generally, on these machines. Actually, no, sorry, it's started MySQL. It's configured all the MySQL users for me. Puppet will run in whatever order it likes, generally; every time I do this demo, it runs in a different order. You set dependencies within Puppet to force some things to run in the same order, but if you don't set those dependencies, it will just run them in whatever order it fancies. It should normally be just one run of Puppet; that's what we try and get it to do. And that's finished. So it's running; it's got our software running. See, the hostname has changed now as well, so it's configured that for us, and our software is up and running on the machine. Before we used Puppet, it used to take me about an hour to configure each machine, on a good day. And that took, I think, about...
It normally takes about two minutes with Puppet. And what we can do with this is, while one node is configuring, all the other nodes are configuring as well; we're doing it all in parallel. If you lose a node in Amazon, I can normally get one up and running in about five minutes. It depends on the size of the dataset I need to copy across; if I need to snapshot quite a large EBS volume, that can take longer. Where are we going next with this? We want to do instance autoscaling. That, for us, is going to be the next big thing. As we see more and more load on the database, we're going to provision more slaves automatically for you. Our software gives automatic read-write splitting, so it will detect read-only queries and offload them onto the slaves automatically. What I want to be in a position to do is, if I see a lot more read-only queries coming in, spin up new nodes automatically to handle that load, and as that load comes off, destroy those nodes. We already support replication to MongoDB, but we have to configure it by hand, so that's something we're going to build into this. We do the same with Oracle: we can replicate MySQL to Oracle, Oracle to MySQL, and we also do quite a lot of Oracle-to-Oracle replication now. At the moment we're configuring those by hand, so that's all going to be puppetised as well. And we sometimes do Postgres. We used to do an awful lot with Postgres, and we had problems replicating from it. A lot of our customers moved away from it, so we stopped supporting it, but it's something we'll probably get back into; when we find a customer that needs it, we'll dust off the code and start working with Postgres again. I'm doing an awful lot of deployments at the moment on VMware and private cloud, so I want to expand this to work properly with VMware. So, some very painfully learnt lessons from the last year in the cloud. EC2 instances just fail. Get used to it.
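The autoscaling plan above is essentially a sizing decision: given the current read-only query rate, decide how many read slaves to run, bounded below by redundancy and above by cost. This is a hypothetical sketch of that decision; the capacity figure per slave and the bounds are made-up illustrative numbers, not anything Continuent has published.

```python
import math

def desired_slaves(read_qps, qps_per_slave=5000, min_slaves=2, max_slaves=10):
    """How many read slaves to run for the current read-only query load.
    Keeps at least `min_slaves` for redundancy and caps at `max_slaves`
    for cost. All thresholds are illustrative assumptions."""
    needed = math.ceil(read_qps / qps_per_slave)
    return max(min_slaves, min(max_slaves, needed))
```

A control loop would compare this number against the running slave count every few minutes, provisioning new nodes when load rises and destroying surplus ones as it falls off.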
One of anything is never enough, so have two servers. I had a customer complain the other day when I wanted to install two servers; I pointed out to him that with reserved instances it was going to cost him $20 a month extra. He could probably put it on his credit card and expense it somewhere. Don't assume you'll have resources available when you want them. For a lot of our customers, if it's business critical to them, I've actually got instances already sitting there, waiting, running, and we'll just fail over to them when we need them. Think about more than one cloud provider wherever possible. Don't put all your eggs in Amazon's basket. One day Amazon's going to have a big outage, or Rackspace will, or HP Cloud. You need to assume it may happen. Resources are disposable; just throw them away. A key thing for us was monitoring: we make sure we monitor everything now. Automation: we spent probably the last year deploying clouds and then manually configuring servers, and it started to get very time-consuming, which is why all the Puppet stuff came around. Back up everything. Any questions? No? Enjoy what's left of the day; I think there's a party tonight. This is the one event I actually like, because I live only an hour away from here, so it makes a change. That's our contact details if anyone's interested. Enjoy the rest of the conference. We normally monitor centrally. Yeah, generally if we lose contact with a node, we'll assume it's gone and promote something else. We generally find that in cloud environments we don't try and fix things as much as we used to; we fail over somewhere else, and I think that's the right way. That's probably one of the biggest lessons we've learnt: don't spend an afternoon trying to fix something.