The first thing I'd like to do is apologize for the sensationalist title of my presentation. I have an alternate title, but I'm afraid it lacks a little pizzazz. This is what I'll be talking about: what we learned about the cloud by building and scaling a corporate intranet.

I realize "cloud" is a bit of a buzzword, but all I mean by it is the idea of having external infrastructure and easy commissioning of new servers. For example, if you need video encoding in your app, you don't have to build that yourself; you can offload it to a service like Panda. And easy commissioning means that instead of taking days or weeks to get a server into your server room, you can have a machine responding to pings in no time at all. Both of these things lead to less maintenance and allow you to focus on what makes your app different.

But sometimes deploying apps to the cloud can bite you. When you deploy a web app or a Rails app to the cloud, you still have to deal with the normal scaling problems of a Rails app; there's nothing different there. But a service-oriented solution in the cloud can often mean you have to deal with a whole new class of problems around distributed services, and those are what I'll be talking about.

The structure of the talk basically follows the growing pains of the corporate intranet we built, through its alpha, beta, and final release stages, and then a bit about where we might be going with it in the future. It's broken into four parts, and at the end of each part I'll talk about a lesson we learned along the way. Together these form general rules of thumb for distributed web applications.

This is a phrase borrowed from the Perl community: TIMTOWTDI, "there's more than one way to do it." I'm sure a lot of you have built distributed web apps and have come across different problems and different solutions, and I'm going to try to compile a checklist or a best-practices PDF and release it to the community, just a list of things to look for when you're building your web app and putting it in the cloud. So really I'm looking for something more like this: TIMTOWTDIBSCINABTE, which according to Wikipedia is a related acronym, "there's more than one way to do it, but sometimes consistency is not a bad thing either." If you can help bring a bit of consistency to this, you can contact me on Twitter or at that email address.

Okay, so meet Mint, the corporate intranet. We started building this a little over a year and a half ago for a client. It's basically an intranet: employees log into it, they can see the latest internal news, they can download documents they need, and it's got a bunch of other functionality as well, such as a forum where they can talk to the CEO and things like requesting annual leave.

From the start we knew it was going to be hosted in the cloud, so I'll give you a quick system diagram sketch to show how this is a distributed app. The intranet itself is a Rails 2.3 app hosted on Heroku, and as I said, it pulls in data from a number of different services. Our client uses Google Apps, so they have information stored in Google Contacts and Google Docs that the intranet needs in order to run. They also need information from the HR server. The HR server is a machine that sits inside their firewall with a third-party API on it that we don't control, which is why I've colored it a different color here.
But the HR server talks SOAP, and we wanted the intranet to talk to a RESTful service, so we developed an HR interface that sits between the two. Mostly this just converts SOAP to REST, but it does have a couple of other responsibilities as well. The HR interface is an app deployed on an EC2 instance.

Finally, as I mentioned, our client uses Google Apps. They use Google Sites for a number of different things and they wanted the intranet to be on that as well. So we had to come up with a way of using our knowledge as Rails developers to build a Rails app that could serve content into a Google Sites page, and the way we did that was using what Google calls gadgets. These can be JavaScript and HTML components that run on the page, or they can just be an iframe that loads content from another location, and that's what we did: the Google Sites intranet page is just a bunch of iframes that display views from our Rails app running on Heroku. One final bit of complication was that we had to authenticate the user who had logged into Google Sites, so we had to do an OpenID authentication between the intranet and Google Sites.

So it's not a very big app, but it does have a couple of services that it depends on, and we knew from the start that we were going to have to build it to scale. We did all the normal scaling things you do with a Rails app: eliminating N+1 queries and speeding up database access in general, page, fragment and action caching, and offloading anything we didn't need in the request cycle into a background job. All of these things were basically about speeding up the request cycle, and there were other best practices too, such as error notifications via Hoptoad and monitoring via New Relic.

We hacked on this for a few months and were ready to get some early feedback from users, so we brought eight actual employees into the office to fire up Mint for the first time. They went into the boardroom, opened up their laptops and hit Mint, and we were watching New Relic: the load went through the roof and the app was just unable to cope. Heroku has the concept of dynos, which are web processes, and you can spin up more dynos to scale horizontally. We had eight users in the boardroom and we ended up having to turn on 24 dynos just to make it performant. That's probably not very scalable.

The reason for this, we quickly realized, was that, as I mentioned, the Google Sites intranet page is just a bunch of iframes. The homepage had about 10 or 12 iframes, and that meant 10 or 12 separate requests to the intranet every time someone opened the homepage. So the load was much larger than we expected, but it's still not huge. It was compounded by the OpenID authentication I mentioned: every single one of those requests, every separate iframe, was doing OpenID authentication, and those OpenID requests were taking in the range of four or five seconds each. So each user was making about 12 of these four-to-five-second requests, and eight users were doing that all at once; it's no wonder the app couldn't cope.

This was fixed quite simply with the idea of Ajax polling. Instead of doing those 10 or so separate OpenID requests, we gave one of the iframes on the page the responsibility of doing the heavy OpenID request, and the others did a very light Ajax poll to the server to see whether someone had logged in, whether there was a user in the session.
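To make the polling side concrete, here's a minimal sketch of the kind of lightweight session-check action the other iframes could poll; the controller, route and JSON payload are hypothetical illustrations rather than the actual Mint code.

    # Hypothetical sketch: a cheap session-check endpoint. Only one
    # "designated" iframe performs the full OpenID round trip; the rest
    # poll this action with Ajax until a user appears in the session.
    class SessionsController < ApplicationController
      # GET /session/status
      def status
        if session[:user_id]
          render :json => { :logged_in => true }, :status => :ok
        else
          render :json => { :logged_in => false }, :status => :unauthorized
        end
      end
    end

Each non-designated iframe hits this every second or two and only loads its real content once it gets a 200 back, so the expensive OpenID dance happens exactly once per page view instead of once per iframe.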
And that leads to the first, very general rule of thumb with distributed apps: reduce the number of HTTP calls. They're very expensive, and you need to keep them down.

So we survived alpha testing and were able to launch to a few beta users across the country, and we quickly saw that annual leave was one of the most used parts of the site. Historically, for our client, requesting annual leave meant picking up the phone, calling the HR department, finding out how many days you had left, and asking whether you could have a certain week in July off. The HR department would send an email to a manager for confirmation, and if there were conflicts with the dates it could carry on like that for a few days. So it's no surprise that when we automated this, providing all that information and the ability to request annual leave directly from the web page, it was going to be popular.

(Sorry, this is just the part of the system diagram that shows the communication with the HR server: if we want to retrieve any information we have to go down to the HR server and pull it back to the intranet, and submitting an annual leave request goes from the intranet down.)

This all worked away fine for a week or two, and then one afternoon we suddenly saw a load of Hoptoad notifications in our inbox. It turned out that the HR server at the bottom here had been upgraded and its API had changed. It was quite a simple change, they just took two fields and switched them, but our code didn't know how to handle it; we had a regular expression parsing that part of the feed. From the start, though, we had decided to wrap all our external API queries in wrappers. So when the HR server changed, we only had to go into our wrapper, which just lives in our lib directory in Rails, and fix the regular expression there, and the fix obviously applied across the rest of the site. That's just the general engineering principle of separating out the things that change.

But there was one other problem here. Each of those Hoptoad notifications was a live user who had seen an error message on their screen, and we didn't really want users to experience an error before we found out about it. So we decided to develop a set of integration tests. These integration tests query the HR API and run a bunch of assertions against it every night. If anything is out of place, if the format is different or something is missing or there's a significant change in the number of employees in the feed, for example, we know something is up, we have the email in our inbox at 8 o'clock in the morning, and we can fix it and deploy, possibly before a user would even find out about it.

So this brings me to the second point about developing service-oriented apps: be prepared for your APIs to change. They will change. So code in such a way that fixing that will be easy.
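As an illustration of what we mean by a wrapper, here's a minimal sketch. The class name, URL, and line format are invented for this example; the real HR API is SOAP and rather more involved.

    require 'net/http'
    require 'uri'

    # Hypothetical wrapper around the third-party HR feed, kept in lib/.
    # The rest of the app only ever talks to HrFeed, so when the upstream
    # format changes, the fix lives in exactly one place.
    class HrFeed
      ENDPOINT = URI.parse(ENV['HR_FEED_URL'] || 'http://hr.example.com/employees')

      # Parses lines like "SMITH, John - 0123 456 789" into plain hashes.
      # When the HR server swapped two fields, only this pattern would change.
      LINE_FORMAT = /\A(\w+), (\w+) - ([\d ]+)\z/

      def employees
        raw = Net::HTTP.get(ENDPOINT)
        raw.split("\n").map do |line|
          if line =~ LINE_FORMAT
            { :surname => $1, :first_name => $2, :phone => $3 }
          end
        end.compact
      end
    end

The nightly integration tests then just exercise a wrapper like this against the live service and assert on things like the field format and the number of employees returned, so a silent upstream change shows up in our inbox rather than on a user's screen.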
So after we fixed a few problems, we were ready to roll the intranet out to a wider audience, and along with annual leave we saw that this page was quite popular. It's basically a directory of contact information for everyone across the organization; they had nothing like it in the past, so this was something they obviously needed and were very happy to use. People pages pull in information from the HR API as well as the Google APIs, both Google Contacts and Google Docs, so the intranet is making quite a number of HTTP calls here.

It's not feasible to build the people page that way. Say we're displaying 20 people on that page: that means 20 calls to the HR API and 20 calls to each of the Google APIs. That just doesn't work. One point I'd like to make here is that if we had been developing the HR server internally, maybe we would have had the chance to build saner queries, such as an aggregate call that asks for all the information about those 20 users at once instead of asking for each user's data individually. But this was out of our control.

So what we decided to do was simply sync the data to our local database every hour. This isn't the ideal thing to do: the data can be an hour out of date, and some of the merging rules were quite complex, so this is probably the most tested area of our site. But with people data it didn't really matter if your phone number was out of date by an hour; you'd ask HR to update it and it wouldn't change for an hour. This is just the first rule of thumb I talked about, reducing the number of HTTP requests. We do one sync in a background cron job on Heroku every hour and serve the people pages from the local database.

But we noticed that the hourly cron job was sometimes taking longer than an hour to finish, and it was quite variable in how long it would take. This often happened when the HR service was down or slow. The Net::HTTP library in Ruby, which is what we were using to make the requests, has a default timeout of 60 seconds. We figured that if a service isn't going to respond within 10 seconds, or maybe even five, it probably isn't going to respond within 60 either, so we could reduce that timeout. The code below shows how you can do that: on the request, we just set the read and open timeouts. We keep these in variables populated from environment variables, so that if we do need to change them, if some of the services are legitimately slower, we can just set the environment variable and restart the app.

But what happens when your call times out? You can choose to just let it time out and fail, but probably what you want to do is retry. If the HR server is struggling under load, though, you don't want to blindly retry the connection. What you want is something like an exponential back-off strategy: if it fails, try again in two seconds; if that fails, try again in four, then eight, then 16, with the time between retries getting exponentially larger until you cut it off at some point. The second piece of code below isn't very nice, but I thought it would help in explaining exponential back-off. The idea is that we keep track of the number of times we've tried and the maximum number of retries, and we rescue an error in the get-employee-feed call. We check whether we've reached the maximum number of retries, and if we haven't, we call the retry keyword after a sleep, which takes you back to the begin block and continues on. It's just a naive implementation of exponential back-off, and it gives the server a chance to recover. Obviously, with the sleep statement in there, it's not speeding anything up if you want to do things concurrently.
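Here is a rough reconstruction of the two pieces of code just described: configurable Net::HTTP timeouts, and the naive exponential back-off. The variable names, default values and the exact list of rescued errors are illustrative assumptions rather than the actual Mint code.

    require 'net/http'
    require 'uri'
    require 'timeout'

    # Timeouts configurable per environment rather than hard-coded.
    HTTP_OPEN_TIMEOUT = (ENV['HTTP_OPEN_TIMEOUT'] || 5).to_i
    HTTP_READ_TIMEOUT = (ENV['HTTP_READ_TIMEOUT'] || 10).to_i
    MAX_RETRIES       = 4

    def get_employee_feed(uri)
      http = Net::HTTP.new(uri.host, uri.port)
      http.open_timeout = HTTP_OPEN_TIMEOUT   # seconds to wait for the connection
      http.read_timeout = HTTP_READ_TIMEOUT   # seconds to wait for a response
      http.get(uri.request_uri).body
    end

    # Naive exponential back-off: wait 2, 4, 8, then 16 seconds between
    # retries, then give up and re-raise so the caller can decide what to do.
    def fetch_feed_with_backoff(uri)
      attempts = 0
      begin
        get_employee_feed(uri)
      rescue Timeout::Error, Errno::ECONNREFUSED, Net::HTTPBadResponse
        attempts += 1
        raise if attempts > MAX_RETRIES
        sleep(2 ** attempts)
        retry
      end
    end

If the retries are exhausted, the exception propagates, and the caller can decide whether to abort the whole sync or fall back to cached data, which is where the next point comes in.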
As well as that, if Google Contacts was down or not responding, we didn't want the whole sync to fail, so we isolate components: one failing service shouldn't stop everything. And sometimes the idea is simply to fail and fall back to the cache. Whether you can do that is very data-dependent, but in our case, with people pages, it didn't really matter if the data was three or four hours, or maybe even a day or two, out of date, so you can just fail and go to the cache.

So this leads to the third rule of thumb: expect services to fail, and make sure you code to catch exceptions and handle those failures gracefully.

So that was the intranet, rolled out to the entire company. With these few fixes we didn't have to run 24 dynos for all of our users across the country; we could run it on far fewer. We did experience many more problems related to the distributed nature of the application, but mostly they fell under those three rules: reduce HTTP calls, handle change, and handle failure.

Lastly, I just want to talk about some other considerations. These haven't been implemented in Mint yet, but they're on the books for future parts of the intranet.

The first one is HTTP caching. Any service-oriented app needs to do some caching, and in a RESTful environment that means HTTP caching. It's done through three HTTP headers: Cache-Control's max-age, Last-Modified, and ETag. I'll give you an example of how the ETag caching works. We have the intranet, and it wants to get user information from the HR service, so it simply does a GET for user 23. The HR service loads user 23 out of the database and generates an ETag. I'm sure most of you are familiar with it, but if not, an ETag is just a hash of the user data that should change whenever the user data changes. The HR service generates this ETag and sends it back along with user 23's data, and the intranet stores the ETag. The reason it stores it is that later on, when it requests user 23's data again, it can send the ETag along with the request. The HR service loads up user 23, generates the ETag again, and compares it with the ETag that was sent in the request. If they match, there's no point sending all of user 23's data back over the wire again, so it just sends back a 304 Not Modified response with no body.

With user data this probably doesn't make a lot of difference; the HR service still has to load the user from the database and generate the ETag. But if you were talking about generating an expensive report that was maybe two or three megabytes in size, not having to send that over the wire is obviously a benefit.

The first bit of code below shows HTTP caching on the service side. In Rails you have the expires_in method, which just sets the max-age header. Then there's the stale? method: you pass in etag and last_modified parameters and it checks whether the user is stale based on them. If the user is fresh, the if statement doesn't continue and Rails just sends back 304 Not Modified. If it is stale, it continues into the if statement and does the respond_to block. You're not saving a whole lot here, but imagine if inside that if statement you had something quite expensive, some processing you had to do on the user object. By checking whether the user is stale before going into it, you can save processing time as well as bandwidth.

The second bit of code is the client side, the intranet in our example; I haven't tested it, I just wrote it last night, so I wouldn't run it, but it gives you the idea. If the intranet wants to get a user, it loads the existing user from its own database and checks the max-age against the time it last fetched. If the max-age hasn't been reached, it doesn't bother making an HTTP request at all. If it has been reached, we need to do the HTTP request: we set the ETag header, which is If-None-Match in HTTP, and the last-modified header, which is If-Modified-Since, and pass those in as headers on the request. Then at the end of the method, if the response code is 304, we can just return the user we already have; we don't have to do any further processing. But if the user has been modified, we take the data that was sent back to us and update the attributes of the user.
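Here's a sketch of that service-side code in Rails. It assumes a conventional UsersController on the HR interface; the 10-minute max-age is an arbitrary choice for the example.

    # Hypothetical Rails controller on the service side (the HR interface).
    class UsersController < ApplicationController
      def show
        user = User.find(params[:id])

        # Tell clients they may reuse their copy for 10 minutes without asking.
        expires_in 10.minutes, :public => false

        # stale? compares the ETag / Last-Modified computed here against the
        # If-None-Match / If-Modified-Since headers sent by the client. If
        # they match, Rails renders 304 Not Modified with no body and the
        # block is skipped entirely.
        if stale?(:etag => user, :last_modified => user.updated_at)
          respond_to do |format|
            format.xml { render :xml => user }   # only built when actually stale
          end
        end
      end
    end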
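And here's a rough sketch of the client-side conditional GET, in the same spirit as that untested slide code: the cached User model, its etag, last_modified and fetched_at attributes, the URL, and the update helper are all hypothetical.

    require 'net/http'
    require 'uri'

    MAX_AGE = 10 * 60  # seconds, mirroring the service's max-age

    def fetch_user(id)
      user = User.find_by_remote_id(id)

      # If our copy is still within max-age, don't make an HTTP request at all.
      return user if user && user.fetched_at > Time.now - MAX_AGE

      uri = URI.parse("http://hr.example.com/users/#{id}")
      request = Net::HTTP::Get.new(uri.request_uri)
      # etag and last_modified are stored exactly as the service sent them.
      request['If-None-Match']     = user.etag          if user
      request['If-Modified-Since'] = user.last_modified if user

      response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }

      if response.code == '304'
        user  # not modified, so our cached copy is still good
      else
        # Modified (or first fetch): update the local copy from the response.
        user ||= User.new(:remote_id => id)
        user.update_from_response(response)  # hypothetical helper
        user
      end
    end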
So that's HTTP caching. There are also webhooks. These are quite a simple concept, so I'll give a quick example. Again we have the intranet and the HR service, and the intranet wants the HR service to generate quite an expensive report; we'll say it needs to find all sick days taken by all employees on Mondays and Fridays for the last 10 years. So it calls the generate-report URL, but it passes in a callback URL, which tells the HR service to generate the report and then just send the data back to foo. The HR service takes this, stores the callback, and does the work on the report. That could take a long time, minutes or hours or days, but there's no open HTTP request, so it doesn't matter how long it takes. When it's finally done, it loads the callback from wherever it stored it and sends the data back to that URL. And that's webhooks.

So HTTP caching and webhooks demonstrate a fourth rule of thumb, which is to reduce the time and cost of HTTP calls.

And that's basically it. These are the four rules of thumb we came up with on the distributed side of scaling Mint. As I said, if any of you have other rules of thumb to add to this, or any other code examples to share with me, please do. I'd just like to thank these people for working on Mint and for helping with the presentation. And finally, any questions?

It seems at the start you had the early database optimizations, but then it was just HTTP afterwards. Do you still have database issues, or is it purely HTTP?

Yeah, so we do still have database issues, but nothing that's causing the app enough pain that we're spending time to fix it. In a couple of cases the admins have the ability to generate reports which are very database-intensive, and these can cause a spike and tie up a dyno on Heroku for quite some time. So we've had ideas of maybe splitting the app, having the user-facing functionality in one app and taking the admin functionality into a separate app, with communication through some sort of queue or something similar. But as it is, it's not really causing pain. Those problems haven't gone away, but they're not as bad any more.

Sure. Could you give us a little detail about figuring out how much it was going to cost to run the app?

So we knew the number of people that were going to use the app, so we had rough ideas of how much it would be used. But I think all the calculations we did early on kind of went out the window once it was actually being used. We used New Relic quite a bit to monitor load and upped the number of dynos on Heroku as needed, and we modified that number over a period of weeks and months until we got to a level we were happy with. So we did try to predict this, but the predictions turned out not to be very good, and we just reacted after the fact.

Okay, thank you very much.