In the next 50 minutes or so, I'd like to talk about how we migrated from one cloud provider to another. I am Michele Orselli. I work from Italy, for a company called Ideato. We are a software agency, basically: we help our clients turn their ideas into projects and products. These are my contacts, so feel free to get in touch with me after this presentation, by email, Twitter, whatever.

I'll start with a little story about how we took an eight-year-old code base, which had started causing a lot of problems, wasting a lot of money for our client and showing increasing stability issues, and how we migrated it to a more suitable and stable platform. And in the end, everyone was happy.

Let's start from the beginning. As I said before, I'm Italian, so apologies for my bad accent. And if I start moving my hands like this, that's perfectly fine, I'm fine, don't worry about that. If I fall asleep or fall unconscious, OK, in that case I'm not fine: help me.

But what do we Italians like? That's a pretty simple question. Want to try to answer? Spaghetti, yeah, for sure. La pizza, yeah. La mamma, mommy, maybe not that particularly warm, but yeah, all of these answers are perfectly fine. There is another thing we like, we enjoy: football. You should know who this person is. Yeah? Our client basically runs a site about the transfer market: whether player X is going to play for Napoli, or Higuaín moved from Naples to Juventus, and stuff like that. We go crazy about this stuff, OK? Our government can raise taxes and lie to us, but we're fine as long as we can watch football.

In this particular case the traffic was not really, really big, but there were some important numbers. We experienced some peaks during the year, in particular in January, June and August, when the transfer market is held, and also some spikes during particular matches: big matches, Champions League matches, and stuff like that.

All the infrastructure, all the sites, were hosted on Rackspace Cloud Sites. How many of you know about that? Have you ever heard of it? OK, a few. Good. In the cloud pyramid we can put it here, OK? It is a platform as a service. What does it mean? Well, it means that you don't have to configure basically anything, almost anything. You push the code into the cloud, and you're done. So from the developer, from the ops point of view, you have a black box, OK? You just push the code, you can actually change a few parameters, but not that much, and you are done.

Unfortunately, there are some limitations, like some odd limits on resource usage: you cannot have more than 50 concurrent database connections. You can deploy only through FTP, which, well, by the time we started the migration, was not really fine. And you deploy directly onto a shared NFS folder, which also means that sometimes during deploys the Symfony cache (as I'll show you later, we are talking about Symfony applications) got messed up for, I don't know, no particular reason, and the site crashed. Also, we didn't have access to real-time logs, like access logs, MySQL slow query logs, and stuff like that. And we cannot SSH into the machine. Yeah, because, OK, it's a platform as a service, we are not supposed to do that kind of stuff. Also, we were still on PHP 5.3, which, again, was not that good.

Let me spend a few words on all the sites that made up the platform.
The first one was, and actually still is, the website, which contains all the articles, all the categories, all the custom pages for the teams: basically everything, almost everything. It was also the oldest site, developed in Symfony 1. After that, we have mobile, which is a custom application for mobile devices. Again, we have all the articles, categories and team details, in a different format; we didn't have any kind of responsive design, anything like that. It was a freestyle project: Symfony 2 components, but no Silex, no full-stack framework, just components put together. And, well, it is a small application, a bit messy.

After that, we have the community site. "Vivo per lei" in Italian means "I live for her", her being my team. It's a bit exaggerated, maybe, but it gives you the idea. Basically, on this site people can write their own content: if my team loses the last match, I can write here that I'm angry, and other people can talk with me, complaining and fighting, basically. So, user-generated content. OK. Again, it's a Symfony 2 application.

Then we have some sort of microservices for, well, particular tasks and features. And when I say microservices, I have something like this in mind: micro. These were not that micro, to be honest; they were more macroservices. The first one is Talk. It's a set of APIs for dealing with comments, votes, ratings and stuff like that, and it's used throughout all the applications. This is a screenshot of the web application, so we can see here the user comments, the votes, the ability to answer a comment or to post a new one. And the same goes for the mobile site: here we are seeing the number of comments, the votes, and so on. Then we have ADV, which is a set of APIs for serving banners, basically. Again, they are used both on the mobile site and on the website: here is the mobile one, and here you have banners served on the website. Last but not least, Media. Media is a service used to deal with all the assets, images, both user-generated and editorial, and to process those images: cropping, resizing, that kind of operation. OK, again, same structure.

Here is a picture of how all the parts of this platform interact together. Basically, here we have the macroservices, each with its own database. Then we have the community site, the mobile site and the website, which are the public-facing sites. In front of that, we have Amazon CloudFront, which is a distributed cache, for those of you who don't know it. And yeah, this proxy, basically, is used to allow the mobile site to call the macroservices: mobile doesn't know about these services, so web, in that case, also works as an orchestrator.

OK, so here is where we started. The first problems began with the Talk application in particular. During some big matches or, well, during peaks of traffic, the application started to error, and basically users could not post or read comments. And they started blaming us, because they thought we were banning them in some way, or moderating them. So we tried to find a quick solution. The first one was, well, tuning the HTTP response headers, caching more endpoints and optimizing queries; that was the quickest solution we came up with. Then, using the logs from CloudFront, we found which were the most hit endpoints, and we started caching them. And basically, we started putting snippets of code like this in our application.
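To give an idea of what those snippets look like, here is a minimal sketch of setting shared-cache headers on a hot, read-only endpoint in a Symfony-style controller action. The action name, the data passed in and the TTL values are made up for illustration; the real endpoints and lifetimes differed.

```php
<?php

use Symfony\Component\HttpFoundation\JsonResponse;

// Hypothetical action for a frequently hit, read-only Talk endpoint.
// Shared-cache headers let CloudFront hold the response for a while
// instead of hitting PHP on every request.
function listCommentsAction(array $comments)
{
    // $comments would come from the database in the real controller.
    $response = new JsonResponse($comments);

    $response->setPublic();          // cacheable by shared caches such as CloudFront
    $response->setSharedMaxAge(180); // s-maxage: three minutes, an illustrative TTL
    $response->setMaxAge(60);        // max-age for the browser

    return $response;
}
```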
OK, so this was a temporary solution, but it bought us some time to think about it and to come up with a more complete solution. And basically, we didn't put together a really detailed plan, but the idea was: yeah, we cannot use this infrastructure anymore, we need to migrate to another platform. And in particular, we need to migrate from a platform as a service to infrastructure as a service, because we need more control over what's happening, so that when something goes wrong we can actually understand what's wrong and fix it. OK, we evaluated different solutions, and in the end we decided to migrate to Amazon Web Services, because we had good experience with it and also because part of the infrastructure was already on AWS, in particular the CloudFront distributions.

As I said before, the first candidate for migration was the Talk application. OK, so we laid out a possible architecture. I'm not sure you can read this, but I'm going to explain it for you. Here we have a CloudFront distribution for dynamic content caching. Then we have an Elastic Load Balancer, which accepts requests and decides to which web server to direct each request. And then we have two front-end servers in two different availability zones; every front-end server runs Nginx and PHP. And then we have the database on RDS, which is Amazon's relational database service, and Memcached hosted on ElastiCache. So basically, this was the first architecture we came up with, and we thought it was good enough to allow us to migrate. Moreover, we took this opportunity to bump some package versions: in particular, PHP was migrated from 5.3 to 5.6, MySQL from 5.0 to 5.6, and we switched from Apache to Nginx and PHP-FPM.

In doing that, we faced some problems, in particular when switching from a setup where you basically have only one machine (in the previous setup, on Rackspace, well, maybe there was more than one machine, but everything was shared, so we could treat it as a single machine) to one where we have two machines. So there are different things to take care of. First, the web servers' IPs are dynamic, because they are hosted on Amazon and they change. Second, we can connect only through a bastion host. Third, we need to share user sessions between servers.

The first one: the web server IPs are dynamic, OK, but we can use the AWS SDK to get a description of the servers and filter it to get the IPs. So before every deploy we just fetch the available servers, which may be one, two, three or four, depending on how many are running based on the scaling policy, which I'll show you later. And basically, we're done: we get the IPs, and now we know on which servers we need to deploy. OK, this is a snippet of how to do that with the SDK. In particular, here we are fetching this group of servers, and here we are filtering the information, in particular the private IP addresses, OK? Make sense? It should be, OK. Please feel free to stop me and ask me a question if I'm not clear enough.
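Before moving on to the bastion point, here is roughly what that SDK lookup can look like, a minimal sketch using the AWS SDK for PHP. The region and the tag used in the filter are assumptions; the talk doesn't show the exact values.

```php
<?php

require 'vendor/autoload.php';

use Aws\Ec2\Ec2Client;

// Fetch the private IPs of the running front-end servers before a deploy.
// The tag name/value in the filter is hypothetical.
$ec2 = new Ec2Client([
    'version' => 'latest',
    'region'  => 'eu-west-1',
]);

$result = $ec2->describeInstances([
    'Filters' => [
        ['Name' => 'tag:role', 'Values' => ['frontend']],
        ['Name' => 'instance-state-name', 'Values' => ['running']],
    ],
]);

$ips = [];
foreach ($result['Reservations'] as $reservation) {
    foreach ($reservation['Instances'] as $instance) {
        // PrivateIpAddress is where we reach the instance through the bastion.
        $ips[] = $instance['PrivateIpAddress'];
    }
}

print_r($ips); // the deploy tooling then targets these hosts
```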
Second point: we can connect only through a bastion host. What does that mean? As a security policy, basically, we didn't give direct access to the production machines, so a developer cannot SSH straight into a front-end server. Instead, we connect to another machine, which is called the bastion host, and this particular machine has the rights to connect to the front-end servers. That's OK from the security point of view, but it's not that convenient, and it can become complicated when you want to deploy new code. We solved that by adding some SSH configuration. In particular, here we are saying that we want to connect to the bastion, with this particular IP address, which is a fixed one, using this key; that part is pretty simple. After that, here we are saying that if we want to connect to this pool of IP addresses, which are the IPs of the front-end servers (we don't know exactly what they are, but they are in this range), we need to proxy through the bastion. OK, so basically, this is the meaning of all this configuration. With this configuration in place, we can just SSH to the front-end servers once we know their IPs, and we don't need to do anything else.

OK, last point, shared user sessions. Actually, Talk was, and is, a stateless API, so we didn't have to deal with that at that particular moment of the migration. We'll come back to it later.

Another thing we implemented was the Nginx static cache. Basically, Nginx can create a static file after responding to a request, and can serve this file instead of processing the request again, as another proxy cache, like Varnish, would do. This is the configuration we used. In particular, this directive is important: fastcgi_cache_key. Basically, all these parameters, and all the different values each parameter can have, make a key unique. So here we are saying: for every different value of scheme, request method (GET, POST, PUT, PATCH), host name and request URI, cache a different version of the response. OK, in this case we also cache the response for three minutes, which is quite good. Another part of the configuration excludes some particular methods, because, yeah, we don't want to cache POST or PUT requests, or DELETE either: we set the no-cache parameter to 1 and we are done. Another case, for the Talk API, is this endpoint, /api/queue, the moderation queue, where comments get posted and wait to be moderated by an editor. We don't want to cache this endpoint either, in order to always show the editors fresh comments, not stale ones. What else? OK, here we are seeing the no-cache parameter used to bypass the cache in this way. OK, this is pretty standard, I guess.

We deployed using the symlink trick, let me call it that way. Basically, every deploy goes into a different directory, and then we have a symlink, which is switched at the end of the deploy procedure, in order to have the OPcache cleared automatically after each deploy. For that, we use basically this parameter: we set DOCUMENT_ROOT to $realpath_root, which means the symlink gets resolved, and when we deploy to a new directory and requests start arriving at the new directory, a new OPcache is built. So that's the little trick to have the OPcache cleared automatically.

OK, so all the setup was done. We took the old logs, which were available, not in real time, but available, and we tried to find out which were the most hit endpoints, and we started load testing them. After that, we created AMIs, snapshots of the machines. We deployed the latest version of the code on the new servers, and after that, we switched DNS. And we were done. More or less, actually, because everything exploded.
And we found out that the database was missing some indexes, indexes that weren't present in the Rackspace database either. And here, the problem became quite obvious after, like, ten seconds. So the idea of moving from a platform as a service to infrastructure as a service was good from that point of view. OK, the first part of the application was migrated. We were celebrating: woo-hoo, quick win. The platform now runs on two clouds, Rackspace and AWS.

Two quick words about backups. Remember, this is our architecture, and we perform backups of the front-end machines and of the database: we create a snapshot every night. We don't have a multi-region setup; all of this is in a single region, so if the region fails, we are in trouble. But we just copy the snapshots to another region. That was something we discussed with the client, and we decided it was good enough as a starting point. OK. How can we do that using the AWS API? Here is an example of how to create a snapshot of the database, OK, and how to copy it to another region in this way: RDS CopyDBSnapshot, and then you specify the source, the target and the region, and you're done. After that, we clean up the previously created snapshots, and basically we are done.

OK, second application: ADV. Very quick, because it contains only a small database, really, really small, and it's, again, stateless. All the infrastructure was already set up, so the migration was particularly easy. The only thing we did was to create a new CloudFront distribution for dynamic content, called adv.calciomercato.com, OK, in order to benefit, again, from HTTP caching. Done. No particular problems.

Mobile. OK, the mobile part of the site, well, had a high impact, because almost half of the traffic runs through the mobile site. Now, that was when we started the migration; today it's higher, I think 55% of the total traffic runs through mobile. There is no database, because it consumes data, APIs, from the other services. We needed to face a new problem: how to deal with static assets, OK? When we deploy the mobile site, we have static assets like background images and stuff like that, and in the previous setup, on Rackspace, everything was shared through NFS. We decided to use the object storage from Amazon, Amazon S3, defining a bucket in this way. And we created two CloudFront distributions, again one for dynamic content and one for static content, again for HTTP caching. And when we deploy a new version of the site, using this command we sync the assets to S3: here we're saying, just sync the data from this local folder to this bucket. And again, we are done.

We deployed the mobile site on a sample machine in order to run some tests. Again, using the logs, we tried to figure out which were the most hit pages and performance tested them. We deployed, rebuilt the AMI, and switched DNS, and everything went smoothly. We kept measuring things after the deploy for a bit, like one week, two weeks, and we started to see that there were a lot of requests from CloudFront to S3. It turned out that we had missed a header, which can be added to every static asset in this way: here, basically, we are saying, add this Cache-Control max-age, like one hour, to all these assets. Which is nice, because this means fewer requests from CloudFront to S3, which means fewer dollars you spend on that. Again, it still makes sense.
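Going back to the nightly backup: a minimal sketch of the snapshot-and-copy step with the AWS SDK for PHP. The instance identifier, snapshot names, regions and account number are placeholders, and in practice you wait for the snapshot to become available before copying it.

```php
<?php

require 'vendor/autoload.php';

use Aws\Rds\RdsClient;

// Hypothetical identifiers; the real instance and snapshot names differ.
$snapshotId = 'talk-db-' . date('Y-m-d');

// 1) Create the snapshot in the primary region.
$rds = new RdsClient(['version' => 'latest', 'region' => 'eu-west-1']);
$rds->createDBSnapshot([
    'DBInstanceIdentifier' => 'talk-db',
    'DBSnapshotIdentifier' => $snapshotId,
]);

// (in a real script: poll until the snapshot status is 'available')

// 2) Copy it to another region for disaster recovery. The copy is issued from a
//    client in the *destination* region, referencing the source snapshot ARN.
$rdsDr = new RdsClient(['version' => 'latest', 'region' => 'eu-central-1']);
$rdsDr->copyDBSnapshot([
    'SourceDBSnapshotIdentifier' => 'arn:aws:rds:eu-west-1:123456789012:snapshot:' . $snapshotId,
    'TargetDBSnapshotIdentifier' => $snapshotId . '-dr',
]);

// 3) Old snapshots can then be cleaned up with deleteDBSnapshot(...).
```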
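And on the missed header: one way to attach Cache-Control to an asset when uploading it, sketched with the AWS SDK for PHP. Bucket and key names are placeholders, and the same header can also be set by the sync command at deploy time.

```php
<?php

require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['version' => 'latest', 'region' => 'eu-west-1']);

// Upload a static asset with a Cache-Control header so CloudFront can keep it
// for an hour instead of going back to S3 on every miss. Names are illustrative.
$s3->putObject([
    'Bucket'       => 'cm-mobile-assets',
    'Key'          => 'img/background.png',
    'SourceFile'   => __DIR__ . '/web/img/background.png',
    'CacheControl' => 'max-age=3600',
]);
```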
The next part was Community. As I said before, this site allows users to create their own personal blog: I can write some posts, and people can comment on and vote for my posts, and stuff like that. And it's the first app that can be considered complete, because it interacts with the database and it uses some APIs. It also exposes some APIs for user-related stuff, like single sign-on, or retrieving a user's posts, and so on.

So the problem we had skipped before popped up again: we need to share user sessions between servers. In this case, we used another service from Amazon, Amazon ElastiCache, which is Memcached as a service; you can use it without needing to configure it on your own machine. We didn't have to make big changes in the code: we just defined a new service, which is the Memcached client, adding the different hosts, and changed the session handler in this way. And automatically, all the sessions are now saved to and retrieved from Memcached rather than the local file system. Yeah, exactly.

OK, how to deal with user-generated content? We had found a way to deal with static assets, and now we need to deal with user-generated content, like post images and stuff like that. Again, like before, we created two CloudFront distributions, one for dynamic content, called vxl, and one for static content, called cdnvxl, both under calciomercato.com. OK, in order to work with user-generated assets, we used Gaufrette, which is, I guess, fairly well known. How many of you know Gaufrette? A few? OK, it's a file system abstraction layer, which means that, once you configure it, it allows you to write code in a file-system-independent way. It doesn't matter if I want to save a file on S3, or through FTP, or on the local file system: the code remains the same. I just say, "file system, write this content", and the library takes care of all the work that needs to be done underneath to make it happen. Another good library for this is Flysystem, if I remember the name correctly, from The PHP League; also quite good.

How do we configure Gaufrette in Symfony? You need to define adapters. An adapter basically represents one file system you want to be able to write to and read from. In this case, we are saying: OK, this adapter is called photo_storage, and it uses Amazon S3, with the bucket name defined in this parameter, using this service id. Then you define a file system called, again, the same way (maybe it's not a really clever name): here we are saying the photo_storage file system uses this adapter, and we can also refer to it, in that way, as the photo_storage file system. And the configuration is done. The client, the service we passed before in the configuration, is basically an S3 client, which we create using, well, yeah, the AWS SDK. Here we are saying, create an S3 client with this key and this secret, in this region. The key and the secret are the ones you get when you sign up for Amazon Web Services. The code to write and read assets is, as you can see, file system independent. Here we are saying: on this file system, write this content at this path. And we don't know whether underneath we are writing to the local file system, or to S3, or over FTP, whatever. So this is the important part. And by the way, the PhotoUploader class is the one we use to deal with user-generated content, when a user uploads their own images.

Again, we deployed on a sample machine. We performance tested it based, again, on the logs. We deployed the new code, rebuilt the AMIs, copied all the user assets to S3, and switched DNS.
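For the session part described above, a minimal sketch of what the change amounts to, using Symfony's Memcached session handler in plain PHP. In the project this was wired through Symfony configuration rather than inline code, and the ElastiCache endpoint, prefix and lifetime here are placeholders.

```php
<?php

use Symfony\Component\HttpFoundation\Session\Storage\Handler\MemcachedSessionHandler;

// Placeholder ElastiCache endpoint; the real cluster endpoint differs.
$memcached = new \Memcached();
$memcached->addServer('sessions.example.cfg.euw1.cache.amazonaws.com', 11211);

// Store PHP sessions in Memcached so any front-end server can read them.
$handler = new MemcachedSessionHandler($memcached, [
    'prefix'     => 'cm_sess_',
    'expiretime' => 3600,
]);

session_set_save_handler($handler, true);
session_start();
```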
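And for the Gaufrette side, a minimal sketch in plain PHP of the same wiring the Symfony configuration describes: an S3 client, an adapter bound to a bucket, and a filesystem you write to without caring where the bytes end up. Credentials, region and bucket name are placeholders, and this assumes a Gaufrette version compatible with the AWS SDK in use.

```php
<?php

require 'vendor/autoload.php';

use Aws\S3\S3Client;
use Gaufrette\Adapter\AwsS3 as AwsS3Adapter;
use Gaufrette\Filesystem;

// Placeholder credentials/region/bucket; in the project these come from parameters.
$s3 = new S3Client([
    'version'     => 'latest',
    'region'      => 'eu-west-1',
    'credentials' => ['key' => 'AWS_KEY', 'secret' => 'AWS_SECRET'],
]);

$adapter    = new AwsS3Adapter($s3, 'cm-photo-storage');
$filesystem = new Filesystem($adapter);

// File-system-independent write: swapping the adapter (local, FTP, ...) would
// leave this call untouched.
$filesystem->write('users/42/avatar.jpg', file_get_contents('/tmp/avatar.jpg'), true);
```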
OK, at that point we were at two thirds of the migration: we had migrated four services, and two were left. The next one is web. OK, web was, and is, actually, the oldest and biggest code base: Symfony 1, really a lot of modules and applications. And it's a proxy for all the mobile API calls. And well, as I said before, that percentage is a little lower now, but still, there is a lot of traffic running through it.

First issue: well, officially PHP 5.6 is not supported by Symfony 1. We came up with two different plans. First one: try to patch Symfony 1 to support PHP 5.6. And as a backup plan, deploy web on separate machines with the old PHP version. Fortunately, we found this project, which is basically a fork of the Symfony 1 repository, which is not maintained anymore. This one is actively maintained, and there are also new features, for instance a dependency injection container and Swiftmailer as the default mailer. Actually, we weren't interested in getting new features, so we just picked up the part of the code that enabled us to make Symfony 1 compatible with PHP 5.6. It turned out that there was basically just one small change. In particular, preg_replace doesn't allow the /e modifier anymore, and basically, this is the change you need to make to obtain the same behavior. And that's all.

OK, after that, we created another CloudFront distribution for static content, cdnweb.calciomercato.com. And also, we switched DNS management to Route 53; there is a detailed explanation of why we needed to do that here, but the idea is, well, among other things, it gives us more flexibility in managing DNS for this particular application. OK, what else? Nothing more, actually. We deployed again on a sample machine, performance tested it, deployed like a kata, repeating the same actions again and again and again, rebuilt the AMI and then switched DNS. And web was migrated.

The last part was Media: asset management, cropping, resizing, stuff like that. As I said before, it handles images, thumbnails and the like; it's a pretty big archive, 17 gigs. And, yeah, the APIs are stateful in that case, because in order to perform these operations, we need to be authenticated. This is an example of how this part of the code works. Basically, the idea is: when a site, mobile, web or community, asks Media for an image, Media checks whether a thumbnail already exists for that particular image. If it exists, it serves it directly; otherwise, it creates it and uploads it to S3. Basically, what we do here, let's say we are serving a request for a new thumbnail: we download the original from the File Manager (the File Manager is the part of the Media API that holds the original images), we generate a unique CDN key, we resize the image, we optimize it to save some space, then we upload it to the CDN, we update some metadata, and after that, we delete the temporary files. And, basically, we're done. So these are all the steps involved in creating a new thumbnail for a given image.

OK. We needed to transfer all the old images from the Rackspace CDN to S3. As I said before, the only access we had to Rackspace was through FTP. Yay. So we wrote a nice script to copy all the images. Well, it's basically a long, boring task, and sometimes, obviously, the script and/or the FTP connection dropped. But after a few tries, we made it, and we copied all the images to S3.
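About that preg_replace change: the /e modifier evaluated the replacement string as PHP code and is rejected on modern PHP, and the equivalent is preg_replace_callback. The exact pattern patched in Symfony 1 differs; this is only a sketch of the kind of change involved.

```php
<?php

$input = 'my_variable_name';

// Old style: the /e modifier evaluated the replacement as PHP code.
// Rejected on modern PHP:
//   preg_replace('/_([a-z])/e', "strtoupper('\\1')", $input);

// Equivalent behavior with preg_replace_callback:
$camelized = preg_replace_callback(
    '/_([a-z])/',
    function ($matches) {
        return strtoupper($matches[1]);
    },
    $input
);

echo $camelized; // myVariableName
```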
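And for the Media flow just described, a rough sketch of the serve-or-generate logic for a thumbnail, using GD for the resize and the AWS SDK for the upload. The bucket, keys and metadata step are placeholders, and the real service also optimizes the image and fetches originals through the File Manager API rather than from a local path.

```php
<?php

require 'vendor/autoload.php';

use Aws\S3\S3Client;

// Placeholder names; the real bucket and key layout differ.
function serveThumbnail(S3Client $s3, $originalPath, $width)
{
    $bucket = 'cm-media-cdn';
    $key    = sprintf('thumbs/%dw/%s', $width, basename($originalPath));

    // If the thumbnail already exists, just hand back its CDN key.
    if ($s3->doesObjectExist($bucket, $key)) {
        return $key;
    }

    // Otherwise: resize the original with GD, upload it, clean up.
    $image = imagecreatefromjpeg($originalPath);
    $thumb = imagescale($image, $width);

    $tmp = tempnam(sys_get_temp_dir(), 'thumb');
    imagejpeg($thumb, $tmp, 85);

    $s3->putObject([
        'Bucket'       => $bucket,
        'Key'          => $key,
        'SourceFile'   => $tmp,
        'CacheControl' => 'max-age=3600',
    ]);

    unlink($tmp); // delete the temporary file
    // ...update the thumbnail metadata in the database here (omitted)...

    return $key;
}
```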
And then we found another little problem: there is a little difference in how S3 handles image names. In particular, you need to remove all the strange characters, which in the previous CDN was not a problem, but here it was. So with this snippet of code, we created a slug of each image name, and, well, after that, everything went fine.

OK. So we migrated all the applications. There are a few more things I'd like to discuss, and then, if you have questions, I am available. Monitoring, yeah. Well, because we want to know when things go bad, in order to be able to react in time, we used Amazon CloudWatch for that. In particular, we tracked different metrics, like RDS CPU usage, I/O operations on the database, the Memcached usage and requests, the number of requests on the instances, their CPU usage, and stuff like that. Since there is a two-week retention for this information, we integrated it with our internal Cacti and Nagios monitoring tools in order to have a longer retention of this data. And also, we integrated it with Slack, so when something goes wrong, we are notified directly on Slack.

Autoscaling. One of the nicer things about using Amazon Web Services is that you can decide, based on some rules, when to add or remove servers. For our use case, we have two autoscaling groups in our setup, and we defined three different metrics: CPU usage, response time, and number of requests per second. In particular, we decided that when the CPU usage goes over 70%, or when the response time goes over 100 milliseconds, or when the number of requests per second goes over 10,000, we spin up a new machine. And when these three metrics come back to normal, we just drop the extra machines. In this way, we can handle, in a nice way, the peaks during, let's say, January, June and August, and also the peaks during big matches.

OK, there are a few more things I'd like to say. The full migration took us one year. Actually, it wasn't a full-time job, because this was just one client of ours, but yeah, it took one year of time. And it's something we did gradually; we didn't want to rush this thing. We would migrate one part, measure for one or two weeks how things were going, and then plan and migrate the next part of the application onto the new infrastructure. And after some measurements, we found that we got a 50% cost reduction for our particular case, which is, well, quite good. Let me be clear: this talk is not about blaming Rackspace. It simply wasn't suitable anymore for our needs. Maybe it was fine when the project started, but at that particular scale, with our particular needs, it wasn't suitable anymore.

OK, so this is the final architecture. Here we have, today, normally, two front-end web servers in different availability zones, and again RDS and Memcached. On top of that, a load balancer and all the CloudFront distributions. DNS is managed through Route 53, and all the assets, both user-generated and static, are on S3.

OK, what else? The macroservices helped us a lot, because, well, they were macro, but not as big as, let's say, web or community, and that allowed us to migrate in an incremental way. If we had had, for instance, only one big monolith, well, it would probably have been much more difficult to migrate. Also, the HTTP cache helped us a lot, buying us time in the beginning, and helping us save money by reducing the number of requests from CloudFront to S3, but also from CloudFront to the front-end machines. It's important to try to measure when you make changes, to see whether your changes are having a positive effect or a negative one.
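One small detail from earlier in this part: the image-name cleanup. Here is a minimal sketch of the kind of slug function we mean; the actual snippet in the project may have looked different.

```php
<?php

// Turn an arbitrary image name into an S3-friendly key:
// transliterate accents, drop anything that is not alphanumeric,
// and collapse the leftovers into dashes.
function slugify($filename)
{
    $ext  = pathinfo($filename, PATHINFO_EXTENSION);
    $name = pathinfo($filename, PATHINFO_FILENAME);

    $name = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $name);
    $name = strtolower($name);
    $name = preg_replace('/[^a-z0-9]+/', '-', $name);
    $name = trim($name, '-');

    return $ext !== '' ? $name . '.' . strtolower($ext) : $name;
}

echo slugify("Gonzalo Higuaín (Juventus)!.JPG"); // gonzalo-higuain-juventus.jpg
```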
Since the migration ended some months ago, we have continued on that path. We started rationalizing the API calls, trying to make fewer calls in general: for instance, when loading comments in the previous version of the site, four or five requests were issued, if I remember correctly; now we have only one. We reorganized some front-end stuff, trying, again, to download fewer assets, using sprites and things like that. And we started to get rid of the old Symfony 1 application, again migrating it incrementally to a Symfony 2 one. The next steps are upgrading to PHP 7, which is something we have actually already started to do. And after that, one thing we might want to try, because it looks promising, is switching the load balancer from ELB to ALB, the Application Load Balancer, which supports several new features. One of these is HTTP/2 support, which looks really, really promising from the performance point of view.

OK, I'm done. One last thing: if you want to join us in Verona in May (this is Verona), there is a really, really nice conference, which is PHP Day, and the days before PHP Day there is also the JavaScript Day. They are two community conferences, international ones, actually, so there are a lot of international speakers. Sorry. Sorry about that. Well, if you want to join us, it will be fun. Again, thank you for listening. These are my contacts, and this is the talk on Joind.in. And if you have questions, I'm here. Thank you.

Any questions? Yeah. "It sounds like the Talk application was the one with the most volatile data, like users changing it, adding blogs, that kind of stuff. Was that the most difficult one to migrate, because you couldn't control when comments were being added and things like that? How did you ensure that you switched over without losing data?" How to switch without losing data? The Talk one, in particular? Yeah. Basically, when we started the migration, we put the API in read-only mode, so for a short period of time people could not post new comments, only read them. And well, after the migration, we restored the read-write mode, and basically, we were done. Also, we planned the migration for early in the morning, when the traffic is usually low, so we didn't experience a lot of problems. Still, some users got angry, even though we put a message on all the sites saying we were going to perform maintenance. But we didn't lose data.

Another question? Yeah? "OK, so first, thank you for a great presentation." Thank you. "And second, you said that you are using CloudFront, and in CloudFront you are able to define different origins, but also something called behaviors, so that based on the path you are able to send the traffic to another origin. When you have two or three behaviors, it's quite easy, but when the number of behaviors increases to, I don't know, 20 or more, it's hell to deploy the changes. So my question is: do you have some way of automating CloudFront deployments, changes to the CloudFront configuration?" No. We have some of those behaviors you are mentioning, for instance for dealing with some paths on the mobile site, but there are not that many; if I remember correctly, there are four or five different behaviors. And no, we don't have an automatic way to change that or to deploy that.

Other questions? OK. Thank you, everyone, for listening.