Today I'm going to talk about the lessons I learned moving a service from physical hosts, bare metal, to Mesos. I'll give a brief overview of what Mesos is, what its benefits are, how I did the migration, and the lessons I learned after the migration was done.

This is how most operations teams run a cluster today. They start with a bunch of machines or virtual machines and statically provision them into certain roles: for example frontend, cache, Hadoop. This is called static partitioning. You assign a set of machines to a static role, and they run a static set of services; the frontend machines, for example, will always run Apache or a Rails application.

There are some problems with this. One of the biggest is unequal load distribution across the machines. Your frontend servers might be busy with high traffic from 8 a.m. to 12 p.m., while your cache servers use a lot of memory but not much CPU. And even though you have spare CPU on your frontend servers off-peak, it's not easy to reassign that CPU to your Hadoop jobs or your Rails application or your other services.

It's also slower to add capacity in this setup. If you're expecting a spike in traffic, say for New Year's Eve or a Diwali event, and you want 10% additional capacity ready for it, you end up bringing in two or three more machines, plugging them in, setting up the hardware, getting them ready. Adding capacity becomes a matter of days (maybe two or three days if you have the process down really well) instead of a matter of minutes or hours.

Finally, static partitioning is not fault tolerant: the mean time to recover from issues is higher.
For example, if a rack outage takes out two or three of your web servers, someone gets paged, comes online, brings the servers back up, and makes them available for use again. So, is there a better way to improve this situation? Instead of thinking in terms of static partitions or silos, where a certain server only ever runs Rails or an Apache web server, can we say: this server is not being utilized at its peak capacity, so maybe I can run some additional services on top of it, and do that dynamically? The question becomes: do we think in terms of numbers of machines, or in terms of resources like CPU, memory, and disk?

This is where Mesos comes in. Mesos lets you treat all the machines in your data center as one big resource pool. If you want to run a service (and by service I mean anything: an Apache daemon, a cron job, a batch job), you request resources from Mesos, namely CPU, memory, and disk, and then start the service. Mesos takes care of finding where it should run, matching whatever constraints the service has, starting it up, and monitoring it. Mesos also lets you run different types of workloads on the same shared pool, so you don't need one specific machine architecture to run, say, Hadoop and another to run your Rails application. And Mesos provides failure detection: if you lose two or three of your frontend instances to a rack outage or power outage, Mesos can detect the failure and restart the services, which reduces the mean time to recovery a lot.
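To make "requesting resources" concrete, here is a minimal sketch of an app definition submitted to Marathon, one of the schedulers that runs on top of Mesos. This is illustrative only, not the setup described in this talk: the job name, resource numbers, command, and URL are all hypothetical.

```python
import json
import urllib.request

def make_app_definition():
    """Build a Marathon-style app definition: you declare the resources
    you need (CPU, memory, disk) and how many instances you want, and
    the scheduler finds machines in the pool to run them on."""
    return {
        "id": "/frontend/web",          # hypothetical job name
        "cmd": "java -jar server.jar",  # what to run inside the container
        "cpus": 2.0,                    # CPU cores requested per instance
        "mem": 4096,                    # memory in MB per instance
        "disk": 2048,                   # disk in MB per instance
        "instances": 20,                # the scheduler decides placement
    }

def submit(app, marathon_url="http://marathon.example.com:8080"):
    """POST the definition to Marathon's REST API (illustrative URL)."""
    req = urllib.request.Request(
        marathon_url + "/v2/apps",
        data=json.dumps(app).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

Note that nothing in the definition names a machine; that is the point of treating the data center as one resource pool.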
And even though you're running a multi-tenant workload, different types of services on the same set of machines, Mesos isolates the services from one another.

So what are the benefits of Mesos? The one touted most is better resource utilization: you can run a multi-tenant workload, so you don't need to assign specific machines to specific jobs. Underneath, Mesos uses containers rather than virtual machines, which is much less resource hungry. It also provides fault tolerance: if a job fails, for example it has a memory leak and runs out of memory, or it core dumps, Mesos takes care of restarting it. It lets you grow and shrink jobs dynamically: if you're getting ready for a spike in traffic, you can request more resources and more instances from Mesos and spin up the services. We have also seen our deploy time go down. Mesos lets you run multiple versions of the same package in the same environment, so you can have version one and version two running side by side in production and continue a deploy without them interfering with each other. It also makes rollbacks easier, which has helped our deploys as well.

That was a brief overview of Mesos. The next part is about the service I migrated from standalone hosts to Mesos. The service I moved is called t.co. It's the URL shortening service that Twitter provides: if you tweet out a URL, it gets converted into a t.co link, and that's what shows up in your tweet. The service was running on around 20 servers across multiple data centers. We migrated it to 20 jobs running on a shared pool of machines, shared with other teams.
Here's a brief overview of how we did the migration. One of the first things we realized is that the service would now run in a container, on a machine that is out of our control. So we had to package up everything we were going to need: we could not assume any third-party libraries would be present on the machine, and we could not assume access to any system-level directories. The packaging part was fairly easy for us. t.co is a JVM service, so we could bundle the jar and all the configuration files we needed into one package, along with any post-install scripts t.co would run when it started inside Mesos. (Some teams, instead of packaging their configuration with the package, push their configuration to a data store, a key-value store or a database, and pull it in when the service starts up.)

Once we had it packaged, we deployed the service to a Mesos cluster, started a few instances of the job, and began testing. The testing was stress testing: we sent a lot of traffic to the Mesos job and compared its performance against our production setup. The metrics we chiefly looked at were the total queries per second the cluster could handle, the average latency, and the tail latency. We also watched for core dumps, and for whether Mesos was restarting the service frequently. Once we had tested the cluster and were confident in its performance, we started sending production traffic from our load balancer to the Mesos cluster.
We were pretty conservative in how we moved the traffic over: in steps of 1%, 2%, then 10%, 50%, 100%. Even once we were running completely on Mesos, we watched the metrics carefully: we kept an eye on the latency numbers and on whether the service showed any core dumps. We also watched user-reported problems, whether users were reporting broken links or anything like that on Twitter. That's how the migration was done.

So did we get the benefits Mesos promised? We did. We went from 20 dedicated boxes to running on a shared pool, and the operating cost of the bare metal was now split across multiple teams, which brought down our costs. It also reduced our operational work: we no longer had to worry about things like SSL or glibc upgrades, because those were handled by the dedicated Mesos site reliability team Twitter has. We didn't have to worry about network maintenance either. If we knew some switches in the data center were being upgraded, the machines behind those switches could be shut down, the jobs would get migrated to servers that weren't being touched, and things would keep running without affecting us. It also definitely improved our deploys, and that in turn improved the iteration speed the engineering team has for t.co: they could run multiple versions of the same package, compare one version against another, and run small experiments without affecting our users. That is how we migrated t.co to Mesos.
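A staged traffic ramp like the one described above can be sketched as a simple control loop. This is an illustrative outline, not the tooling we actually used; the thresholds, metric names, and health criteria are made up for the sketch.

```python
RAMP_STEPS = [1, 2, 10, 50, 100]  # percent of traffic sent to the Mesos pool

def healthy(metrics, baseline, p99_slack=1.2):
    """Hypothetical health check: the Mesos pool must keep its error rate
    near zero and its p99 latency within 20% of the bare-metal baseline."""
    return (metrics["error_rate"] < 0.001
            and metrics["p99_ms"] <= baseline["p99_ms"] * p99_slack)

def next_step(current_pct, metrics, baseline):
    """Advance to the next traffic percentage only if the pool looks
    healthy; otherwise fall back to 0% and investigate."""
    if not healthy(metrics, baseline):
        return 0
    higher = [step for step in RAMP_STEPS if step > current_pct]
    return higher[0] if higher else 100
```

The key design point is that each step is gated on observed metrics (latency, errors, user reports) rather than on a fixed schedule.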
Now I'm going to talk about the issues we saw once the migration was done, and the lessons I learned from them. The first thing we saw was that services using t.co reported sudden spikes in latency; they were seeing timeouts from the t.co service. Initially we thought this was an intermittent issue, but it kept happening. When we looked deeper, we realized it came down to how Mesos does resource isolation. On Linux, Mesos uses control groups (cgroups) to control access to the CPU. When a job starts up, it requests a certain amount of CPU, and Mesos allocates it a quota of CPU time per time slice; you might get, say, 50 milliseconds of CPU time every 100 milliseconds. If you use up your whole allocation in the first 10 milliseconds of the slice, the job is frozen for the next 90 milliseconds. In our case, we had not accounted for the CPU cycles going to garbage collection. Once we increased the CPU quota we requested from Mesos at startup and added more instances of the t.co service, the problem cleared up.

The next issue came from capacity planning. We noticed that the maximum traffic, the maximum QPS our cluster could handle, was around 5 or 10% lower than we had expected. To come up with the initial capacity plan, we had used a very simplistic approach: we took one t.co node and stress tested it, sending it a lot of traffic.
At the point where it stopped serving requests, we said: okay, this is the maximum QPS one instance can handle. Then we just multiplied by the number of instances we were running. We had 20 instances, so we said 20 × 100, and the whole cluster should handle 2,000 queries per second. Obviously that's not what happened. What we realized is that we were now running our service on a heterogeneous pool. There was variation in the CPUs: even though they were all Intel 64, different versions of the CPU had different throughput. So we had to change how we do capacity planning. What we ended up doing was running the stress test against every architecture present in the Mesos pool, and from then on we used the lowest number for our capacity planning.

The other thing we learned about is service discovery. Say you have a PHP or Rails application that needs to connect to a memcache server. In the world of static partitioning, the application can just open a connection to a static host:port and assume it's going to get a cache server at the other end. But when we started running the service in Mesos, the host and port are allocated dynamically: when a service starts up, one instance might come up on hostname A, port 3000, and another instance on hostname B, port 3001. Most of the newer applications we were writing had ZooKeeper-based discovery built in: they would connect to a ZooKeeper server, query it for the list of hosts and ports, and use that to connect.
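To illustrate the ZooKeeper-based discovery just described: each live instance announces itself under a ZooKeeper path, and clients read those entries back to learn the current host:port list. Here is a minimal sketch of the client-side parsing, assuming a serverset-style JSON payload per entry; the exact znode layout and field names are an assumption for the sketch, not something taken from the talk.

```python
import json

def parse_member(znode_data):
    """Parse one hypothetical serverset-style entry into (host, port).
    Only entries marked ALIVE are considered usable."""
    entry = json.loads(znode_data)
    if entry.get("status") != "ALIVE":
        return None
    endpoint = entry["serviceEndpoint"]
    return (endpoint["host"], endpoint["port"])

def endpoints(znode_blobs):
    """Turn raw znode payloads (as fetched from ZooKeeper) into a list
    of connectable (host, port) pairs, skipping dead members."""
    parsed = (parse_member(blob) for blob in znode_blobs)
    return [member for member in parsed if member is not None]
```

In practice a client would fetch the children of the service's ZooKeeper path (for example with a client library such as kazoo) and watch them for changes, reconnecting as instances come and go.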
But we had a few legacy applications that didn't have this capability. What we ended up doing for them was setting up a few static proxies. These proxies would query ZooKeeper to get the host and port list; the legacy application would connect to the proxies on static ports, and the proxies would forward those connections to the services running in Mesos. If you have legacy applications, this may be something you have to keep doing for a while. The teams at Airbnb and TellApart have open sourced the way they solved this for themselves: Airbnb released a package called Synapse, which does this using HAProxy, and TellApart released a package called aurproxy, which does something similar.

The last lesson: again, we had services come and tell us they were seeing increased latency. This time it wasn't caused by the job being throttled; we had enough CPU and our service wasn't being throttled, so we had to investigate what was going on. It came down to a very simple problem: we had a few noisy neighbors. If you have run services on AWS, you'll know about this. If your process gets co-located with another process that does a lot of disk IOPS, and you are doing synchronous writes to the disk yourself, your process can get blocked because of the other process. Once we realized this, we moved our writes to be asynchronous wherever we could, so we wouldn't get blocked by another process. However, that's not possible when it comes to the network.
If you get co-located with another process that receives a lot of incoming network traffic, you can still run into this problem. So even though Mesos uses Linux control groups for isolation, you can still sometimes have a noisy neighbor affecting your process.

That's all I had. If you have any questions...

Q: Can we say Mesos is an alternative to Docker?

A: Docker is a container technology. Mesos can use Docker to provide the isolation. If you want to run an Apache application, you can create a Docker image for it, and Mesos will make sure it starts up on a machine that has the resources to run the job. If the container shows problems, Mesos takes care of stopping it, restarting it, and enforcing its quotas. So it can use Docker, but it is not a replacement for Docker.

Q: So it provides isolation of the application?

A: It provides scheduling and isolation.

Q: And can the problems we face with high spikes in CPU be mitigated by using Mesos?

A: You mean: if you're running a process in Mesos and it has spikes in CPU, what can you do to protect against that? It's a trial-and-error learning process; there is no definite formula as such. What we have seen is that Linux control groups have good monitoring; they provide enough metrics. If your job is getting throttled, you can monitor it, see how much throttling there was, and then provision your CPU to handle it. But you cannot expect from day one that your jobs will never get throttled.
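The cgroup throttling metrics mentioned in that answer live in the `cpu.stat` file of the job's cgroup. A small sketch of reading them; the path is illustrative, while `nr_periods`, `nr_throttled`, and `throttled_time` are the counters the kernel's CFS bandwidth controller actually exposes.

```python
def parse_cpu_stat(text):
    """Parse the contents of a cgroup cpu.stat file into a dict.
    nr_periods: scheduling periods seen; nr_throttled: periods in which
    the job exhausted its quota and was frozen; throttled_time: total
    nanoseconds spent frozen."""
    stats = {}
    for line in text.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

def throttle_ratio(stats):
    """Fraction of periods in which the job was throttled; a persistently
    high ratio suggests requesting more CPU or adding instances."""
    if stats.get("nr_periods", 0) == 0:
        return 0.0
    return stats["nr_throttled"] / stats["nr_periods"]

# Typically read from a path like /sys/fs/cgroup/cpu/<job>/cpu.stat
# (the exact path depends on the cgroup hierarchy in use).
```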
It's better to plan on keeping an eye on the metrics and increasing the CPU, or the number of instances, iteratively if you see throttling.

Q: I have two or three questions for you. First...

(Moderator: We're only allowing one question per person.)

A: You can tweet at me and I'll respond.

Q: Then I'll make it one short thing, on the noisy neighbors you talked about. In our setup, we have seen Mesos overprovisioning the apps: even though a particular slave has only eight CPUs, the cumulative sum of the resource requirements of the tasks allocated to it is generally more than eight. Have you seen that sort of issue causing problems when you have to scale up your apps suddenly?

A: That's a good point. I don't know off the top of my head whether the Mesos site reliability team has seen this issue, but we do aim for oversubscription. We keep an eye on the CPU, the number of jobs we're running, and how many resources we have left, so we try to stay ahead of the curve there.

Q: I want to know about the networking part. How does the container networking work? And we've heard about Mesosphere; can you throw some light on how you do autoscaling and descaling using Mesos, or whether Mesosphere can be a better solution?

A: To answer your first question, Mesos uses cgroups for network isolation. One of the best pieces of documentation I have seen for cgroups is from Red Hat.
If you go to the Red Hat documentation and look at the networking part, that will give you more detail.

Q: I'm not asking about contention. With Docker networking you have multiple containers on the same host, and you map to ports. But how does the auto-discovery work, so that you can connect to the exact container port from the outside world?

A: That is where service discovery comes in. The port you get assigned is dynamic, so you cannot assume which port you're going to get. The scheduler also talks to ZooKeeper: whenever your service starts up, it can announce itself under a ZooKeeper path, and anyone who wants to connect to your service can query that path; they get back JSON entries with host:port pairs and use those to connect. If you use Finagle, Twitter's networking library for the JVM (it's Scala based), that has ZooKeeper service discovery built in. Otherwise you might have to use something like aurproxy.

Q: So basically whenever a container comes up, it registers itself with service discovery, and Mesos handles that part for you. The second question was about autoscaling and descaling.

A: I haven't worked with AWS. We have our own metrics: we monitor the number of jobs and how much CPU we're using up, and we plan accordingly. But I don't know how autoscaling would work with AWS.

Q: For the Mesos cluster you're talking about, is there central storage being maintained? If yes, what technology is used for it?
A: Sorry, can you say that again, please? ...Okay, I think I understand your problem. If you're running your job on Mesos, you should not assume that you will keep the disk; you cannot assume the content of the disk at all. So if you need static content or something like that, it's better to package it with your Mesos package. Otherwise, at runtime you can make calls to Hadoop/HDFS and pull in the content, fetch it from a database, or pull it from wherever you have it hosted.

Q: How does Mesos compare with Docker plus Kubernetes?

A: I haven't experimented with Kubernetes, but from what I understand it's solving the same problem. The thing with Mesos is that Twitter has used it at scale, across multiple data centers and a great many nodes, and it's based on the same thinking as what Google has in Borg. I don't know which big companies are using Kubernetes and Docker, so I cannot really comment. But Mesos has helped us save cost and keep our services running, so that's my answer.

(Moderator: We break for tea now, but please sit through a couple of announcements. Thank you.)