Hi, my name is Chen Huey and I'm with XO Group. For a little background, XO Group is TheKnot.com, the largest wedding website in America. I'm going to talk a little about our Mesos journey. There's often this gap between the time you make the decision to use a piece of software and when you actually get to use it. I'm from New York, and we have these trains that, unlike here, don't have the fancy platform doors. There's a gap, and at one station the platform actually moves to fill it in. That's really what this presentation is about: how you fill the gap between "hey, I want to use Mesos" and getting it into production. The two gap items I'm talking about are, first, you need a way to provision your infrastructure — in our case that's AWS — and second, the item on the right, the software, the software being Mesos. So with DCOS, there are a couple of ways to do it. One way to run Mesos is to use DCOS, and there are a couple of ways to do that: they provide a CloudFormation template — I copied what's on the right from their website — or there is an advanced installer. The problem with the CloudFormation template is a couple of things. If you want to run it in production, you really need five masters: with five masters your quorum is three, so you can survive the failure of two, whereas with three masters you can only lose one. If you really want that level of redundancy, you want five, not three — and the template they give you only lets you have three masters, so at some point you have to go in and modify it. Also, it's for AWS, which, as the previous speaker made a good point, you can't reach from here — that's why my demo is recorded; I can't get to any of my instances from this room. The other way to do it is their advanced installer, which requires you to set up a bootstrap node and then bring up the machines.
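As an aside on the quorum math: with the Mesosphere packaging, each master flag lives as a one-value-per-file entry under `/etc/mesos-master/` (a hedged sketch of that layout; the file holds only the value, comments here are annotation):

```
/etc/mesos-master/quorum    -> contains just:  3
    # five masters with quorum 3 tolerate two master failures;
    # three masters give quorum 2 and tolerate only one
/etc/mesos-master/cluster   -> contains a cluster name, e.g.  prod
```

This one-flag-per-file convention is also why a file-oriented configuration tool fits Mesos so well, which comes up again later.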
But the problem there is that it doesn't help with the first item, provisioning infrastructure — there's nothing in it to do that. So the other way you can run it is what I call the "no-DCOS" Mesos: plain Mesos from the packages you can download. We use a Debian-based OS at XO, and luckily Mesosphere packages those up, so we were able to use the package manager to install them, combined with a little bit of homegrown orchestration. When we looked at DCOS, there were some advantages: it's fully packaged and integrated. There's an AMI for it — they supply it pre-built in the marketplace, so you can just pull it down — and there's a command-line interface, the DCOS CLI. The downsides: the CloudFormation template, like I said before, is not very friendly — if you want to run five masters, which I imagine you would, you have to go in and modify it. And it installs all the components on every single node: all the binaries for both master and slave, regardless of whether the node is a master or a slave. The advanced installer, like I said, doesn't cover how you get your infrastructure provisioned. And upgrades are unclear with an AMI — do you have to destroy the instances and replace them? It's not entirely clear whether you can do them in place easily. So in a way, using DCOS is like cloning, mainly because you have this identical AMI image that they use to deploy everywhere — with CloudFormation, that's essentially how they do it. The drawback of this approach — CloudFormation plus an AMI — is that the last 10% of configuration is all done in CloudFormation user data, which, if you're familiar with it, is very difficult to debug and difficult to parameterize.
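As a sketch of what the "no-DCOS" install looks like on a Debian-based OS — the repo line follows the Mesosphere apt repository published at the time; treat the codename and package list as illustrative for your release:

```
# /etc/apt/sources.list.d/mesosphere.list
deb http://repos.mesosphere.com/ubuntu trusty main
```

After importing Mesosphere's signing key, `apt-get update && apt-get install mesos marathon chronos` pulls down the packaged binaries, and later upgrades go through the ordinary `apt-get upgrade` path rather than an AMI rebuild.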
It's a bit of a maintenance nightmare because you're basically mashing a shell script into JSON — it's a very big mess. The other thing is that small changes require you to rebuild the entire AMI. So, why we did what we did. Through evaluation, we ended up deciding to use Ansible to configure the software, mainly because it's pretty lightweight and ideally suited to this kind of configuration. With Mesos, the configuration is mostly in files — a whole bunch of configuration files that need to be set up; if you remember from the previous slide, it's just directory after directory of files — and Ansible is pretty good at doing that. The other thing is that a playbook is essentially a batch file: the order of operations is guaranteed from top to bottom, and you can declare separate tasks for each component. So we created separate tasks for each of the pieces — one for common, one for Docker, Chronos, masters, and slaves — and that's how we're able to install only the pieces that belong on each node. The other thing that comes out of Terraform is the inventory that Ansible uses. Terraform has an output function, and you can have it spit out all the IP addresses for your masters and slaves. This is what user data looks like — this big mess of bash inside JSON — and this is a relatively simple piece that I took out; if you're trying to do more complicated things, it gets much, much more unreadable. This is what I was talking about: we have separate tasks for each component, nicely separated — Chronos, Docker, Mesos-DNS, Mesos masters, slaves. We even have some XO-specific stuff in a first-run task. We're also able to leverage the Mesosphere packages: like I said, we use Ubuntu, which is deb-based, so we can install and upgrade using the package manager.
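A minimal sketch of that per-component task layout — the file names, group names, and components listed are illustrative, not our exact tree, and this uses the modern `import_tasks` syntax:

```yaml
# site.yml — each component is its own task file, applied only where it belongs
- hosts: masters
  become: true
  tasks:
    - import_tasks: tasks/common.yml
    - import_tasks: tasks/zookeeper.yml
    - import_tasks: tasks/mesos-master.yml
    - import_tasks: tasks/marathon.yml
    - import_tasks: tasks/chronos.yml

- hosts: slaves
  become: true
  tasks:
    - import_tasks: tasks/common.yml
    - import_tasks: tasks/docker.yml
    - import_tasks: tasks/mesos-slave.yml
```

The `masters` and `slaves` groups come from the inventory, which is exactly what the Terraform output feeds.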
Which, interestingly, I think was brought up in a previous presentation as a pain point — and that's true; it's kind of one reason why we ended up using the deb packages. I ended up recording the demo, so I'll run it from back here and show you the Terraform first. Terraform has a feature that lets you pass in variables through environment variables: a capital TF_VAR_ prefix plus the name of the variable. That's how I set up my access keys, which I obviously wasn't going to record and put up there for everyone to see. You can see that in one step you can bring up all your infrastructure — in this case, three masters and one slave. There was one thing I couldn't figure out: why you have to run it twice. It has something to do with one of the dependencies not matching, but if you run it twice, everything comes up. Another good thing about Terraform that I didn't actually cover in the slides is that it also manages your state. At the end of this — it's a bit off the screen — you can see what comes out, which you end up feeding into Ansible. And let me run the other one for you — oh yeah, I need this virtualenv for Python, that's what's going on there. So Terraform manages the state. If you want to go, say, from five masters to seven, all you have to do is modify a parameter — change it to seven — run terraform apply again, and it will bring up the additional masters without changing anything else, and it keeps the state up to date. You just have to keep those state files; they become something you should check in and keep safe.
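A hedged Terraform sketch of what that parameterization looks like — written in modern HCL syntax, with variable and resource names made up for illustration:

```hcl
variable "master_count" {
  default = 5
}

variable "mesos_ami" {}

resource "aws_instance" "master" {
  count         = var.master_count   # change to 7 and re-apply to grow the ensemble
  ami           = var.mesos_ami
  instance_type = "m4.large"
  tags = {
    Name = "mesos-master-${count.index}"
  }
}

# The output is what gets fed into the Ansible inventory.
output "master_private_ips" {
  value = aws_instance.master.*.private_ip
}
```

Secrets arrive the same way — something like `export TF_VAR_mesos_ami=ami-12345678` before `terraform apply` — so they never land in the recorded session or in the checked-in files.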
You can see it just runs through, and the main takeaway from the demo — besides, you know, a lot of text — is that it's literally two commands you execute, and you can bring the whole thing up. Unfortunately I can't show you from here, but they are up at Amazon: a fully functional cluster. I think these recordings took about ten minutes total, and only because of where I was it was a little slow; normally, from our office, it's very fast. Private Docker registry. We use Amazon ECR, which is a private Docker registry. If you use Docker Hub private repos you'd have the same issue: you have to distribute a Docker authorization token onto all the slaves where your containerizer is running, in order for them to log into those private repos. So you need to do this, and we ended up doing some engineering work for it. It also refreshes the token, since the token expires — I think every 12 hours; we happen to refresh every six, but you could do it every twelve. Essentially, this is how it works. The blue boxes are two Docker containers, and they're triggered by Chronos. The first one is pretty simple: it just uses the AWS CLI to log into ECR and generate the auth token, then it puts that token into an S3 bucket — that's why we create that S3 bucket in the Terraform template, if you saw that going by. The other Docker container runs after the one that generates the token: it pulls the token from the bucket and pushes it out to the slaves. It does this every six hours, and all of it is governed by Chronos. This is the config file that drives it: it indicates the S3 bucket name and the role name — instead of using keys, everything is role-based.
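A sketch of what the Chronos job for the token-generating container might look like — the field names follow Chronos's job JSON, but the image, bucket, and schedule values are made up for illustration:

```json
{
  "name": "ecr-token-refresh",
  "schedule": "R/2016-09-01T00:00:00Z/PT6H",
  "container": {
    "type": "DOCKER",
    "image": "example/ecr-token-generator:latest"
  },
  "command": "aws ecr get-authorization-token --output text --query 'authorizationData[].authorizationToken' | aws s3 cp - s3://example-token-bucket/ecr-token",
  "cpus": 0.1,
  "mem": 128
}
```

The second container — the one that pulls the token from the bucket and pushes it to the slaves — would be a similar Chronos job scheduled to run after this one on the same six-hour cadence.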
The slaves all assume that role, so you don't need keys — the role provides the permissions. Then it copies those files over to the slaves, where the list of slaves is provided by the Terraform output. It's really simple: we're just using the AWS CLI to create the JSON — we don't really want to get into that business ourselves, and it generates it nicely. Here's some code where we're doing the role assumption, including the files, and roughly what all of that looks like. So where does that leave us? We went through it, we configured all these things, we bring up the boxes, and we're able to do it in two commands. And then an interesting thing happened: around the time this talk got accepted, we also made the decision at XO not to use Mesos for our container cluster. I'll talk a little about the reasons why. Some of it has to do with the Marathon interface — it's a little clunky. These are your two choices when you're deploying containers with Marathon: either this UI, or you're sending this giant blob of JSON. That isn't the end of the world, but at the same time, our developers would now have to learn another interface — and that's nothing specifically against Mesos, because they'd have to do the same thing with Kubernetes. We're interested in what SwarmKit offers: one interface that developers already use on their desktops, and that they'd also use to push their containers into production. Some things we had designed but ended up not building: auto-scaling the instances. There are quite a few metrics available for deciding whether to increase or decrease your number of instances, and we'd use IAM roles to facilitate that.
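For context, the "giant blob of JSON" in question is a Marathon app definition. Even a minimal one looks something like this — the field names follow Marathon's API, but the app itself is hypothetical:

```json
{
  "id": "/homepage",
  "cpus": 0.5,
  "mem": 256,
  "instances": 2,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "example/homepage:1.0",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 8080, "hostPort": 0 }
      ]
    }
  }
}
```

You POST that to Marathon's `/v2/apps` endpoint — which is the extra interface developers would have to learn, versus reusing the Docker tooling they already run on their desktops.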
Again, the slaves would have IAM policies that allow them to launch new EC2 instances. So in the end, we're not completely abandoning Mesos: we're going to use DCOS to run Elasticsearch. Our current Elasticsearch is the Amazon-managed service, which is a few versions old, so this gives us a nice way to bootstrap Elasticsearch ourselves. But for our containers, we're going to use Docker Swarm — we're already doing that on a limited basis for our global proxies and homepages, and we'll be expanding it as we build some additional tooling. Why? We don't like the fact that there's a proprietary CLI. There's also, as I mentioned, the awkward interface — or maybe awkward is the wrong word; it's really just another interface that we'd like to avoid having to support. And there's ZooKeeper's latency sensitivity: in our testing we found it's very sensitive even across availability zones, and we'd like to run service discovery across regions, which we're able to do using Consul and Docker Swarm. So yeah, that's ultimately where we're ending up: DCOS to run Elasticsearch and Docker Swarm to run our containers. And I guess that's it. Thank you — any questions? I guess not. Have a great afternoon.