Hello everyone, my name is Jay Chin. I'm part of the infrastructure engineering team at OpenTable, where I lead the SRE team. Just a bit of background about me: I come from financial services. Anyone from financial services here? Yeah, one, two, yep, maybe, yeah. When I came to OpenTable, it was a bit of a culture shock for me. At the banks I used to work for, I worked with compute grids: lots of cores running risk calculations overnight, a shared grid used by different teams across the bank. Whenever we introduced some new piece of middleware, there was always resistance. Development teams were highly resistant to change: "I don't want to move to this new middleware." It was the exact opposite when I joined OpenTable. I think it was in my second week that one of the teams came to me and said, "We want to move to Mesos. When can you help us do that? We want to make those code changes and move to Mesos immediately." I was shocked, and when I dug into it, it was not just the technology. Of course Mesos is great, but it was the things we built around Mesos that made the transition so much smoother for the development teams, and that's why everyone wanted to jump onto it. So I'm here today to share what we did to make Mesos greater at OpenTable, and why our developers and engineers love Mesos so much.

Okay, so some numbers. How many of you actually use OpenTable? One, two, three. It's not that big in Europe; it's quite big in North America, which is our biggest market. We'll be 20 years old next year, so it's not a startup; it's quite an old company. We've done about 1.4 billion online reservations. We seat about 23 million diners per month. We have 58 million verified reviews of restaurants, so even if you're not booking through OpenTable, it's a great place to look for restaurant reviews. We have 43,000 restaurants globally. And the stats from 2017 showed that 55% of all the reservations we take come from mobile devices. Our busiest day is... can anyone guess what our busiest day is? Valentine's Day, right? We get something like 500 to 600 searches per second. Some stats on Valentine's Day: 43% of all Valentine's Day reservations are made in the week leading up to the holiday. So, with Valentine's Day next week, most people are booking this week. Last year we had something really unique: the earliest reservation for Valentine's Day 2016 came in on Valentine's Day 2015. That's dedication, right? Maybe that guy had a bad experience booking a restaurant, but, you know... Anyway...

So, what does our tech stack look like? Well, being an old company, before 2013 we ran everything as a single monolithic .NET application. It was a shared code base; all the applications contributed to it, and it was a single tech stack across all our data centers globally. Every two or three months, when we did a release, it looked like the NASA command center. All hands on deck, right? All the operations teams, the development teams, everyone was there, because this was going to be a big release. We'd spent two or three months building those features; now let's do a big bang deployment. If things went wrong, maybe we could fix the bugs on the spot and make the release go well. So everyone was there, watching the metrics, while we did the deployment. Sometimes things went well. Sometimes...
Things didn't go that well, right? A developer could have been working on a feature for two or three months, and if something went wrong that was no fault of his own (someone else introduced a bug), his feature had to be rolled back. That's probably him there.

So what we did to solve this problem, around 2013, was move to SOA: we split the big monolith into individual services. Although it looks like a single website, there are different features in there, for example search, reviews, and emails, and all of these were split into separate services. And it was great, because we told the developers: now split yourselves into teams, use whatever technology you like, do whatever you want, go wild. And the developers went out and looked at bleeding-edge technology. So we had .NET, we had Node.js, we even had Clojure. We had everything, and because these were microservices, that was fine; they no longer depended on a single shared code base.

With this came independent releases of features and services. From a two-month release cycle, we went to thousands of deployments to production a week. Product features could come out really, really fast, iteration was good, and the product teams were really happy with what we did, because instead of a two-month release cycle, product features now shipped in one or two weeks. Which was great.

And how did we host those services? Through virtualization. You can think of each of those services running on its own host as a single virtual machine. You can see the data centers there, all those VMs running in them, and it was okay. The developers had to write a bit of Puppet code to get those virtual machines up and running, and when it came to scaling, all we needed to do was clone the VMs. Simple. But by the end of 2013, I think we had around 1,200 VMs running across all our data centers. Loads of virtual machines.

The infrastructure engineering team, which is the team I'm with, had a different viewpoint on what was happening. So that's us; a day in the life of us. This is what we were doing: herding cats all day. Everyone was doing things on their own. There was no standardization. Everyone had their own logging, some people had their own metrics, and some people even wrote Puppet code to create their own logging cluster, for example, or their own metrics collection cluster. We spent most of our day just reviewing Puppet infrastructure code. How many of you have done Puppet here? Do you like it? Raise your hand, anyone who likes writing Puppet code. Right.

It's even worse for developers, because switching from developing your app to writing Puppet code is context switching, and we had Puppet code for everything. Not just bringing up virtual machines, but also monitoring and metrics; everything was done through Puppet, infrastructure as code. So infrastructure engineering felt we were partly to blame for introducing this. This is what the whole development lifecycle looked like from a developer's point of view. There are basically four stages there: local build, provision, metrics, monitoring. In the local build, developers wrote code.
Then they had to write infrastructure code to test it in their local environment. All the developers pulled down the central Puppet repository and wrote Puppet code to bring up a virtual machine instance that would run their code. Things like dependencies and extra services, running cron jobs for example, all had to be written by the developers. Once they had tested this on a local Vagrant build (we use Vagrant), they would raise a review with the infrastructure teams to push it out into production. So the infrastructure teams were daily reviewing all this infrastructure code, and only once it passed review could the VMs be provisioned out into our data centers. I think our Puppet code base grew to about one gigabyte of source code. That's how big it became, because everyone was doing something different. Same with the metrics part: it was also Puppet code. Everything was controlled through Puppet, and monitoring was the same. The good thing is that everything was infrastructure as code; it was all there, versioned. But this is how our developers felt about Puppet. Something had to change.

Around 2014, we started to look at Mesos, and we explored the possibility that instead of using VMs, we could run these as Mesos services. That solved a few problems. Instead of writing Puppet code to deploy virtual machines into our data centers, this is how the development lifecycle looked after we had Mesos. Look at the purple boxes (it's blue here, purple up there). Now, after writing code, all they need to do is build a Docker image, test it, push it to a Docker repository, and then push those Docker images to the various Mesos clusters. We have a few Mesos clusters, some running in the cloud, some on-premise, spread globally: we have data centers in London and in the United States, and each service had to deploy to one of those data centers. The metrics part remained the same: they still had to write Puppet code, and monitoring was the same.

Looking at this, the infrastructure team thought: okay, let's see what we can do with metrics. Perhaps we could standardize the way metrics are collected, since everyone is on Mesos now and we could get most of the metrics from the APIs, for example the Mesos APIs and the Singularity APIs. We use Singularity as the scheduler. So we designed a metrics pipeline. We wrote something called mesostats (it's open source, the URL is up here) that talks to Singularity and Mesos, collects those metrics, pumps them into a Kafka queue, and from there they flow into a Graphite cluster. Anyone using Graphite out there? Yep, lots of Graphite users. I love Graphite. The volume of metrics we pumped in was so much that we had to replace the default carbon relay, as you can see there, with carbon-c-relay. Do you use carbon-c-relay? Yeah, if you run into performance issues, it's great to replace the stock relay with carbon-c-relay. And then we had Grafana as the front end, where all the dashboards were made. One significant thing we did with Grafana was to use grafanalib: we created templates for dashboards, so any application team that started using Mesos would get a dashboard created automatically for them.
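To give a feel for how that kind of dashboard templating can work, here is a minimal sketch using the open-source grafanalib Python library. The Graphite metric paths (`mesos.tasks.<service>...`), panel layout, and metric names are my assumptions for illustration; OpenTable's actual templates and naming scheme aren't shown in the talk.

```python
import json

from grafanalib.core import Dashboard, Graph, Row, Target, Text
from grafanalib._gen import DashboardEncoder

# Hypothetical Graphite path written by the metrics pipeline; the real
# naming scheme used by mesostats isn't described in the talk.
METRIC_PREFIX = "mesos.tasks.{service}"


def service_dashboard(service: str) -> Dashboard:
    """Generate the standard per-service dashboard from one template."""
    prefix = METRIC_PREFIX.format(service=service)
    return Dashboard(
        title=f"{service}: Mesos service overview",
        rows=[
            # A text panel, so teams understand the auto-generated graphs.
            Row(panels=[Text(
                title="About this dashboard",
                content=f"Auto-generated for **{service}**. Red = requested "
                        "resources, yellow = actual usage.",
            )]),
            Row(panels=[
                Graph(title="CPU usage", dataSource="graphite",
                      targets=[Target(target=f"{prefix}.cpus_usage", refId="A")]),
                Graph(title="Memory: requested vs used", dataSource="graphite",
                      targets=[
                          Target(target=f"{prefix}.mem_requested_bytes", refId="A"),
                          Target(target=f"{prefix}.mem_rss_bytes", refId="B"),
                      ]),
            ]),
        ],
    ).auto_panel_ids()


# Emit dashboard JSON, ready to be pushed to Grafana's HTTP API.
print(json.dumps(service_dashboard("restaurant-search").to_json_data(),
                 cls=DashboardEncoder, indent=2))
```

Because the dashboard is generated from code, adding a panel to the template adds it to every service's dashboard on the next run.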
So just by starting to run on Mesos, you get a dashboard that shows your CPU usage, memory, and various other stats, all for free. You don't even have to touch a thing. The other nice thing we did with the Grafana dashboards was to include not just graphs but text. You can't see the text there, but it explains what those graphs mean. If something is created automatically for you, most of the time you don't realize or even know what it is, so we had to have help text in there, and because it was all templated, every dashboard got it. And developers were really, really happy: deploy a service on Mesos, get this dashboard, and even get help text explaining what the graphs are. Everything was automated.

The other thing those dashboards helped with was resource usage. This is an example Singularity task which requests 256 megabytes of memory. On the dashboards, we show the requested resources as well as the actual usage. The request is in red; the actual usage is in yellow down there. So you can see memory has been over-provisioned for this box. And the orange graph there is a recommendation from the infrastructure team: this is probably what you should change it to. A lot of developers, especially new ones, would just put in random numbers: 256 megabytes of memory, 0.1 CPU; sometimes "I need more CPU", so 1.0 CPU. Those numbers are fairly arbitrary: "I'll just come up with some number and use it." But they cause real waste of resources. Having graphs like these gives us a bit of governance over the resources actually being used, and our finance team really liked it, because it saved quite a lot of money. All the resource usage was optimized.

The other thing we could get from those metrics and usage graphs was right-sizing of our cloud instances. For example, changing from M4 instances to R3 instances saved us 20% of our cloud usage cost. It's all about having common metrics and being able to make sense of them, so that you can make informed decisions like that.

After the automated dashboards, this is what the development lifecycle looked like. We've automated the top bar. The only optional metrics work left was when an application had some custom metric it wanted to send, say JVM heap size; for that they'd need to write extra code. But all the base health and welfare metrics, CPU, memory, everything, were collected automatically.

The infrastructure team then started to look a bit more into this, and we had the final piece of the puzzle to solve in terms of monitoring. We also looked at the way our developers were doing deployments, and we wanted to make things a bit better, as usual. So we developed something called Sous. It's open source too. It's the global deployment tool we use internally, and what it does is abstract away all the cluster information. All a developer needs to do now is write code and run a Sous deploy (an internal command). It builds the Docker image with some extra meta-information, pushes it into our Docker repository, and deploying to any of our Mesos clusters or environments then involves just a manifest change. So they have a single deployment manifest file; I'll show you the contents of this file in a moment.
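In the meantime, to make the idea concrete, here is a rough sketch of what such a manifest holds, written as a Python structure. The field names below are invented from the description in the talk (service name, owner emails, per-cluster version, instance count, resources, monitoring); the real Sous manifest schema may well differ.

```python
import json

# Hypothetical global deployment manifest for one service. Every field
# name here is illustrative, not Sous's actual schema.
manifest = {
    "service": "restaurant-search",
    "owners": ["search-team@example.com"],   # who to contact about the service
    "deployments": {
        "north-america": {
            "version": "0.1.0",              # deploying a new build = changing this line
            "instances": 4,                  # how many tasks the scheduler should run
            "resources": {"cpus": 0.5, "memory_mb": 512},
            "monitoring": {"health_check": "/health"},
        },
        "london": {
            "version": "0.1.0",
            "instances": 2,
            "resources": {"cpus": 0.5, "memory_mb": 512},
            "monitoring": {"health_check": "/health"},
        },
    },
}

print(json.dumps(manifest, indent=2))
```

The point is that one reviewed file describes everything a service needs in every cluster, and the deployment system does the rest.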
So, for example, if I want version 0.1 to run in the North America cluster and the London cluster, all I need to do is change the contents of this file, and the Sous service does the rest.

The other motivation for having a central deployment system like this was governance: knowing what is being deployed and what the versions are. The security teams could have hooks in there that look at the versions of the libraries being packaged in, to flag any that are out of date. We can also get statistics based on deployments: if something breaks on our website, having a global deployment manifest shows us, okay, what were the last three changes that went into the OpenTable site? So this made a lot of sense for us.

Now, this is what a global deployment file looks like. In the top line you can see it's just the service name, then the emails of the people who own the service, similar to a YAML deployment file, and then we have the different clusters, instance counts, and the memory thresholds that we need. The other thing we added in there is the ability to define monitoring as well. So with this single file, you define the versions of your code that you want in each environment, you specify the number of instances you want, you specify the types of monitoring you want, and everything is automated from there.

And then we get to this. With Sous, plus all the automated monitoring and metrics collection, this is where we got to. With all the effort we put in to make everything easy for developers, the uptake of Mesos was very high. Everyone wanted to move to Mesos, because previously they were writing lots of Puppet code to do all the plumbing, and by moving to Mesos, all they needed was their code and a single central deployment file. That was one of the main high points of our Mesos rollout.

The last thing I want to talk about is logging and how we do it. One problem we had with microservices was that everyone was logging on their own. Different teams had different standards; there was no consistency. For example, restaurant_id is logged by one team, the next team calls it rid, and another team calls it rest_id. The same field can be named differently. The other example is field units: if you're logging a duration, for example how long a request took, the field name could be the same, duration, but one team could be using milliseconds and another seconds. There needed to be a way to standardize all of this if we were going to make full use of the logging data and make sense of it.

So what we did was create a global unified data model for logging. We also built a central logging system based on Logstash, Kibana, and Kafka. Anyone who wants to use the central logging system first needs to define a logging schema, and that schema has to be reviewed before it gets accepted. What that does is allow us to ensure the uniqueness of the fields, and to ensure that all the fields match one another. And from that we get to build pretty cool stuff, because once the fields match, we can see a request ID from one service carried on to the next service. Every request that comes into OpenTable has a specific request ID; it's a UUID, and we use it to track the request across every microservice we have.
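To give a flavor of what a reviewed logging schema buys you, here is a minimal sketch in Python. The schema format, the validation, and the field names are assumptions for illustration; this is not OpenTable's actual data model.

```python
import json
import time
import uuid

# Hypothetical unified logging schema: one canonical name and type per
# field. The talk says schemas were reviewed before acceptance; the exact
# OpenTable schema format is not public, so this is an illustration.
SCHEMA = {
    "request_id": str,     # UUID shared by every service handling a request
    "service": str,
    "restaurant_id": int,  # one canonical name, not rid / rest_id
    "duration_ms": int,    # unit baked into the name: no seconds-vs-ms drift
}


def log_event(**fields):
    """Validate an event against the canonical schema, then emit it as JSON."""
    for name, value in fields.items():
        if name not in SCHEMA:
            raise ValueError(f"unknown field {name!r}: not in the logging schema")
        if not isinstance(value, SCHEMA[name]):
            raise TypeError(f"{name} must be {SCHEMA[name].__name__}")
    fields["timestamp"] = time.time()
    print(json.dumps(fields))  # in production this would ship via Kafka/Logstash


log_event(request_id=str(uuid.uuid4()), service="search",
          restaurant_id=12345, duration_ms=72)
```

Once every service agrees on names like request_id and duration_ms, log lines from different services can be joined and compared directly.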
So we built this tool called Timeline. It shows a request flowing through the different microservices. Again, it's open source; feel free to go and have a look if you want. Any request that comes into the OpenTable website, we can look at as a timeline. Here you can see those green bars are when things actually hit the service named on the left. I'll do a quick demo of this after the talk. What this allows us to do is identify bottlenecks in services. For example, if you look at the graph here, you can see a bit of white space on the left: why does that service start only after 70 milliseconds? And then you can see those two long bars there; those are probably candidates for optimization. Having tools like this lets teams go in and look at the dependencies between services, and at how to optimize them.

All right. Oh, actually, it's demo time. So I'll quickly... This is the OpenTable website. Okay, I'm going to share something with you. You can tell everyone that it runs on Mesos, because if you scroll down to the very bottom of our website (it's a trick: highlight the invisible text down there), you can actually see Mesos there. It runs on mesos slave32-prod. And in this request down here, we have the version names, the builds. But the important thing I want to show you is the request ID. Anything that comes into our OpenTable website gets a unique ID, down there. So I'm going to take the request ID from down here and paste it into our Timeline tool. Sorry about the resolution. There you go. So it's real time. There you can see the services on the left, and just a hit on a single page involves quite a few microservices. You can see the different restaurant APIs, the reviews API. And clicking on one of these shows the log line that was used to generate it. Using this tool, teams can actually see what the bottleneck is if the website is slow. It even has logs: you can find the logs by clicking on this. If I go up here and paste this in, we know where the log files are, we know which Mesos host it ran on, various things like that. So having a global schema for logging allows us to build fancy tools like this.

This site shows OpenTable reservations happening in real time. The resolution is a bit small, sorry about that, but it's a global map that shows all the reservations. Even now, you can see multiple reservations coming in from North America and different countries.
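Since every service logs the same canonical request_id field, a Timeline-style lookup is essentially a single query against the central log store. Here is a minimal sketch, assuming the logging pipeline lands in Elasticsearch behind Kibana; the host, the logs-* index pattern, and the document fields are assumptions, and Timeline's real implementation isn't described in the talk.

```python
import requests

# Hypothetical address of the central log store behind Kibana.
ES = "http://elasticsearch.example.com:9200"


def request_timeline(request_id: str):
    """Fetch every log line tagged with a request ID, oldest first.

    Because all services log the same canonical request_id field, one
    query reconstructs the whole call graph for a single page hit.
    """
    body = {
        "query": {"term": {"request_id": request_id}},
        "sort": [{"timestamp": {"order": "asc"}}],
        "size": 1000,
    }
    resp = requests.post(f"{ES}/logs-*/_search", json=body, timeout=10)
    resp.raise_for_status()
    for hit in resp.json()["hits"]["hits"]:
        doc = hit["_source"]
        # Print one line per service hop: when it ran and how long it took.
        print(f'{doc["timestamp"]}  {doc["service"]:<24} '
              f'{doc.get("duration_ms", "?")} ms')


request_timeline("3f1c9a2e-0000-0000-0000-000000000000")  # example UUID
```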
So, key takeaways when deploying Mesos in your environment. Map out the developer workflows, and constantly look for opportunities to standardize, automate, and enhance. Make metrics and monitoring part and parcel of every Mesos service; that's how you get quick adoption of Mesos, at least for us that worked quite well. Engineers don't always make the best choices on resource usage, so help them make an informed choice with metrics, monitoring, and the tools you can build. Having a common deployment pipeline lets us build tools that hook in and standardize those microservices, in terms of security for example. And finally, a global data model for logging allows us to do consistent analysis, because the fields are consistent, and lets us build tools on top of it that make analysis and troubleshooting much, much easier.

So that's the end of my talk. I'll leave a bit more time for Q&A, if there is any.

Hi, Tim. Hi, Jay. I always ask this question now: how did you solve your database problem? Database problem? Yes. How does that work in your whole self-provisioning, teams-run-their-own-things Mesos world? Okay. So for databases, we haven't looked at persistent storage yet. Our databases still run off the great Puppet code base, but we're looking into persistent storage in the near future. For example, one candidate is the Redis instances that we have, so we're looking into those first. And they would run on your cluster? Yes, that's what we're looking at right now, but they currently don't. Clear answer. Thanks.

Hey, thanks for the talk. How have you found Singularity? Can you maybe tell us a bit about how you've used it? Any war stories? Yeah, sure. Our use of Singularity goes back a long way, back to 2014. At that time we evaluated quite a few frameworks, and Singularity seemed the easiest of them all; plus the developers at HubSpot who build Singularity were very close to our core platform team, and they were the ones who brought Singularity in. In terms of using it, we've found it easy. The interface is very easy, the APIs are great, and it's working out for us. Of course we're evaluating Marathon and things like that, but as of today we haven't found a need to change from Singularity to something else.

Just a very brief follow-up: could you give me some picture of the scale, the number of apps you're running under each Singularity? Do you have a Singularity instance per Mesos cluster, and how many apps or tasks are you roughly managing, if you can share it? Yeah, yeah. All our clusters run a single Singularity instance, though with three nodes behind it. The size scales quite a bit, because, as you know, on Valentine's Day we go quite big, but we've gone up to probably about 800 slaves; during Valentine's Day it goes up to 800, 900 slaves per cluster. The number of services would be in the range of 120, which would be about 2,000 to 3,000 tasks running. Those are rough numbers, because depending on the time of year we scale quite a bit up and down. Great, thanks very much. All right, thank you. Thank you.