Okay, it sounds like we're ready to get started. Thank you, everybody, for coming, and welcome. I'm very excited to be here at MesosCon Asia in China, the first one, and I'm honored to be a speaker at this inaugural event; I think it's a very important step. I'm also very proud of IBM, my company, for being a diamond sponsor of MesosCon. We're a big supporter of Mesos, and yesterday in the keynote, Ben Hindman announced that two IBMers in my department became committers. I also want to say that a lot of the business development we're doing is focused in China, so China is the hotbed of Mesos development for IBM. I'm also blown away by China so far. I've been here for two days and it's beautiful; Hangzhou is beautiful. The technology, the high-speed trains, everything. I'm only here for another week and a half, but I can already tell it's not enough. Thank you for hosting me; it's a great country.

The topic for today is lessons learned running Watson on Mesos in production. My name is Jason Edelman, and I'm on the IBM Systems team. Like I said, my team is responsible for the bulk of the Mesos contributions in IBM, and we're also using Mesos in one of our products. The co-author of this presentation is Chris Luciano from the IBM Watson Group, on the Watson platform team. He could not be here today; he was supposed to be a co-presenter, and he provided a lot of the content you'll see here. I will try to do it justice, but I can always take questions back to Chris.

Okay, does everybody know what Watson is? Everybody? Maybe some people don't, so I'll summarize. Watson is IBM's main cognitive computing platform. In 2011, so already five years ago, Watson became the first computer to beat humans at Jeopardy, which is an American game show. It's a game show that requires a lot of knowledge and understanding of human language and of content, a lot of information, but also riddles, puns, tricks, things like that, which machines typically can't do very well. IBM was able to get Watson to do it. The other challenge was doing it as fast as humans can: humans can process cognitive information very, very fast because our brains are massively parallel, and it was a challenge to get Watson to keep up. In the end, the machine that did it had 90 POWER7 servers with 2,880 processor cores. It won first prize, a million dollars for IBM, which we gave to charity.

Okay, I wanted to give you that context because Watson, or a form of Watson, is what we put on top of Mesos, and I want to show you how we got there. This is a brief history of Watson. You can see that in 2006, the research division at IBM started to work on it as a project to do something with cognitive computing. They ended up going for this grand challenge to beat Jeopardy, which they did in 2011. After that, IBM looked at how to commercialize this and turn it into something customers could use, first in healthcare, shortly afterwards in financial services, and then in broader industries. These were all basically custom solutions done for customers in those areas. The last one is the one I wanted to point out: the Watson developer cloud. Basically, IBM wanted to open up Watson's services, the valuable capabilities inside Watson, and make them available for developers to start using.
That started in 2013, and that's what we're going to talk about. Has anybody been on Bluemix? Is everybody aware of Bluemix? Good, lots of people are. Bluemix is IBM's developer cloud. There are a lot of services in Bluemix that developers can use to get applications up and running quickly on the web, and this is where you'll find Watson as well. Under Watson you'll see the services; there are 16 services there for now, but there are new things coming all the time, and it's been developing very rapidly. A lot of these services are currently running on Mesos, and we're working on getting everything running on one platform, so that work is still in progress.

Okay, I wanted to say a little bit about the team that did this. The work we're talking about, running Watson on Mesos, was done by the Watson platform team. This is a group of people who were originally the cloud arm of Watson and ended up delivering foundational services to all the Watson services and applications. They have microservices architecture skills and resources; these folks are good, they're developers, basically, with strong technical capabilities. They currently have an ecosystem of about 40-plus microservices running there. The customer-facing ones are a subset of that, but there are about 40 right now. It's running on a mixture of containers and VMs, managed by Mesos, Marathon, and Netflix OSS, on VMs and bare metal.

I put together a little timeline of the Watson developer cloud. As I said, it started in 2013; it launched in November 2013. That's only three years ago. My point here is that this all happened on a very short timeline. The very first services came out three years ago. Two years ago, in December 2014, there were eight services in beta. Within the year after that, we had some issues, which I'll talk about, and we started moving to Mesos. Then one year ago, in September 2015, we had our first Watson services running in production with Mesos and Marathon, on 50 nodes initially. So we only started a year ago, and we've rolled out a bunch of things since then. Today, 25 of those 40 services are actually running on Mesos and Marathon, on 1,000 nodes. We went from 50 to 1,000 nodes in the course of a year.

Okay, so now diving into this a little bit: what does the Watson developer cloud look like, or at least the first iteration, going back two years? What we started with was the Netflix OSS stack, so we have Zuul for filtering and Eureka for service discovery, running on VMs, using Asgard to deploy. This actually worked relatively well for most of the services, so we were able to manage it, and this was really while most of these services were running in beta. But one of the services hit a problem. Our Retrieve and Rank service is based on Apache Solr, which is a search engine that requires loading up a lot of data. Because of that, it's not multi-tenant: each tenant requires its own instance of Solr. So what we were doing was deploying VMs for each tenant.
Now, the problem with this is that we want deployment to be fast. When a new customer signs up for the service, we want to enable it right away. But deploying these VMs was slow. First of all, there was a manual step, because we were using Asgard, which is mainly a GUI: the request would come in, an operator would have to pick it up and provision the VMs, and so on, and the customer had to wait. That wasn't good, and we wanted to speed it up. We tried automating the VM deployment: we tapped into the Asgard API and ran scripts to drive the deployment. It still wasn't fast enough. VMs are still pretty big; they can take minutes to deploy, and on a cloud platform there are other variables that can make it take longer. So this still wasn't good.

That kicked off an investigation into containers: can we run Solr, can we run this Retrieve and Rank application, in containers? This discussion happened about a year and a half ago, in 2015. At that time, there were really two options for orchestrating containers. One option was Kubernetes, and the other was Mesos and Marathon. The Kubernetes community was pretty much brand new, so the team felt it was really young, and the community felt divided. There were also some issues with using overlay networks and how that would fit with our existing infrastructure around Zuul and Eureka. The other option was Mesos and Marathon, and that was actually quite solid: easy to get started with, it fit with the existing Netflix OSS stack without too much adaptation, and it had a strong community, good production support, et cetera. So we chose that.

So last year, when we went into production in September 2015, this is what our stack looked like: basically 100 VMs initially, on 50 nodes, with Docker 1.7, Mesos 0.23, and Marathon 0.9; that's all. And this is where we are today, which is pretty much where we were then, just with a little more detail. You can see the API endpoint at the top; we've got a DataPower edge device that provides our public IP, and we have an LDAP server there. A request comes in and goes to Zuul; Zuul does the filtering and passes the request on via Ribbon. At the bottom here, this is the existing case: we still have some stuff running on VMs and bare metal, and the rest is running in Docker with Mesos and Marathon on bare metal. We have our Eureka discovery service there. Basically, we're trying to put as much as possible onto a common resource manager and container cloud.

Now, I told you that the Retrieve and Rank service, based on Apache Solr, was the thing that initiated this whole discussion about containers; speeding it up is the reason we went to containers. But moving to containers, we knew we also had a risk. Because Marathon is a dynamic scheduler for resources, things can move around, and that wouldn't be good when we're putting data into the local data store we're using. And I should say that we did look at whether we could use remote storage, and that was not a good option here, because what we were told was that Solr would not perform well with remote storage. We needed local storage; that was the constraint. Based on that, the solution the team came up with was to use SolrCloud. SolrCloud provides mirroring between two Solr instances, and we make sure those two instances are on separate physical machines. The logic is that if one of the machines goes down, we can still recover on the other machine, rebuild the data, and maintain high availability.
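To make that concrete, here is a minimal sketch, not the team's actual tooling, of what creating such a mirrored per-tenant collection looks like through the standard Solr Collections API. The endpoint, collection naming scheme, and parameter values are assumptions for illustration:

```python
# Hypothetical sketch: create a per-tenant SolrCloud collection with two
# replicas so the index survives the loss of one physical machine.
import requests

SOLR = "http://solr.example.com:8983"  # assumed endpoint


def create_tenant_collection(tenant_id: str) -> None:
    """Create a one-shard collection with replicationFactor=2 via the
    Solr Collections API."""
    resp = requests.get(
        f"{SOLR}/solr/admin/collections",
        params={
            "action": "CREATE",
            "name": f"tenant-{tenant_id}",
            "numShards": 1,
            "replicationFactor": 2,  # two copies, kept in sync by SolrCloud
            "maxShardsPerNode": 1,   # forces the replicas onto different nodes
        },
        timeout=30,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    create_tenant_collection("example-tenant")
```

With a replication factor of two and at most one replica per node, SolrCloud keeps the two copies in sync, which is what the mirroring I just described relies on.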
So this was the hope. And as they say, hope is not a strategy; we'll see why in a second.

Okay, so now I want to talk about some of the outages that we had. You can't get through a year without some growing pains. There are three outages I want to go over. All of them are resolved, obviously; we found workarounds and managed to deal with them. But I think it's instructive to go through them for anybody who's looking at deploying a new Mesos and Marathon container cloud, to understand the kinds of skills required to support the kinds of issues you can face. At least, these are some of the tougher ones.

All right, the first one was the ZooKeeper znode issue. What we encountered is that we were working with the system, adding new applications, and, this is particular, again, to Retrieve and Rank using Apache Solr, adding new tenants with new instances. As they came on, at some point we hit a problem where we basically had no communication; we couldn't communicate with Marathon. It turns out the reason is that Marathon stores its application information, the application metadata and environment variables for each of the applications it runs, in ZooKeeper, in one znode, which has a maximum capacity of one megabyte. Now think about it: we started to have thousands of applications. If you have a thousand applications and each one has a thousand bytes of metadata, you've already got a megabyte. So we hit this limit and it caused an outage. We had to quickly diagnose it and upgrade to a version of Marathon that had some compression, which improved the situation. I know they've been working on this and there are further enhancements coming, but this was something that hit us as a surprise, and we had to work it out. I'll show a small sketch of how you might watch for this in a moment.

This next one isn't an outage, but I wanted to point out something interesting that we've also had to deal with as we've grown. Because this Retrieve and Rank service creates a new set of instances for each tenant, we have thousands and thousands of them. It's very hard to see, but what this slide shows is the Mesos frameworks table, and you can see there are six frameworks here; they're all Marathon. We had to limit our Marathon instances to a thousand applications each, and there you can see they're all at a thousand. This is because, even if you get past the ZooKeeper znode issue, Marathon really can't handle thousands and thousands of applications. We started to have problems; when we hit 3,000 it was absolutely not working, and at 1,000 it's kind of variable. So we've basically limited it to 1,000 applications per Marathon, we have multiple Marathons running on top of Mesos, and we have some load balancing between them to distribute the applications.
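Here is the monitoring sketch I mentioned: a hypothetical check, using the kazoo ZooKeeper client, that warns before Marathon's state approaches ZooKeeper's default one-megabyte znode limit. The quorum address and the /marathon/state path are assumptions; the actual layout varies by Marathon version:

```python
# Hypothetical sketch: warn before any of Marathon's state znodes
# approaches ZooKeeper's default 1 MB limit (the jute.maxbuffer setting).
from kazoo.client import KazooClient

ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"  # assumed quorum
LIMIT = 1024 * 1024
WARN_AT = int(0.8 * LIMIT)


def check_marathon_state(root: str = "/marathon/state") -> None:
    zk = KazooClient(hosts=ZK_HOSTS)
    zk.start()
    try:
        for child in zk.get_children(root):
            path = f"{root}/{child}"
            _, stat = zk.get(path)  # stat.data_length is the payload size
            if stat.data_length > WARN_AT:
                print(f"WARNING: {path} is {stat.data_length} bytes, "
                      f"{stat.data_length * 100 // LIMIT}% of the znode limit")
    finally:
        zk.stop()


if __name__ == "__main__":
    check_marathon_state()
```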
So this is the second outage. Remember what we did for Apache Solr: we have mirroring, two instances for each tenant running on two different nodes, and that's supposed to protect us. What happened to us is that we essentially lost contact with all of our containers.

The way this happened is that the network we were running on wasn't always reliable, and we had an outage where Marathon and Mesos were not talking to each other; the connection was lost for a significant amount of time. We think this is the cause of it. After that significant amount of time, the connection was re-established, but when Marathon reconnected with Mesos, Mesos thought it was a new Marathon and gave it a new framework ID. So now we have a Marathon running with a new framework ID and all these original containers still running under the old ID, so they can no longer be managed by Marathon; they're basically orphaned. We didn't know how we got into this situation, but because this was running in production, we had to deal with it. We tried to hack the ZooKeeper registry directly and reprogram the framework IDs for each of our containers. It worked temporarily, but eventually everything got rescheduled.

Essentially, this is the problem with stateful services: using stateful services, you have risks like this. To get out of it, we had to do a bunch of manual work; we had to move all the data around to get it to the right containers after they were rescheduled. And to prevent this in the future, we had to develop our own pinning functionality, so that from then on a container would always start on the same node as its data. We had to develop that infrastructure ourselves on top of Mesos and Marathon, because Marathon doesn't support persistent volumes yet, or rather, it supports them only in beta.
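For flavor, here is a minimal sketch of the simplest form such pinning can take, using Marathon's built-in hostname constraint. The team's actual pinning layer was custom; the app ID, image, paths, and hostname below are made up for illustration:

```python
# Hypothetical sketch: pin a stateful container to the node holding its data
# with Marathon's hostname constraint, so a restart lands next to the index.
import requests

MARATHON = "http://marathon.example.com:8080"  # assumed endpoint

app = {
    "id": "/solr/tenant-example",
    "cpus": 2.0,
    "mem": 4096,
    "instances": 1,
    "container": {
        "type": "DOCKER",
        "docker": {"image": "example/solr:5.3"},
        "volumes": [{
            "hostPath": "/data/solr/tenant-example",  # local index directory
            "containerPath": "/var/solr",
            "mode": "RW",
        }],
    },
    # CLUSTER on hostname restricts every task of this app to one node,
    # so the container always restarts where its on-disk data lives.
    "constraints": [["hostname", "CLUSTER", "node-17.example.com"]],
}

requests.post(f"{MARATHON}/v2/apps", json=app, timeout=30).raise_for_status()
```

As the lessons below note, this kind of pinning is brittle: if that node dies, the app can't be scheduled anywhere until someone moves the data and rewrites the constraint.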
So this is the third and last outage I wanted to cover. This one is also on the extreme side, in the sense that it's a race condition. Because of our network problems, we had two network splits that happened pretty close together. When you have a network split with Mesos, if you still have a quorum, it will elect a new leader, so each time it elected a new leader. What happened is that one of the leaders that was disconnected in the first round became leader again later, but while it was disconnected it missed a bunch of state updates, so it had stale state. At the end of all this, when all these Mesos masters tried to talk to each other and reach consensus on the replication log, they couldn't; it got into a state where they would continually disagree about what the state was and continually elect a new master. The cluster was basically unusable. We had to kill all the replication logs and reboot the whole cluster to start it up again. Because we were able to track down this scenario pretty precisely, we were able to reproduce the problem. One of the other things we did: we knew that our network stability, and network bandwidth, played a role in this, so we also upgraded our network and moved our machines to bare metal that could support a faster network; that was also part of the mitigation. But again, this is the kind of issue you can encounter when setting up this kind of cloud.

Okay, so now I have a bunch of slides where I want to switch to lessons learned, the takeaways from the experience the Watson team had. The first one is about stateful services, and I know there's a talk going on, maybe next door, about how stateful services are difficult; I think we agree. The message here is: if you can work with centralized storage, if you can afford SSDs and a fast enough network, then it's better to do that than to have local data. If you have to have local data, you have to be aware that you're going to have challenges. A distributed file system can potentially help with that, but the current solution of pinning your workloads to the particular machine that has the data is kind of a hack, and it's brittle in the sense that as you scale up, you're more likely to have failures and more likely to have an issue with that pinning strategy.

The next lesson is: just have a fast network. Mesos is very chatty; there are a lot of logs being transferred and a lot of messages. As a result, we found that even with a relatively small cluster, a 1GbE network was not enough; it would often get congested. Really, you pretty much want to get onto a 10GbE network, and if you can't, do what you can to reduce the traffic and manage your logging better, and you can extend your runway.

This one is about the design of your microservices. As you saw, the way this whole process started, Watson was originally running on VMs; the first set of services were all designed for VMs. So when we moved to this container-based microservice architecture, a lot of the development teams simply ported their VMs over to containers, and that doesn't work very well. Those containers tend to have excessive memory requirements, and there are additional processes running that really should be split out, which also makes it difficult to optimize your scheduling of resources. This is something the Watson platform team has ended up spending a lot of time on: training the development teams, conveying what good microservice architecture looks like and how to minimize the size of your containers, so that the whole system is more flexible and more resilient.

Another lesson we learned was about workload scheduling. As the system grew bigger, there were more challenges in trying to get good utilization out of it. When you're running workloads, you specify your CPU and memory, and you know that your machines have CPU and memory limits, but it is not easy to get the high-memory tasks to run on the high-memory machines; at least, these capabilities aren't supported out of the box in Marathon. The long-term solution here is more advanced workload scheduling capabilities, which is something the team I work on, IBM Spectrum Computing, has a lot of experience in and wants to help the community with. In the short term, we're considering how to use roles and attributes to place the right kinds of workloads on the right kinds of machines, at least at a coarse-grained level.
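As a sketch of what that coarse-grained placement might look like (the attribute name, values, and app definition here are assumptions, not the team's actual configuration): agents can be started with an attribute such as --attributes="memclass:high", and a memory-hungry app can then be steered to those machines with a Marathon constraint:

```python
# Hypothetical sketch: steer a high-memory app onto agents started with
#   mesos-agent --attributes="memclass:high"
# using a Marathon LIKE constraint on that attribute.
import requests

MARATHON = "http://marathon.example.com:8080"  # assumed endpoint

app = {
    "id": "/examples/big-memory-service",
    "cpus": 4.0,
    "mem": 24576,  # a memory-hungry task
    "instances": 3,
    "container": {"type": "DOCKER",
                  "docker": {"image": "example/service:1.0"}},
    # LIKE matches the agent attribute against a regex, so these tasks
    # only run on offers from the high-memory machines.
    "constraints": [["memclass", "LIKE", "high"]],
}

requests.post(f"{MARATHON}/v2/apps", json=app, timeout=30).raise_for_status()
```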
The next lesson is about auto-scaling. Marathon doesn't support auto-scaling of containers out of the box either. As workloads increase, if you have containers that are serving multiple users and the users start congesting them, there's more activity on your containers, and you'd like to be able to spawn more of them. You'd like this to happen for each microservice individually, so that you can trade activity off between different services. Currently we don't have this implemented. We looked at some options. There is the Marathon-LB project, which is a potential solution; the problem is that it didn't work for us, because it's not compatible with the Zuul infrastructure that we have: it depends a lot on HAProxy. We also looked at Aurora, but the team wasn't really ready to move to Aurora, because it didn't support a REST API and the team wasn't ready to use Thrift. So this is an area where Marathon gives you nothing out of the box; we're still working on our solution, so it's something to be aware of.
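The shape of a hand-rolled alternative is simple enough to sketch. Here is a hypothetical per-service scaling loop against Marathon's REST API; get_request_load is a placeholder for whatever metric you actually poll (queue depth, latency, requests per instance), and the thresholds and app ID are made up:

```python
# Hypothetical sketch: a per-service autoscaling loop using Marathon's
# /v2/apps REST API. Replace get_request_load() with a real metric source.
import time
import requests

MARATHON = "http://marathon.example.com:8080"  # assumed endpoint
APP_ID = "/examples/busy-service"
MIN_INSTANCES, MAX_INSTANCES = 2, 50


def get_request_load(app_id: str) -> float:
    """Placeholder metric: average requests/sec per instance."""
    return 0.0  # wire this to your monitoring system


def current_instances(app_id: str) -> int:
    resp = requests.get(f"{MARATHON}/v2/apps{app_id}", timeout=10)
    resp.raise_for_status()
    return resp.json()["app"]["instances"]


def scale_to(app_id: str, n: int) -> None:
    # PUT with a new instance count triggers a Marathon deployment.
    requests.put(f"{MARATHON}/v2/apps{app_id}",
                 json={"instances": n}, timeout=10).raise_for_status()


while True:
    n = current_instances(APP_ID)
    load = get_request_load(APP_ID)
    if load > 100 and n < MAX_INSTANCES:    # scale out under congestion
        scale_to(APP_ID, n + 1)
    elif load < 20 and n > MIN_INSTANCES:   # scale back in when idle
        scale_to(APP_ID, n - 1)
    time.sleep(60)
```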
Next, metrics. Metrics are very important when you're running a cluster in production, especially as it gets larger. There are three sub-points here. The first is alert de-duplication: when you're doing maintenance on your cluster and you have a lot of different services running on a single machine, then if you take that machine down to upgrade it, you're going to get a lot of alerts from all of those services. There's a lot of duplication, it becomes difficult to figure out what's going on, and you get alert fatigue. We're starting to look at Prometheus, which has good capabilities for alert management, for de-duplicating and reducing the noise so you can understand what's happening; we think that's an important step to take. Second, we'd like to manage containers better, so we want to understand the actual CPU and memory usage going on inside each of the containers. More capability there is very important, so that you can manage things like over-provisioning and make better scaling decisions. And third, capacity planning: in Mesos you really only have high-level metrics, but you actually want to understand where all these workloads are placed and what their actual utilization is, so that you can make better decisions about capacity. Is it an issue of just adding more nodes, or of being more efficient in your scheduling?

Regarding deployment automation, two points. We started using Ansible to deploy our services onto the nodes, and we discovered that as the cluster started to get quite large, large, elaborate playbooks can take quite a while; rolling out a patch or a change to a big cluster with this kind of push model is slow. So we are now packing more things into a Debian package, downloading that, and having the Ansible playbook trigger it, rather than putting too much into Ansible itself. That's working for now, but you might also look at more pull-style technologies when you get to a bigger scale. And then maintenance: the maintenance primitives in Mesos are not supported in Marathon yet, so one of the things we had to do is hack our own maintenance strategy. We basically drained nodes by getting Zuul to filter traffic away from them and by reserving their resources in Mesos so that nothing new would be placed on them, and then we did our maintenance that way. So the absence of good maintenance support in Marathon is something you have to think about, and something you have to be prepared to do a little work around.

All right, this is my last lesson learned, and it's about chaos testing. This is a strategy from Netflix, and we found it very useful: basically, using testing frameworks and chaos monkeys to cause some disruption in your production or pre-production environment for testing purposes, to see if all these tools will really work. A couple of the scenarios we test: we load up the traffic in a pre-production environment, take out 25% of the nodes, and see how well the system handles it. Can it keep up its SLA, and what's its recovery time? Similarly, another test we do is to take out a very busy node. We've developed this into a Jenkins script which each of our teams can now run self-service, so it's something they can all do before deploying to production.

And that's basically it. The takeaway from this is that this has been a very successful exercise in adopting this technology. I focused a lot on the challenges we had, because that's what we've learned from, but we've gone from 50 to a thousand nodes in a year, and we calculated that we achieved four nines of availability for every month this year except one, which was when we didn't have enough capacity. We're running smoothly in production, and all of the Watson services we're offering are expanding quite rapidly. Having good architecture and development skills is really important for a team that wants to support this open source technology in-house in production; you need that, because you have to be prepared to deal with some pretty tricky situations. Our advice for new users: be careful with stateful services until, hopefully, there's better support for them; get the fastest network you can; make sure your developers know how to design proper microservice architectures; know that there's more work to be done on scheduling, auto-scaling, metrics, and maintenance; and do chaos testing. I think that's it. Speaking on behalf of IBM, we're very happy to be participating in this community, and we look forward to working on some of these problems to help make Mesos better and smoother for all users in the future. Thank you.