All right, good morning, everyone. Welcome, all of you. I wanted to jump right in with some introductions; we can just go down the line here. All of the companies represented up here, Twitter, Yelp, Uber, and Netflix, have people at the conference giving talks, and we know a lot about the clusters and the infrastructures that you're running. So as part of your introduction I wanted to ask, not necessarily what your companies are doing, but what it is about you personally, in your career, that brings you up here on stage for a panel on the future of cluster management. Can I start with you?

Hi, everyone. My name's Ian. I think what is unique for me is that I started off with containerization very early on, with jails on FreeBSD many, many years ago, before joining Twitter. I joined the Mesos team as it was just being developed and growing, so it was well underway, but we saw Mesos grow to take on much of Twitter's workload, and I've seen it scale through probably almost an order of magnitude inside the company. Now we're at the point where we're looking to bring new workloads onto the platform as well, so it's very exciting for me.

And I'm Sam Eaton. I'm very loud, apparently. What do I bring to the panel personally, apart from British charm? I've spent the last twenty-plus years working professionally in the area of operations and infrastructure, back in the days when they used to call it systems administration. Like Ian, I've also worked with FreeBSD jails and Solaris Zones and seen containerization evolve over time. At Yelp, we've been working with Mesos for the last four years, and at this point it runs all of our production infrastructure and all of our software build and test. So we've seen it grow and change, and seen the awkward rough edges. It's been great.

Hi, my name is Jitang. I work at Uber. When I joined Uber, less than three years ago, we started working on bringing containers into Uber without cluster management. It was a fun period, but we quickly realized, OK, we cannot let this keep going, and that's why we started introducing Mesos at Uber about a year and a half ago. It's been a very big success; we have been quickly migrating most of our production traffic onto Mesos, and we are also learning a lot of things, so maybe I can share some of that here.

Hello, my name is Sharma Podila, from Netflix. I've been there about four years, mostly focused on designing cloud-native schedulers that run on top of Mesos for mixed workloads: stream processing, batch, and service kinds of workloads. Before that, I spent quite a bit of time developing schedulers for HPC-like batch workloads. A long time ago it used to be called compute farms; nobody called it the cloud. So yeah, I've been doing cluster management for a while.

Great, thank you. So for our next question: my own background is very strongly open source. I've been working in open source communities for over 15 years, starting with Linux distributions, then most recently working on OpenStack and now Mesos. I like to say one of the reasons I joined Mesosphere is because I really believe in open infrastructures, but I am incredibly biased. So I wanted to ask all of you, where do you see open source playing a role in this community that we've built and in the industry today? How important is it? Is it essential? We can start again.

Sure, well, mine will be short.
When I think about open source, when we think about open source at Netflix, I like to borrow the phrase that we stand on the shoulders of giants. We see tremendous value there, and where we see value in adding to it, we contribute; there's plenty of open source from Netflix on GitHub. Related to Mesos itself, we open sourced our scheduling library, called Fenzo, a couple of years ago. So we love it.

I'll quote my colleagues: Uber has pretty much been built on top of open source from the very beginning. In pretty much every bit of Uber's infrastructure there's some piece of open source software. One really great point of open source is that we can use existing software that's already well tested. We have been growing at a crazy pace, and we are barely catching up with our own growth, but with open source we get to say, okay, we know this thing will scale for the next couple of quarters. At the very least that gives us breathing room to focus on other aspects, on making it work with the other parts of Uber's infrastructure.

And from our side, I think it's pretty hard to stand on stage and go, no, open source is a bad thing. That would be a hard argument to make. Without it, I don't think any of our companies would exist or be as successful as they are. Working on the kinds of things we do, contributing back to open source is a vital part of it. So I'm not going to stand on stage and say it's a bad thing.

I can say that there are some trade-offs with open source, but on the good side, open source brings a lot of diversity in terms of everyone's perspectives: what their workloads actually are, how they think about problems, and also where they are in their stage of development. We, for example, started off a long time ago, and we're seeing other companies now that are just starting. Seeing all of those differences and different perspectives really helps to create a much healthier community.

Great. So next I wanted to talk broadly before diving into specific themes. What are some of the challenges that each of you faced while you were moving into these really large cluster management workflows? And perhaps for others in the audience looking to do this, is there anything you did during this transition that was really hard to undo and that you wish you hadn't done? We can just jump around, whoever has an answer.

I think some of our challenges when we first started doing this come from the fact that we started working with Mesos and containers back before the 1.0 release of Docker, and it was very flaky at that point. The promise was definitely there, and it's way more stable these days, but the early days were extremely painful, and there was a lot of, did we make the right choice here? Do we really want things to crash this often? There was a lot of persistence required. And in terms of choices made early on in the process, it's the peril of early adoption with anything: there are choices you make, and later on the market matures, new products appear, people produce things that have fewer rough edges, and if you'd waited then maybe you'd have had an easier time, but you wouldn't have been able to get done the things you've done.
So it's always that trade-off between adopting early and living with the pain, or waiting for a more polished thing to appear.

Yeah, we had much the same thing. For example, we don't actually use file system images, we don't use Docker images or anything, because when we built out our infrastructure we predated that; there was no real formalized concept of packaging up all your dependencies into an image. We've recently tried to address that, and we have a talk from Santosh today about our challenges there, but we still haven't actually adopted file system images for our containers, because it's very painful for us to move away from our current model where we run on the host file system.

So I'm wondering, is there anything you dove into really early, technology-wise, that you would rather have waited on? Well, I don't think we could wait. We had to move fast and we had to build out the infrastructure. Yeah, we just couldn't wait. It was possible because, for example, we run most things either on the JVM or Python, so we made choices in the organization to say, we can't quite solve this problem the way we want to, so we'll choose this other way of building out our infrastructure.

Yeah, besides the technology challenges, first we had to spend a little bit of time trying to define what the business value is, maybe more so for us at Netflix, because before we moved to containers we already had microservices running on the EC2 cloud using VMs. There was a service scheduler with autoscaling and all of that, and we also had a good, robust discovery mechanism and all of these things happening. So it was partly that, and then there were specific challenges technology-wise: making sure there's capacity for the applications when they need to run, and trying to prove parity with what already existed.

I think one of the challenges in this space is that we moved from non-cluster-orchestrated workloads into cluster management. While that brings a lot of order and guarantees around quality, we actually saw utilization start to drop, because we were no longer able to pack things as crazily and pretend there were no quality problems at all. So we are still working on trying to bring the utilization curve back up. We don't necessarily need to run things as hot as in the past, but we do want to get the machines more utilized and save the company a lot of money.

So, getting to our first future-looking question: what do you think our biggest challenge is moving forward? Just something quick and small, like your biggest pet peeve perhaps.

Better isolation. The killer for us a lot of the time at the moment is noisy neighbors: tasks that don't cooperate or co-locate very well together. Further advancements on that will definitely help us. I would say predictability. It touches upon some of the points there, but predictability in the aggregate of resources as well as down to an individual host.

All right, so I want to start getting into some themes here, and the first one is around maintenance. I worked on OpenStack for four years, and one of the really hard things about working on OpenStack is that it's got a lot of moving pieces, so when something goes wrong, it's really hard to track it down. So when I started working on Mesos, that was the first thing I went to: how hard is it to debug problems?
Because as a community, we've done a really good job on the deployment story: we can deploy Mesos and we can get a system running. But once it's running, we need to be able to debug it, find problems, and fix everything. So, broadly speaking in this space, with deployment behind us, what are you doing outside of the cluster, maintenance-wise, that you really wish were inside of the cluster? Are there tools you've built to manage things that you really wish were part of Mesos?

Yeah, so we have lots and lots of tooling outside of Mesos, and not only tooling itself but also a lot of tribal knowledge in our engineers about how to keep things running smoothly. There are a couple of things there, I think. One is around the general health of the agent itself. It's offering an execution environment and some resources for the jobs to run in, but those need to be validated when the machine or the VM comes up, and they need to be maintained; maybe you lose a DIMM or something like that, so that needs to be handled. A couple of other things: we generally do rolling updates across the cluster when we update Mesos, and that's all coordinated outside of Mesos. You know, we may deploy to 1% and test out some new feature that we've got before we go more broadly across the cluster. But yeah, we do have an awful lot of tooling outside.

For us, a lot of the extra tooling we've built revolves around scaling the clusters up and down. We elastically scale on AWS a lot; we use Spot Fleet heavily, and my colleague Kyle will be giving a talk about that later if you want to hear more. But that's meant we've had to build a lot of tooling around taking machines in and out of the cluster, understanding the cluster size, and dealing with when we want to scale it. That would be much easier if the cluster manager were more aware of what workload the frameworks were demanding from it; if we knew whether there was more work that it needed more capacity for, we could do a better job of predicting its scaling demands.

On this point, beyond maintaining Mesos itself, we have to do a lot of custom work on top of that. And besides that, we also want to help other infra teams, like the routing teams and monitoring-related teams, manage their software and utilize all the great primitives we provide. The thing is, because these things are relatively at the same level, there can be very strict dependency relationships between them. If you are not careful, you can't simply run these things as applications, but you also don't want them to go to bare metal and run completely as their own silos. So I think we've created quite a lot of work for ourselves helping these teams with special cases, and I hope cluster management can find a solution for that too.

All good points. I'll just add one more point that's specific to us, which is that we have integration into our cloud provider itself, code that's specific to the agents, and when we introduce new versions of those we do have some pain, because there's such a rate of change in that code that introducing new agents can actually introduce bad ones. There's definitely pain in those.

All right, so I'm going to put you on the spot a little bit with this next one, because it's very related.
You know, what things do you want to see built in, and can you perhaps play a role in that? Since you all have this external tooling, maybe there are some things you want to open source? Come on.

So I'm going to say something, not answering your question. I think one of the difficulties here is that we've all developed these things separately, and I think that's out of necessity; these are very pressing problems. So one of my concerns is, how do we come up with generic ways of solving these problems? That's one of the challenges for us: we have one way that we think we can do this, and this is back to my diversity point. I think we need to increase communication around what people are doing, which is what we're doing here, and try to find a way where we can come to some sort of consensus on what these common problems are. What he said.

All right, we just created another working group, right here on stage. I'm not going to do that. All right, thank you. So I want to transition for a moment away from the underlying infrastructure and talk a bit about workloads. The topic of mixed workloads comes up a lot: you've got a lot of different types of things running in your infrastructures, and how do they play well together? So whoever wants to chime in can talk a little bit about the types of workloads you're running that are perhaps not really aligned with each other and how you're handling those, and then what the state of the art is with regard to handling different types of services and workloads.

State of the art. Well, I'll start with where it could ideally be. People submit workloads, applications come in, and maybe they have a declarative way of saying, here are my service level objectives, and the system takes care of everything, whether it's batch workloads, stream, or service workloads. I would imagine within the Mesos community a lot of people are using things like quotas, roles, and reservations to achieve some of this, and we've seen limitations in their current form. There are discussions on improving them, and I think some awesome features will be coming up, but we introduced some of that ourselves in terms of capacity guarantees, so that when you mix workloads, each workload gets what it needs. I think the state of the art is somewhere around that. One thing that's still missing for us is looking at noisy-neighbor situations on the agents, which we touched upon before, and feeding that back in. So somewhere around there is the state of the art, I think, for us.

Yeah, I think the state of the art is that there's a lot of useful functionality, and there's a lot of interesting work going on to make this stuff better. But our dirty secret is, at least for us, the state of the art is statically separating our clusters to eliminate these kinds of problems. That's why we care a lot about isolation. And I suspect that a fair few people are doing the same.

Yeah, we do the same thing. We have our stateless workloads running on one cluster and then our stateful workloads, our storage, or our batch running in completely separate clusters. But I do want to add one thing: even within our stateless workload, we have a concept of production versus non-production, and we've tried to push the state of the art in terms of over-subscribing the non-production workload. These are lots of test jobs, or jobs that are generally idle. We actually found that quite difficult, and we have a talk about that later on as well.
Even something as simple as identifying idle workloads and over-subscribing those resources to run other workloads, and doing that all the way from the agent up to the framework, we found to be quite difficult.

To echo that point, we are running different workloads in the same cluster, but it's pretty much a static partition across different agents. And even in this situation, we see that sometimes the convergence time of scheduling, especially when new workloads come in, becomes unbounded, and we need to do special things. I feel like we need more primitives at the Mesos layer to help with this work, or we need to find some other approach to address it.

Yeah, that's a great segue into my follow-up question. One of the really great things about microservices, which I've heard a lot about this week, is that companies have been incrementally moving over to microservices, transitioning parts of their infrastructure and their workloads over time. But in a case like this, with these mixed workloads where you're doing really hard isolation, can we make changes to improve the situation incrementally, or does there need to be a bigger shift?

I think it's a combination of both; there are incremental changes that we can make. We've identified different workloads inside Twitter that have varying degrees of requirements. Most of them actually have very substantial requirements before they could consider moving to a Mesos-managed platform, and definitely very strong requirements before they could move onto a shared platform where they were co-located with other tasks. But there are some workloads where, for example, they need some control over the placement topology on the infrastructure, where they've only got small requirements that we could address incrementally.

Yeah, I think incremental improvements are possible. We have some basics in the system, and where things are missing it's a greenfield opportunity, so it's easy to add them incrementally. But I wanted to add that beyond the technology challenges, sharing a cluster also has a people side: giving people confidence that their applications are going to get the performance they need even if they're sharing the cluster with somebody else. That's one of the harder problems to solve, on the people side of things.

Yeah, so you touched a little bit on specific changes that we can make. Do we have any other thoughts about how we can improve this isolation against noisy neighbors? When I was working on OpenStack CI, we were totally the noisy neighbors, so sorry.

I think there are two parts to the question. One is, do we even need to run them on the same host? And this is not just mixed workloads; we have noisy neighbors even within our stateless workload. The best isolation is to just run them on separate hosts, if you can identify when there is contention. We have the paper from Google on CPI2, but we're still not at the level where we can say these two jobs contend for resources, so let's just run them on separate hosts. That's the easiest thing we could do.

Yeah, great. So that gets us into our last theme of this panel. Right now it's a human decision whether you want to split these up between this noisy host and that one; you have to split them up based on what you know about the host.
But one of the things we've seen these big clusters used for a lot in the industry is machine learning and artificial intelligence, and that's focused on the workloads. So do you see somewhere in our future, or maybe in some of the work we're doing in the present, where we can take the same tooling, or very similar tooling, to what we're using for that machine learning work on our workloads and turn it back on our clusters? Maybe, as you say, the cluster can determine who is a noisy neighbor and then automatically isolate those. Or maybe you have other examples, some of the work you're currently doing that has you living in the future already?

Okay, I can start. Well, intelligence could mean so many things, right? One of the ways we like to think about it, building upon what I said before: users submit workloads with declarative objectives, and the system figures out how to get it done without needing the cluster operator or the user to input anything other than the objectives. Better yet, people shouldn't have to write their own schedulers with the smarts on how to get it done; the cluster manager should do it all. Intelligence is also in right-sizing the containers. Users have a tough time predicting what resource usage they might actually see, both in terms of the size of a container and the number of containers, right? And intelligence is also maybe having some advanced users give hints, like, yeah, I need two CPUs, but this is how I plan to use those two CPUs, and that's different for different applications. So it's not just the resource specification of what I need, but hints as to how I'm going to use it, which would help intelligent placement of tasks and things like that.

One specific area of intelligence we are starting to look at is how to improve our capacity planning story with cluster management. That may sound obvious to people, but we have been experiencing a fast pace of growth, so planning is pretty hard at this stage, and a lot of it is ad hoc human decisions rather than a systematic, data-driven approach. There's a lot of extremely valuable data in the cluster management layer already that can provide a lot of feedback, so we can move toward a more data-driven approach rather than polling teams.

On our side I'll talk about two things. One of them is the stuff we're already doing on our software build and test infrastructure, called Seagull, where we're learning how long our tests take to execute historically and using that to pack them more efficiently into bundles that take a constant length of time to run, to reduce our overall test run times. That's been very productive, but it's very much built outside the cluster. To propose a more wild idea: we do a lot with auto scaling on AWS at the moment, and in the glorious future, having a cluster scheduler that's aware of things like AWS costs and what the bid rates are for spot instances at the time, and that could scale the cluster in the most cost-effective way and run the right instances for the workloads according to pricing, would be an exciting thing to be able to do. How far away do you think this is? A while.

So I'm a plus one on basically all of those things, not so much the cloud piece because we run on our own infrastructure. The one thing with Twitter is we run a lot of our workload on the JVM, and so I'll give one last pitch.
We have a talk with Joshua and Ramkey where we are working to automatically tune the applications running on the platform: tuning the JVM using black-box learning, across a whole range of parameters where, back to Sharma's point, users just don't necessarily know how to tune the application, and they shouldn't have to know either. Particularly for us, we have a very diverse platform, with different generations of CPUs and different configurations. So we're able to work out the best configurations, the best parameters for the JVM, for our users as part of the platform, which I think is very powerful, and that's stuff we're actually doing right now.

That's really cool. So, to wrap this up with a final question: a lot of the things you're mentioning are, I think, more in the near future. If you could look down the road maybe five or ten years, what do you want your clusters to be doing for you that they aren't doing today, that you haven't mentioned yet?

Ten years from now I'd like to not be thinking about clusters at all. I'd just like to run work and not have to consider where it's running or what it's running on. We just go, we want to run some stuff and we want to do it cost-effectively; make that problem go away.

Yeah, when you run something on your laptop you don't think about it needing X CPU, Y RAM, or anything else; you just say, run it. So for our services, we want our customers to just have to say, run this at this sort of scale with this availability, and go from there. Yeah, I think so. Here's my card, make it happen.

We are still driving our containerization, so maybe the dream is that after ten years we don't even need to care where the workload is running. It's going to run closest to our customers, at the most reasonable location, with the least current utilization, and so on, and our engineers will just say what they need and it will happen magically through the system.

All right, does anyone have any parting thoughts before we wrap up? Just where you think the future is?

I'll just share a quick thought I had earlier: we've come a long way. Several years ago, resource sharing used to happen in an actual meeting of people, and the way I like to describe to others how that went is that everybody comes with a baseball bat and says, this is my quota, you're not going to take it away from me. I think we've come a long way since then, there's a lot more automation, and I think we only have forward progress to make.

All right, thank you everybody.