So, back in the day, things were different. We had mainframes, which were very big and homogeneous. People would think carefully about the workloads they needed to run on those mainframes, carefully buy the machine accordingly, and then run their workloads on it. These machines were very big, and there were special people operating them. But the good news was that everything was homogeneous: you knew the system you were dealing with, and you knew very well the applications you were going to run on it. Somewhere around the end of the 80s and the beginning of the 90s, this changed. People switched from mainframes, which were very big and expensive, to commodity computing, and later to virtual machines. This, of course, came with its own set of challenges. Now we had to slice and dice a large array of machines and think about which applications run where and how we dedicate resources to various customers. And then we had to build some slack back into the system, because these machines failed much more frequently than mainframe components, for various reasons; we needed that slack to make sure that if something happened, we could shift workloads again. So now data center operators had a different set of challenges: how do we keep utilization high? How do we plan for varying capacity requirements, where people pop up more or less at random and say, hey, it's Christmas and I'm running an e-commerce app, I expect five times the usual traffic, so please do something about it? And of course there were different types of failures to deal with that are inherent to distributed systems. And there was, and still is, no standard way of developing and maintaining a distributed system, so what data center operators ended up with was a zoo of systems, each developed in its own way. Everyone was reinventing the wheel, and everyone was facing different problems. What we're illustrating here is that even though the operator might have slack in the data center, if something failed, she needed manual action to bring things back to normal. Here is a screenshot of a real email from a very real company, which right now is one of the largest Mesos operators in the world; when this email was sent, that wasn't yet the case. What's happening here is that an operator sent a message saying, hey, we took down this machine and we aren't sure whether there were any production services running on it, and they were trying to trace the owners of the applications on that machine to figure out whether their services were affected. This is obviously not a situation any data center operator wants to be in. Now, one of the fundamental and most important things we learn as computer science students is that pretty much every problem in computer science can be solved by adding another level of indirection. There is a small footnote on the slide that this is true for everything except those problems caused by too many levels of indirection, but that's outside the scope of our talk today. So: enter Mesos.
So Mesos introduced an architecture where we have a highly available master, a set of agents, and a bunch of coordinators, or schedulers, also known as frameworks, running on top. If you really think about it, fundamentally this is just another level of indirection. The indirection we introduced here is that the Mesos master is itself a scheduler, but it deals with raw resources: it has no knowledge or understanding of the actual applications running on the resources it manages, and it outsources the business logic, the application-specific scheduling and resource allocation decisions, to another level, which is the coordinators or schedulers. The idea is that no machine in the data center should be special; we should treat machines as cattle, not pets. By doing so we can allow various configurations of which application runs on which machine, and the big, fundamental difference is that these decisions can now be taken by software rather than by a human. So in essence, what Mesos allows you to do is treat the entire data center as a new form factor. Before, we were used to dealing with mainframes, machines, phones, tablets; what we're saying is that there is a new form factor and a new platform that we should be thinking about when we develop our applications, and that platform is the data center itself.

Yeah, so as Artem alluded to in his previous slide, Mesos gives you this nice level of abstraction that allows a data center operator to view the entire cluster as one big computer. So let's take a step back. From now until the end of the talk I will be referring to the old world, where you used static partitioning, as just the old world, and to Mesos as the new world. Fair enough? If you actually look at all the problems a data center operator has to face, these problems need to be tackled by an operator irrespective of whether they use the old world or the new world of Mesos. It's possible that some of the problems are much easier to tackle now, but there may be some problems, like debugging, that are harder to tackle with Mesos, and there may be a new class of problems, like host heterogeneity, that didn't even exist in the old world. So what we'll do is go over all the points mentioned on this slide one by one and compare how things look in the old world versus the new world.

So let's start with deployment. What does deployment mean in the old world compared to the new world? If we look at deployment, we can subdivide it into two main problems. First, you need to deploy your source of truth, that is, Mesos and the metaschedulers like Aurora and Marathon. The second part of the problem is how you deploy the applications themselves. There is a common misconception in the community that if you have VMs, you have already solved the problem of cluster management; that's not the case. You still need cluster management software like Chef or Puppet to deploy Mesos itself. Once you have successfully deployed the thing that deploys other things, then comes the second bit: now you want to deploy the actual applications themselves, but that is a much easier problem, because you already have metaframeworks like Aurora and Marathon that allow you to do that.
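To make that concrete, here is roughly what deploying an application through Marathon's REST API looks like; this is a minimal sketch, assuming a Marathon instance at marathon.example.com:8080, and the app definition itself is made up for illustration:

    # Ask Marathon to run 3 instances of a (hypothetical) app.
    curl -X POST http://marathon.example.com:8080/v2/apps \
      -H 'Content-Type: application/json' \
      -d '{
            "id": "/ecommerce/frontend",
            "cmd": "./start-server --port=$PORT0",
            "cpus": 0.5,
            "mem": 256,
            "instances": 3
          }'

Marathon then finds agents with enough free resources, launches the instances, and restarts them elsewhere if they die.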
So roughly speaking, the problem of deployment is almost the same in the old era as in the new era, except that in the new era things are a bit more powerful: you get dynamic placement for free. In the old world, with static partitioning, as a service owner I used to identify some set of nodes, and my service would only ever be deployed on those particular nodes; with Mesos it's a bit more intelligent, it does dynamic placement based on the needs of the scheduler, so you get dynamic placement for free. Moving ahead: now you are able to deploy your applications, but what about the dependencies of the applications themselves? If I have an application A with three other dependencies, and it gets deployed to some other node, how do we manage those dependencies? That's why we invented containers, right? So you should be fine using image formats like Docker or appc, bundling all your app dependencies in a container, and then asking Mesos to run those workloads. As a matter of fact, in Mesos 0.28 we introduced this cool concept called the unified containerizer, which allows you to run Docker or appc images natively. By natively I mean you don't have a dependency on the Docker daemon at runtime; you can just use the Mesos containerizer to launch Docker images. It's a huge win for the community: you no longer have to deal with the instability that gets reported on the mailing lists so often, that Docker is not stable, Docker is not doing this, and so on. You get all of this for free.
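As a rough sketch of how an operator turns this on, the agent is started with the Mesos containerizer plus the Docker image provider; the ZooKeeper address and work directory here are placeholders, and the flags are as of roughly Mesos 1.0:

    # Launch Docker images natively, with no Docker daemon on the host.
    mesos-agent \
      --master=zk://zk.example.com:2181/mesos \
      --containerizers=mesos \
      --image_providers=docker \
      --isolation=filesystem/linux,docker/runtime \
      --work_dir=/var/lib/mesos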
So let's move on. After deployment, the next big challenge that operators have to face is monitoring: how do you monitor your cluster in the new world? Relying on traditional host-based monitoring won't actually work anymore, because now you have dynamic placement of workloads, meaning my workloads can move across instances. In the old static partitioning world I had some nodes and I knew that those nodes only ran these particular workloads, but that's no longer the case. So you now need to monitor per application, and you need to adopt this new mentality of aggregate monitoring: as a service owner, or for that matter a data center operator, I want a holistic, aggregate view of my world. I want to know how many instances of my application are running across the cluster, and if that's not what I expect, I want to take corrective action. I want to know how much usage slack there is in my cluster, and if there is no slack left, as an operator I need to take action, meaning I need to add more capacity. The other bit, which is extremely powerful, is that starting with Mesos 1.1 we introduced this new concept of task groups, the equivalent of pods in the Kubernetes world, which allows you to launch and run child containers as part of a main container. So you can have an adapter container or a sidecar container that does the monitoring of the main container itself, and whose lifecycle is closely tied to the main container. Another thing to note is that the problem of monitoring is now much smaller in scope in the new world: as a data center operator, my source of truth is Mesos, and I can ask Mesos for the state of the cluster. In the old world it was a needle-in-a-haystack problem: I first needed to identify which host was having the problem and then move my workload off that host; now you get all of that for free.

Moving on: Mesos 1.0 introduced experimental support for event streaming in the v1 operator API. This primitive allows a data center operator to subscribe to new events happening on the cluster. Currently we support four types of events. The first one is TASK_ADDED: every time a task is launched on the cluster, you get a TASK_ADDED event on a persistent connection. Every time the state of a task changes, say it goes from staging to running, or from running to finished, you get a TASK_UPDATED event. In a similar way there are AGENT_ADDED and AGENT_REMOVED events: every time a new agent joins the cluster you get AGENT_ADDED, and every time an agent is removed from the cluster you get AGENT_REMOVED. All the events are streamed over a persistent connection. There is a talk on this tomorrow, so you might want to check it out; it's pretty cool.
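As a minimal sketch of what subscribing looks like, assuming a master at master.example.com:5050: you POST a SUBSCRIBE call to the v1 operator endpoint and keep the connection open; the events come back RecordIO-framed on that same connection:

    # Subscribe to the master's event stream (v1 operator API).
    # Expect SUBSCRIBED first, then TASK_ADDED / TASK_UPDATED /
    # AGENT_ADDED / AGENT_REMOVED events as they happen.
    curl -s -N -X POST http://master.example.com:5050/api/v1 \
      -H 'Content-Type: application/json' \
      -H 'Accept: application/json' \
      -d '{"type": "SUBSCRIBE"}'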
So moving on, we now have to tackle the problem of logging. What does it mean in this new world? The problem of logging is roughly the same: you still need to aggregate logs per application from various instances, so you can keep using your old logging infrastructure as long as your logs get shipped to a central location. In addition to application logs, operators now have insight into the overall health of the cluster: as an operator, it's easy for me to just look at the master logs to see what the state of an application is if the application owner is complaining that their application isn't running fine. So the scope of the problem is reduced substantially in the new world. Mesos by default stores the standard output and standard error of containers in the task sandbox, and you don't get any log rotation by default, which might be problematic for some operators. To address this, we introduced custom logging module support in 0.27: now an application owner, or for that matter a data center operator, can write their own logging module if they're not happy with the default logging solution that Mesos provides.

Mesos 1.2 will introduce new debugging capabilities which allow you to remotely attach to your running container and also launch processes in it. This functionality is the equivalent of docker attach and docker exec in the Docker CLI. It effectively means that, as an application developer, I will now be able to launch a GDB instance inside my container and see why the performance of my container is being impacted, if the need arises. The way it will be implemented is by adding three calls to the agent API. We implemented the new v1 operator API and divided it into two parts, the master API and the agent API; the agent API covers the things you do on a particular agent. So you have these three calls. The first call, LAUNCH_NESTED_CONTAINER_SESSION, is the equivalent of docker exec: you make this call and the agent launches a new nested debugging container for you, in the same namespaces and the same cgroups as the parent container, so you can actually see what is going on in the parent container and launch new commands in that child debug container if the need arises. The lifecycle of that child debug container is tied to the HTTP connection itself: if the connection breaks, the process is going to be killed and the child debugging container goes away with it. The other two calls, ATTACH_CONTAINER_INPUT and ATTACH_CONTAINER_OUTPUT, are the equivalent of docker attach: you can attach to the standard input of the entry point of the container, and, with ATTACH_CONTAINER_OUTPUT, to the standard output of the child container. Hopefully these calls will soon be part of the Mesos CLI too. In the community we have been doing substantial work to redesign the old Mesos CLI; we found that it had been neglected for quite a while and had some missing features, so we are actively addressing them, and hopefully soon the Mesos CLI will be roughly functionally equivalent to the Docker CLI, meaning you get all the cool stuff like docker ps for free, which you can then use on Mesos containers. Hopefully it will be pretty useful for all the operators and developers.
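To give you a feel for the debugging calls, here is a rough sketch of LAUNCH_NESTED_CONTAINER_SESSION against the v1 agent API; the agent address and container IDs are placeholders, and the exact headers and fields may differ in the final 1.2 release:

    # Run `ps aux` inside a running container, docker exec style.
    # The session (and the debug container) lives only as long as
    # this HTTP connection stays open.
    curl -s -N -X POST http://agent.example.com:5051/api/v1 \
      -H 'Content-Type: application/json' \
      -H 'Accept: application/recordio' \
      -H 'Message-Accept: application/json' \
      -d '{
            "type": "LAUNCH_NESTED_CONTAINER_SESSION",
            "launch_nested_container_session": {
              "container_id": {
                "parent": {"value": "<parent-container-id>"},
                "value": "3192b9d1-db71-4699-ae25-e28bfa8097cf"
              },
              "command": {"shell": true, "value": "ps aux"}
            }
          }'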
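And going back to the logging modules for a second, here is a sketch of switching the agent to the logrotate logger that ships with Mesos; the module descriptor path is made up for illustration:

    # Rotate container stdout/stderr instead of letting sandbox logs
    # grow without bound.
    mesos-agent \
      --master=zk://zk.example.com:2181/mesos \
      --work_dir=/var/lib/mesos \
      --modules=file:///etc/mesos/logrotate-module.json \
      --container_logger=org_apache_mesos_LogrotateContainerLogger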
Okay, so now let's move on to host heterogeneity. This is a problem that didn't exist in the old world at all. What it means is that I now have a large cluster and not all nodes on that cluster are equivalent: you might have one beefier node and one frailer node, and one CPU from the beefier node is not equal to one CPU from the frail node; similarly, one memory unit is not the same from one node to the next. So as an operator you might want to tag the resources on the agent with labels, so that they get passed on to the schedulers and the schedulers are able to make good decisions based on them. We are also thinking about introducing a typed resources concept in Mesos itself, meaning that instead of a CPU being just a string, it would be strongly typed, something like a CPUInfo carrying more information about the resource itself. Moving on: in the old era you had VMs that would grow and shrink as the application needed, but now you have containers and you get stronger isolation, so your container will be killed if you exceed your resource limits. This is a fundamental mindset change that a data center operator needs to pass on to application developers: you need to size your containers beforehand or they will be killed. When I started out this wasn't obvious to me, because when you have been programming in the VM world you just assume these things and take them for granted, so it needs a deliberate change of mindset that we have to pass on.

Okay, so, planning for failure. This is an email that was sent by a database administrator to a data center operator, and what the DBA is asking is: hey, I have to launch these two new database instances, can you tell me if these two instances are in different fault domains, meaning they don't share the same rack, the same switches, and the same power feeds? The really unfortunate thing about all this is that it should be software that is intelligent enough to take these placement decisions on your behalf, not human beings, because human beings are prone to error. It should be software that understands all these fault domains and makes an intelligent decision for you.

So how do we handle host and rack failures in the new world? Failures are the norm rather than the exception when you are using commodity hardware. If you are getting paged due to host or rack failures, you are inherently doing something wrong: it means you are using some kind of static partitioning in which you are pinning your services to instances. Don't do that in the new world; the new world is supposed to make your life easier, not harder. You would want to work with service owners to ensure they have proper settings on the load balancers so that you can handle and survive these host and rack failures, and of course keep some spare capacity to get around these failures; if you don't have spare capacity and a rack dies, you will be in a pretty bad state.

As for network failures: always set the agent removal rate limit. If you don't set it and you have a network partition, it's very likely that you will lose all the agents in your cluster, because the master health-checks the agents, and the default timeout works out to 75 seconds. So if an agent is partitioned away from the master for 75 seconds, the master will remove it and forcefully shut it down if it comes back. So make sure you have a proper value for the agent removal rate limit, or you might lose your entire cluster.

Okay, so historically Mesos had a fixed policy for dealing with network partitions: if I am a framework and I have a task running on an agent, and that agent gets partitioned away, the master sends TASK_LOST on my behalf. This is really problematic, because as a framework I now have no way of determining when a task is definitely not running. Compare it with this scenario: suppose I am an operator and I am updating my cluster, so I forcefully terminate an agent; it fails its health checks, and the master would still send TASK_LOST on my behalf. So as a framework I just don't know: is the task not running, or maybe it's still running, and how do I go about handling this? So in 1.1 we introduced a new capability that allows a framework to opt into partition awareness. If, as a framework, I say that I have the PARTITION_AWARE capability, the master will send these additional task status updates back to the scheduler: in the new world, if an agent is partitioned away from the master, the master first sends TASK_UNREACHABLE to the framework, and if the agent comes back in the future, the master sends TASK_RUNNING for the task if it's still active. This allows a framework to be sure that when it does get a TASK_LOST, the task is definitely not coming back.
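For a framework author, opting in is just a capability on the FrameworkInfo at subscription time; here is a minimal sketch against the v1 scheduler API, with the framework name and user made up:

    # Subscribe as a partition-aware framework: the master will now send
    # TASK_UNREACHABLE instead of TASK_LOST for partitioned agents.
    curl -s -N -X POST http://master.example.com:5050/api/v1/scheduler \
      -H 'Content-Type: application/json' \
      -d '{
            "type": "SUBSCRIBE",
            "subscribe": {
              "framework_info": {
                "user": "root",
                "name": "my-framework",
                "capabilities": [{"type": "PARTITION_AWARE"}]
              }
            }
          }'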
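And going back to the network-partition settings for a second, these are the master flags involved; the ZooKeeper address, quorum, and the rate limit value itself are illustrative:

    # 5 failed pings x 15s ping timeout = 75s before an agent is
    # considered gone; remove at most one agent per 20 minutes so a
    # partition can't take out the whole cluster at once.
    mesos-master \
      --zk=zk://zk.example.com:2181/mesos \
      --quorum=2 \
      --work_dir=/var/lib/mesos \
      --agent_ping_timeout=15secs \
      --max_agent_ping_timeouts=5 \
      --agent_removal_rate_limit=1/20mins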
So, I remember that some slides back I showed you that email from a database administrator to the data center operator. To get around those use cases, Mesos introduced maintenance primitives quite a while back, I think in 0.25. As a framework, I would be extremely mad if a data center operator took down an agent accidentally: I might have critical workloads on it and my SLAs would be impacted. To get around this, Mesos introduced these maintenance primitives, which allow an operator to effectively tell the Mesos master: I want to take down this agent for maintenance. What does this mean? If you look at the figure, initially the agent is in up mode, meaning it's active. Now an operator can say: I want to schedule maintenance on this agent. As soon as the operator does that, the agent goes into a draining mode, and the master sends inverse offers back to the frameworks indicating the upcoming unavailability of the agent. As a framework, I can now decide, based on that unavailability, whether I want to move my workloads from this agent to other agents. This effectively means that as a framework I get a choice I can opt into, so that taking down an agent doesn't have to impact my SLAs the way it used to in the old world. And with this I'll hand it back to Artem for the end of the talk.

The next thing is capacity planning. Before, we had to manually partition our cluster and decide which workloads run where. Today Mesos provides a quota mechanism, where an operator can set a certain quota for a particular role, and Mesos will ensure that frameworks subscribed to the master under that role get that much of the resources ahead of everyone else. Similar to quota, there is a mechanism for reservations, which works the same way except that reservations are tied to a particular agent, whereas quota applies across the cluster, and in the case of quota it's Mesos that makes the placement decisions for the framework. The last thing that's new is persistent volumes. This is a mechanism for creating persistent volumes from disk resources, and the idea here is that the lifetime of the volume created when launching a task exceeds the lifetime of the task itself. This means that if we are running something like Cassandra and for whatever reason our executor terminates, Mesos will hold on to the volume that this Cassandra framework created and used, and it will offer it back to the same framework when it resubscribes. These primitives allow us to run stateful workloads, things like databases, on Mesos.
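To make these operator primitives concrete before we wrap up, here are rough sketches against a master at master.example.com:5050; the endpoints are the documented ones, but treat the hostnames, IDs, and values as placeholders. First, scheduling a maintenance window, which is what flips an agent into draining and triggers the inverse offers:

    # Declare one machine unavailable for an hour; start/duration are
    # expressed in nanoseconds.
    curl -X POST http://master.example.com:5050/maintenance/schedule \
      -H 'Content-Type: application/json' \
      -d '{
            "windows": [{
              "machine_ids": [{"hostname": "agent1.example.com", "ip": "10.0.0.1"}],
              "unavailability": {
                "start": {"nanoseconds": 1500000000000000000},
                "duration": {"nanoseconds": 3600000000000}
              }
            }]
          }'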
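Second, setting quota for a role; the role name and amounts are made up:

    # Guarantee 8 CPUs and 4 GB of memory, cluster-wide, to frameworks
    # subscribed under the "ecommerce" role.
    curl -X POST http://master.example.com:5050/quota \
      -H 'Content-Type: application/json' \
      -d '{
            "role": "ecommerce",
            "guarantee": [
              {"name": "cpus", "type": "SCALAR", "scalar": {"value": 8}},
              {"name": "mem",  "type": "SCALAR", "scalar": {"value": 4096}}
            ]
          }'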
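And third, carving a persistent volume out of reserved disk on one agent; the agent ID, role, and principal here are placeholders:

    # Create a 512 MB persistent volume for the "cassandra" role; it
    # survives the task and is re-offered to the same role later.
    curl -X POST http://master.example.com:5050/master/create-volumes \
      -d 'slaveId=0eb6f04d-b4b7-4c88-86ee-5b9e7b04a1c8-S0' \
      --data-urlencode 'volumes=[{
            "name": "disk",
            "type": "SCALAR",
            "scalar": {"value": 512},
            "role": "cassandra",
            "reservation": {"principal": "operator"},
            "disk": {
              "persistence": {"id": "cassandra-vol-1"},
              "volume": {"container_path": "data", "mode": "RW"}
            }
          }]'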
So these are the four principles of fault tolerance as prescribed by Tanenbaum, and we believe Mesos checks all the boxes here. It allows you, in fact, the way its primitives are presented to the user, it all but mandates, that framework developers build systems that are highly available, that are safe, in the sense that they prevent the operator from making mistakes to the extent possible, and that are easier to maintain. If you take two things away from this talk, let them be these. First, no machine in the data center is special: you should be able to use all the data center resources interchangeably, because unicorns are not always a good thing, sometimes you just want sheep. And second, we should outsource as many decisions to software as we can: we have to let software schedule software, handle software failures, take care of utilization, and also help us when it comes to maintenance. Thank you.