All right, today I should learn some Chinese, but I don't know anything yet. It's my first time here, so it's quite exciting. Just to make sure, I'm going to try to speak slower, because I suspect there are a lot of non-native English speakers here. But I do have a lot of material, and my normal tendency when presenting is rapid fire, so don't hesitate to slow me down. I'm assuming I'm being translated, but I don't see anybody with... oh, there's one headset. Okay, so there might be a delay there. But first of all, I'm Ken. I guess I should click this... are we on? There. All right, gotta hit the right button. I'm Ken. I am one of the distributed application engineers at Mesosphere. I've been with Mesosphere for over two years; I was employee number 22. Prior to coming to Mesosphere, I was heading research and development at a company called Savvis, which was part of CenturyLink in the U.S., the third largest ISP. They asked my team to figure out where the cloud was going to be in two to five years, which is funny because now it's two years later, right? And humans are predictably terrible at predicting the future, and it's worse when you start throwing in technology. But the team I was on determined that three things would be game changers. One was Docker, and at the time it was version 0.3, so it was pretty early on. The other was CoreOS — that one was kind of a wild card, I'm not sure; they still have a pretty neat thing, and Rancher's out there, so there are a number of things that might work out in that space. But the other one was Mesos, Apache Mesos. So that's me. I'm an Apache Committer, and I'm a contributor to Mesos, though not a Committer on that project yet. I mainly work in the space of Java, but the project I moved to six months ago was in Scala.
And what I've been doing for the last three months, pretty much, is Python, which is my first Python project. It's been interesting. So don't laugh at my code. Oh, I meant to say this: this is going to be a technically deep talk, so hopefully you're okay with that. There are some challenges. We have an hour, and there's a bunch of stuff you'll need to know in order for Shakedown to make sense. So I've got a good 20 minutes or so of intro, which is kind of a deep dive into Mesos and how it works. Who in here is using Mesos? You're actually using it? Okay, great. Who in here is using DC/OS? Okay, a smaller number, yeah. Okay, I'm going to talk about Shakedown. It is a testing tool, a testing framework, for DC/OS. But the reality is that almost everything I'm going to show you can be used on plain Mesos, all right? There are some things that differ — and because there's such a large number of people who use Mesos without DC/OS, I'll throw in caveats along the way as to what might not apply. There are some pretty strong differences. I'll also show you some of the infrastructure. Now, I have one challenge, and it's a pretty significant one: I had fully intended to be up here with my laptop, doing demos live. Instead, I'm going to go through these slides now, and then towards the end I'll go back and probably demo stuff. I'll be on mic, and I'm hoping you'll be staring at the screen — you don't need to see me, right? But I actually want to show you the real thing, actually do it. I want you to have that exposure. If I had one hope or goal out of this, it's that you actually take a look at Shakedown and make some improvements — maybe send some pull requests and help improve things.
If you want, you can also send pull requests, or links to your frameworks' code, and I'd be happy to take a look from a testing perspective, all right? So, the core of it: I've been working on Mesos, and on schedulers, for a long time. Prior to being on the Marathon team, which is where I am currently, I was working on the agility team, which means Cassandra and HDFS were my main thing for about six months. Cassandra, HDFS, Spark, Kafka — there's a bunch of them. What is important is that if you take all of those frameworks — all of them — they have a certain nature in how they work with Mesos or with DC/OS, and we're going to talk a lot about that, all right? So I'll get going. Shakedown — what is it? Shakedown itself is a project which, as you can see in the quote up here, comes with some background: a shakedown is the thing we used to do to ships and airplanes — put them through a test before their maiden voyage, right? Now, even though I'm from the U.S., I've traveled and spoken on different subjects in technology all over the world, but this is my first time in Asia at all, and it is amazing here, by the way. I want to say that I've never seen a bigger airport than Beijing's. I've been all over the world and there isn't a bigger one. It's amazing. That said, it reminded me of something. In May I was in Stockholm, Sweden, and I don't know if you've heard the story of the ship Vasa, but the Vasa was built by the king of Sweden years and years ago. He was at war and he wanted to show off — he wanted to prove he was bigger and badder than the other kings he was fighting — and they created this ship. This ship, on its maiden voyage, about 20 minutes into it, sank to the bottom of the sea. They've since recovered it, and it's in a museum in Stockholm if you ever want to go there.
It's a fascinating story, but out of all of it, two things were huge drivers of this ship being a failure. There were a huge number of them, but the two that stand out relative to our talk are these: they had a high rate of change — a number of changes in the number of cannons they wanted to put on board, a bunch of stuff — and the other thing was testing. Testing. So we're going to focus a little bit on testing. Here's our agenda for the session. I've never given this talk before, and we have an hour; I have no idea whether I'm going to go over or under. We'll see, right? But feel free to ask questions. So, testing in general. Now, I'm making a couple of assumptions, and these are big. One is that you're developers — this is a developer talk, right? So when we get into testing, what do we mean? Typically there are three different types of tests we think about: unit tests, integration tests, and system integration tests. That last one may be kind of an odd one to you, but it's the thing we're going to talk about, and it's significant. When we look at unit tests, what we're talking about is: you're writing code, you're dealing with a unit of code. If you're writing it in Java, you're probably using JUnit; if you're in Scala, you're using ScalaTest, right? There are a number of options here. They're usually fast-running. They're usually not connecting to any external systems. There are a number of characteristics, and we expect these to run every time someone pushes code — you run them prior to pushing, or you push and they run in the CI environment, the continuous integration environment, right? So that's the character of a unit test. Now, what can be confusing is the integration test. An integration test is still in code, still in Java — in the language of the project, almost always.
The difference is, it's not just a small unit; it's a collaboration of classes, a collaboration of objects. And often, what we want to do in an integration test is integrate with something else, right? You may have a fake database; you might put in an in-memory database. You're not using a real database — you're using something that mocks it in some way. That's an integration test. These tests can be slower, and oftentimes the slowness is due to data: you have to establish some data, make some changes, run your assertions, and then back that data out and essentially start over, right? So it almost always has to do with data, or it could be connectivity — the latency associated with certain things. Now, the next category, the system integration test, is where we're going to focus, and it's strongly different. In this environment — let's be very specific — we are putting a Mesos framework into a Mesos cluster and testing it. How does it act? What's going on with it? How does it respond to failure? Those are the kinds of things we'll be looking at. Probably a better way of looking at the three is the comparison in this table, and the biggest difference is time. We expect unit tests to be fast — this figure is probably too high; it should be sub-second, right? Integration tests are slower; it varies, usually due to mocking and data. And system integration tests are really, really slow, and in fact sometimes we don't control much of it — we don't control any of this. I run tests all the time on Amazon, and things that normally finish in seconds can sometimes take 10 or 30 seconds. It varies so widely that you have to be prepared for that. That's a strong, strong difference: you don't have the controls you have in an integration or unit test environment. Some big ones are that you have network access, right?
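To make the unit-versus-integration distinction above concrete, here is roughly what the two styles look like in Python with pytest. This is a toy sketch — the reservation function and the in-memory database are hypothetical examples, not anything from Shakedown:

```python
# A pure unit test: no I/O, sub-second, exercises one unit of logic.
def add_reservation(reservations, name):
    # The "unit" under test: returns a new list with the name added,
    # refusing duplicates.
    if name in reservations:
        raise ValueError("duplicate reservation")
    return reservations + [name]

def test_add_reservation_unit():
    assert add_reservation([], "kafka") == ["kafka"]

# An integration-style test: collaborates with a *fake* external system.
class InMemoryDatabase:
    """Stands in for a real database: establish data, change it,
    run assertions, then throw it away and start over."""
    def __init__(self):
        self.rows = {}

    def insert(self, key, value):
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)

def test_reservation_persists_integration():
    db = InMemoryDatabase()  # the mocked external dependency
    db.insert("framework", "marathon")
    assert db.get("framework") == "marathon"
```

Both of these run in milliseconds with no cluster; the system integration tests we'll get to are the opposite in every one of those respects.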
Whoops, wrong direction. We have network access, we're using external systems, and it requires a whole cluster. Those are pretty significant differences. These buttons are backwards when we run them. Okay, some things I want to re-emphasize. When we're on a cluster, it's a real-world cluster, and everything in Mesos is asynchronous. That seems foreign to people who are very new to Mesos, but everything is asynchronous. When you register and get a response back — literally, when I make a socket connection, the return of the socket, the response, means nothing. I hope it got there. There's some level of guaranteed delivery that TCP provides me, and there's some level of guaranteed messaging we get with Mesos — unless you're using framework messages, which aren't guaranteed. It's asynchronous. The only way I know that a registration occurred is that a callback occurs. Again, it's asynchronous. Everything in Mesos is asynchronous. It's really hard; you need to learn about it, you need to be in that world, in that mindset. Let's talk about DC/OS — and what I really mean is Mesos; I'll point out some differences in a second. First, if you're not familiar: to create a Mesos environment we have some Mesos masters, and here we have three, the minimum for production. When I'm doing Shakedown, I have one, because I don't care — unless I do. Sometimes I'm literally testing what happens to my framework when a Mesos leadership change occurs, and that's the thing under test; in that case I probably need three. The magic number in production is five, but it varies: one for development; three is the minimum HA, high-availability, cluster; more masters buy you more nines. Hopefully everybody's aware of that. Anything I say in here you can come dig into in more detail, but that's what we recommend. Beyond that, there's a quorum of ZooKeepers that manages that whole mess. If we're doing Shakedown, we'll test in a whole cluster.
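A word on that delivery-guarantee point, because it matters a lot later: a guaranteed message is one that is resent until the receiver acknowledges it at the application layer, not merely at the TCP layer. Here's a toy model of that at-least-once pattern — this is an illustration of the idea, not the actual Mesos implementation:

```python
def deliver_status_update(update, send, max_attempts=5):
    """Toy sketch of at-least-once delivery: keep resending the update
    until `send` reports an application-level acknowledgement.

    `send(update)` must return True only when the receiver has actually
    acted on the message and acked it -- a successful TCP write alone
    proves nothing, because the receiver may die before processing it.
    Returns the number of attempts it took.
    """
    for attempt in range(1, max_attempts + 1):
        if send(update):
            return attempt
    raise RuntimeError("no acknowledgement after %d attempts" % max_attempts)
```

An unguaranteed message (like a framework message) is the degenerate case: send once, hope for the best, never retry.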
Typically I have a ZooKeeper, I have a master, and I have a set of agents — we changed the name from slaves to agents — and it can be any number of them. That's why I may need a varying cluster size. If I'm testing HDFS, it requires five agents; I have to have five: three for the three journal nodes, plus two data nodes, so five. So you have to be in the mindset of what your application or framework is doing in order to know what size cluster you need. For the tests we're going to do here, I've got a two-node cluster. When we're talking about DC/OS, you also may be testing a public node versus a private node, so those are the kinds of things you might be interested in. And then we have our framework, or our scheduler, itself. Again, I'm assuming people know what that is; I'll start diving into its behavior in a few minutes. Okay. So that's the core of Mesos. Beyond that we have to talk about DC/OS, and this is the infrastructure we have when we install a DC/OS cluster. In a DC/OS cluster there are some things outside of the Mesos world that are super important to realize. The first is that we have this admin bubble here — we call it the admin router. It is essentially an NGINX process that monitors things and handles all incoming requests. Any scheduler that gets registered appropriately in DC/OS will not only be a scheduler on Mesos, it will also be registered with a service endpoint in the admin router. And that becomes the entry point for, one, RESTful APIs, and two, our CLI, our command line. So that becomes super important to us — it's the thing we're going to be testing against in here as I start to show off some code in a few minutes. The admin router becomes very, very important. Also, if you're not familiar with DC/OS as a whole: you can go to this master, and if you go to it with nothing else in the URL, you'll get the DC/OS UI.
But you could put /mesos and you'd see Mesos. You could put /marathon and you'd see the root Marathon, and you could put /exhibitor and you'd see the Exhibitor access into ZooKeeper. So there are a number of things that aren't often advertised that are useful to know when you're a framework author and need to understand what's going on inside the cluster — especially necessary when we're doing Shakedown-type tests. So we have this master, and it has an admin router. We have a web interface to it; there's also a RESTful interface and a CLI interface into it. That master then communicates with two types of nodes. The first is labeled here as a worker node, and we have one in the public and one in the private space. This is where your cluster size will vary: this will probably be one node, it could be more, depending on what you're trying to accomplish. Private means private, though. The master is the only thing talking to these nodes; the nodes can talk to each other, and they can probably talk out to the public, but no public traffic will come to the private nodes. That is the design intent. Okay, so let's talk about how Mesos and DC/OS work — and mainly at this point we're talking Mesos. First, again: if you are creating a framework for Mesos, there are two components to a framework. One is the scheduler, and two is an executor. Almost everything we're going to talk about here is scheduler-oriented; we are probably a little bit weak on executor testing infrastructure at this point, so if you see things you'd like to add, please tell me — I'd love to grow the to-do list for the next month. The scheduler. The scheduler's job is to register with the master. It will usually find it through a ZooKeeper lookup.
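Because that registration handshake is asynchronous — you send the register call and only a later callback proves it worked — a common pattern is to pair registration with a timeout. Here's a minimal sketch of that idea in plain Python; the class, names, and timeout value are all hypothetical, not part of the Mesos bindings:

```python
import threading

class RegistrationWatcher:
    """Toy sketch: after sending a register call, the only proof of
    registration is an asynchronous callback from the master. A scheduler
    should give up (or retry) if that callback never arrives."""

    def __init__(self, timeout_seconds=30.0):
        self.timeout_seconds = timeout_seconds
        self.framework_id = None
        self._registered = threading.Event()

    def on_registered(self, framework_id):
        # Invoked asynchronously when the master confirms registration.
        self.framework_id = framework_id
        self._registered.set()

    def wait_for_registration(self):
        # True if the callback fired within the timeout, False otherwise.
        return self._registered.wait(self.timeout_seconds)
```

The point is simply that the return value of the register call itself tells you nothing; only `on_registered` does.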
It doesn't matter, though — you can literally tell it what the master is and it can find it. It will then do a registration. Thing to note, very important: you can see here that the scheduler sends a call to the master, and it's asynchronous. The scheduler — your scheduler — this is your work, right? I'll use Marathon, or Chronos, or Metronome as examples; I'll have a number of examples in here we'll look at. But the idea is that you have a scheduler, right? And if you have a scheduler, the one thing to know is that when you register with the master, you should have a timeout. That timeout is: wait a minute — I registered with the master, I haven't heard back, it hasn't told me I'm registered yet, so what's going on, right? It's all asynchronous; you don't know. Eventually the master will call back and say you're registered. It's all asynchronous. So those are the kinds of things you could test. In this example we're actually looking at an offer coming through. Essentially, all these agents will be talking with the master. The master keeps, essentially, a list of resources available across the cluster, and it will periodically send an offer to the scheduler: hey, there's an offer available. That's the way this works, right? If you're very new to Mesos, this is quite different from, say, YARN or some of the other schedulers out there. It's not top-down; it's not the scheduler saying "I want this." It's the other way around: it's Mesos, the master, that comes through and says, hey, I've got agents with offers available of this size — do you want to use them? Then you can either hoard these offers, which is not something I would encourage, you can decline them, or you can accept them. Those are the three things you can do with offers. Now, there are schedulers for which hoarding is a useful concept — where you need so much of the cluster to be able to do some kind of work.
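That accept-or-decline decision is the heart of a scheduler, and it's also the part that's easiest to reason about in isolation. A toy sketch of the matching logic — the dict-shaped offer and the function name are invented for illustration; a real scheduler works on Mesos offer protobufs:

```python
def evaluate_offer(offer, needed_cpus, needed_mem):
    """Toy offer-matching sketch: accept the offer only if it carries at
    least the resources the job definition requires; otherwise decline.

    `offer` is a plain dict standing in for a Mesos resource offer,
    e.g. {"cpus": 4, "mem": 8192}.
    """
    if offer["cpus"] >= needed_cpus and offer["mem"] >= needed_mem:
        return "accept"
    return "decline"
```

Logic like this is exactly what a system integration test exercises from the outside: send the framework an offer that's too small and verify a decline comes back.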
They are rare, but that would be something interesting to test, right? Those are the kinds of things you're looking to test: if I send an offer to the scheduler and the offer is too small, it should give me a decline. Those are the kinds of things I might be testing; the rest of the scheduler becomes less interesting. Let's walk through what that looks like — again, all I'm trying to emphasize here is the asynchronous nature of how Mesos works. First, we have a Mesos master that registers with a ZooKeeper, probably a quorum of ZooKeepers. Notice a couple of things: agents come up and they register. Here we have an agent that came in, did a ZooKeeper lookup, and then did a register. That is a one-way call; it's asynchronous. The next thing that happens is the master comes back and says, yeah, you're registered — cool. It's all asynchronous, right? If you don't get that callback, then the agent is not active. I'm going to look at Marathon a lot in here, mainly because I've spent the last two to three months on its Shakedown tests, and since that's our topic, it seems easy to show what I've currently been working on as examples. Marathon will come in here, do a lookup, and do a register. Notice the same thing occurs: the master comes back — yeah, you're registered. This is the same activity I would expect from your framework: when you create a framework, you register, and you get a callback that says you are registered. And then a bunch of offers come in. Now you have to decide — your logic has to know what to do with them. Marathon declines all offers coming in; it just keeps declining. Why?
Because nobody's told Marathon what to do yet. It doesn't have anything to do; all it does is what a user tells it to do. So along comes a client, and it makes a request for an application to run. Now Marathon has an objective, and the objective is to land this thing on an agent that has the set of resources defined by the job — say the definition of the job is that it needs two cores and four gigs of memory. When the next matching offer comes in, we ask the master to launch it. Notice who's talking to whom at this point — this is super important, because this is literally what's happening. An agent that matches will be told to launch an executor. All frameworks have a scheduler and an executor; if you don't have a defined executor, the default one is used — there is a Mesos executor that serves as the default. The executor will be asked to launch, and it will launch a task. Notice a couple of things that might be useful. Mesos is interesting in that we have this thing called a task, and the reason it's not called a process is that a task could be a thread or it could be a process — your framework decides what it is. By default, in this particular case, it's going to be a process; it's going to launch that process. It's also important to note — and I want to be very specific about things — that there is a node. Let's say that node is defined here. The node has an agent process running on it. So the kinds of things that could go wrong on that agent node: the agent process could die or bounce, an executor could die, a task process could die. Those are all things that are important to us as we start to test these things out. As a quick example, there are some status updates that come filtering back as well. Now, the thing that's really interesting
about a status update: there are two types of messaging when you're dealing with a framework or a service inside of DC/OS or Mesos. One type of message is this one — it's a status, and status messages are guaranteed delivery. If there's any kind of network failure along the way — if the master died and then came back — it would eventually receive this status update. It's guaranteed to happen. The other kind of message is a framework message, and framework messages are not guaranteed. Now, they use TCP, and TCP is guaranteed delivery from point to point from a networking standpoint — that's true. But what we can't say is that, if the packet made it across to the master and at that point the master died, the master actually acted on the message. That is strongly different from a status update, because a status update, if it isn't acknowledged at the application layer — not at the TCP layer — will be resent until it is acknowledged. Those distinctions are very important to understand when you're dealing with Mesos. So eventually this goes back all the way to the framework, and the framework decides what it's going to do. In the case of Marathon, it logs these things and manages this information inside of ZooKeeper; currently that's what it does — I think we're going to move away from that, but that's what it does today. So when we have a task failure — something failed here — a status update happens, and it goes back to Marathon. This is the thing to expect: there is a relationship between the agent process and the executors it manages — it's talking to them — and all of a sudden it can't talk to them anymore, or the executor can't talk to the task process anymore. Any one of those things causes a rippling effect: a status update goes to the master, and then to the framework that owns that task, acknowledging the fact that
it doesn't exist anymore — that it was killed; in this case it's a kill. Now, what to do next is up to you and your framework, but it seems logical that you're going to relaunch that thing. So in this particular case, say we relaunch it — there are lots of reasons it might still land on the same node it was on. It relaunches the whole process, and you get status updates coming back all the way to you. And that is exactly how Marathon works. All right, any questions about that? That's important before moving on to the testing. We'll dive into the testing now. [Audience question] You ruined my joke — I was going to say, "I'm not following; move ahead." Let's take it offline; I don't want to get too deep into the weeds, and that seems specific. What I'm hoping to do is cover how the framework and the messaging work, because we're about to test exactly that; your question dives deeper into a specific subject, which I'd be happy to cover, by the way. Is that good? I'm going to corner you after this talk and answer your question. All right, so let's talk about Shakedown. We understand testing, we understand system integration testing, we understand how DC/OS and Mesos work — now we're going to put it all together. First, we can go to the project site. Important things to know about Shakedown: one, it's Python, so you're going to be writing Python — if you're using Shakedown, you're using Python. Two, it uses Python 3.4. The default on most systems is still 2.7, which is unfortunate, but that's the truth; we're using 3.4. A lot of projects within the Mesosphere world are moving to 3.5, so I don't know if that will change here or not. I've gotten really used to using pyenv — similar to rbenv, the Ruby version manager — which lets you manage multiple versions of Python, something that has become necessary in my life lately. So be aware of
that: it's Python. You're going to install it simply like this. Now, if you have a cluster already created — if you're in your own private data center — you are of course going to need access to PyPI, the ability to install Python libraries; that's a minimum. We're going to do a different install of DC/OS Shakedown. If you're a developer, it's a little different, at least it is for me: it's a git clone of the Shakedown project — which, by the way, is an open source project under the dcos namespace — and then I do a `pip install -e .`, and I'm installing whatever the current latest and greatest is. So those are the two types of install I might do. Some things to know when you're getting started. The first is that you're on your own to create the cluster — Shakedown doesn't do anything for cluster creation. What Shakedown can provide, once you have a cluster, is things like package installs. You can do that through Python; sometimes I do, and sometimes I do the installation external to running Shakedown — I don't know that I have a reason why I did one versus the other. But the toolkit provides the ability to do things with the Universe. And again, this is probably one of the things I left out of the DC/OS conversation: the whole goal with DC/OS is a data center operating system. It sounds a little weird — an operating system for the whole data center — but what does that mean? What do we mean when we say operating system?
When we say operating system, it should have a file system; it should have a package management tool, like an apt-get install or a yum install; there's a list of things we expect an operating system to have. Well, with DC/OS it's a little different, in that you can't say it's always X. For the file system, it could be S3, or it could be Gluster, or it could be HDFS — those are all valid ways of storing things in a file-system-like way. For packaging, we call it the Universe — again, another open source component. If you're used to a Mac, to using Homebrew with a `brew install`, it's a similar packaging mechanism; I'll show you in a second if you're into it. So: a DC/OS cluster — you've got to create one. Second, you've got to configure Shakedown — install it if you don't already have it; there are some configuration options I'll show you in a second. And then you just need to run it. At that point you can just run it, and I'm going to show you some examples of that as well — hopefully we'll have time for actual demos. This is just running Shakedown. Is that me? Could be me. It's kind of a neat effect, though — pretty cool. Oh, you turned me back on. Some of these options are really quite new. Let's see here: you can see we can pass the SSH key; we can have standard out — I tend to like standard out. Newer stuff: we started adding some authorization-type options recently, so those are both worth looking at if you want. That is just running help. How do you run this thing?
Well, typically we can do something like this. Now, I've added some unit-style things in just because they were there, right? It used to be — not that long ago; in fact this just changed within the month — that there was a required flag that always had to be passed: --dcos-url. It turns out that, as a developer, the DC/OS URL is already configured for whatever I'm working on, so I'm just saying: give me whatever's already configured. Within the last month we said: you know what, just go find it — if it's there, just find it. Now, the thing that's strongly different: if I'm running on my local box, I almost always have a DC/OS cluster installed or ready to go. But if I'm running in CI, a continuous integration environment, I may not — you don't have to have one configured in your CI environment; you may pass it in. What else? You can see the SSH key. You don't always need that; you need it whenever you're running commands on a node, transferring files, or doing network partitioning. We have some functions that provide network partitioning; when you do that, it's sending over a file and executing a command — it's actually changing the firewall rules — so you need an SSH key in order to do that. I say that because I have a cluster creation tool that I'm happy to show you, but it's internal to Mesosphere and we haven't published it; it's for our own purposes. It auto-provisions on Amazon with keys that all of our developers have, which makes sense. When we're running stuff on GCE or Azure, which we do, we're still figuring out how we want to manage that. We're also working with clients in private data centers where we have no access to things; you would need to provision nodes that support SSH if you want to use this capability. The big thing here is a bunch of extra flags that I'll show in a second, but this right here, this last line —
tests/system/test_network_partitioning — that is the test suite I'm looking to run with that line of execution. So that's an example of what I would run in a CI environment to do some network partitioning of whatever it is I'm running — in this case, Marathon. Now, we've cleaned some things up to make life easier. You can see in the previous slide there's a bunch of stuff like --ssh-key and --stdout — I want to see standard out; some people don't, but I want to see what failed, why it failed, what's going on. We can configure all of that in a dot file in your home directory: any of the double-dash flags can just go in that file and automatically become the default, which reduces the command line to something much shorter. The one exception is the DC/OS URL — that doesn't go in the file, but if you have it in your local environment, it will automatically be picked up, so just be aware of that. The next thing to note is that if you're in a project — and I want to show you Marathon in a second — it will automatically discover tests. If you say `.`, it will look through this directory and all subdirectories until it finds everything matching test_*.py; if it's test_*.py, it will find it and execute it. Okay, that's important to note. You could instead run a specific suite: in this particular case, I'm saying I want to run just this file and whatever is in it. It could have a lot of tests in it, but that's one suite — it's not running the other files in there, it's not discovering anything. Last, I can run a single test. I can say: here's the test file, test_network.py, colon colon — if you've used pytest before, pytest naturally has colon colon and then the test within the suite; we're doing the same thing pytest has always done. So you say `::` and then you run not the whole suite but just that one test inside it. That's the only thing
you're going to run. Now, there's one last thing that's not up here that's useful — to me, anyway — and that is you can have a file that is a test file but is not labeled test. If it's labeled `test_`, it will be auto-discovered and run as part of the suite, like this right here: if a file is called `test_*.py`, it will be discovered. But I have some things that I like to run occasionally through Shakedown that I don't want auto-discovered, and the one that comes to mind — which I'm happy to show you if you're interested — is over-provisioning. I don't tend to over-provision things, but sometimes I want to. My default installation on Amazon is an m3.xlarge, which essentially is a 4-core machine. Sometimes for certain tests I'm not testing the nodes themselves, I'm testing how Marathon is working. So — and I'll show you some code — it will cycle through all the private nodes, and I will essentially tell each one of them that it doesn't have 4 cores, it actually has 100 cores, right? That gives me a cluster that's large, and sometimes that's exactly what I'm looking for, but it's not something I want happening on every run — that's essentially what I'm saying. So I have a file that doesn't have the `test_` nomenclature as part of the file name, but if I point Shakedown at it explicitly — over_provision.py — it will actually run it. I find it useful. Then, inside, a pytest file will look like this, and this is actually one of the examples of things we actually run. I have a def — I've defined a function. Anything that's a function with test as a prefix is a test; anything else is intended to be supportive in nature and is not run unless some test executes it. There are a couple of exceptions to that, but we'll see them in a second. test_ui_available: I'm checking here with just an HTTP GET. Now, there are a couple of other things, in particular on DC/OS, that were added — again, I
don't know, maybe 9 months ago. You can't just talk to the admin router; you have to be authorized to talk to the admin router, so there's a little token that gets passed around. If you were to do an HTTP request at the command line — like a curl request — you'll get a 503, some kind of redirection, or an unauthorized error, something like that, right? So the important part here is what I'm doing with http.get: this is in the dcos module. I use its http, and it will automatically send that token for you. It's awesome — really awesome. I do all kinds of tests now with either Shakedown or Python alone using those libraries, because otherwise it can be frustrating; I'll show you what I mean if you're interested. Here we're just doing a simple request to make sure that the DC/OS service URL of marathon-user responds — that is the default service name for Marathon being installed on top of Marathon. Internally in the company we commonly call it MoM: Marathon on Marathon, M-o-M. And sadly there's a place in my code where I kill MoM — I'm sorry, but we wanted to see what happens when MoM comes back. So terrible. We're asserting that we get a 200: if we get a 200 back, then we successfully got an HTTP OK and things are there; if not, they're not. That's the thing we're looking for. Typically, inside that file — once again, if we're looking at that as a test — what you normally have at the beginning are at least these two things: you're going to import everything from shakedown, and you're going to import things from dcos. You don't have to for dcos — you certainly have to for shakedown, there's a bunch of Shakedown things you're going to want — the dcos imports are optional. I'll show you a bit more detail as we get to the demo. A couple of functions — as I mentioned, if a function is labeled with, or starts with, the word test, it will automatically run, and everything else is just supportive.
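As a rough sketch of that file shape (the shakedown/dcos imports are commented out because the real calls need a live cluster, and the helper below is a stand-in of my own, not Shakedown's exact API):

```python
# Sketch of a Shakedown test file, following the talk's conventions.
# The real imports would be roughly:
#   from shakedown import *   # Shakedown helpers (you'll want these)
#   from dcos import http     # optional: CLI modules; auth token handled
# They're commented out here so the sketch runs without a cluster.

def fetch_status(url):
    """Helper: no 'test' prefix, so pytest only runs it when a test calls it.

    A real version might do http.get(url).status_code, with the
    admin-router token sent for you."""
    return 200  # stand-in for the HTTP OK you'd get from a healthy service

def test_ui_available():
    # Prefixed with 'test', so it is auto-discovered and run as a test.
    assert fetch_status("https://example.test/service/marathon-user") == 200
```

Running Shakedown against a directory containing this file would pick up `test_ui_available` automatically; `fetch_status` never runs on its own.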
There are a couple of exceptions, which are listed here. The first is setup_module. Think of it as setup for the test suite — it will be run once for that test file, before everything inside it. We also have setup_function. Oddly — the person who first created Shakedown, Scott, who's awesome by the way, had quite a good way of doing this, but I found it funny: I needed this, and he was shocked that it worked — there is a def setup_function, and you literally receive the function as the parameter; it will be called as setup for every test. That's something I might use with Marathon, where I make sure that no other applications are currently being deployed before I start the next test, or something like that. And then lastly we have teardown_module. It is the reciprocal: it will be called once, whenever that file is done running. I have started doing some scale tests — I'm not happy about how I'm doing it quite yet, but I'm happy to share what I'm doing. My preference is to push scale or performance numbers out to an external source, like we're thinking about going to data.org right now; for the moment it just prints out comma-delimited output, which I then paste into a spreadsheet. That's what I'm doing now, and that output comes out of the teardown_module — when the entire test file is completely done and you just want the numbers. And then there's the helpful stuff. When you're getting started with Shakedown, the hardest part is: what are all the resources I need to know enough to get started? I've already given you a lot. Install it; you're going to have to learn some Python, and there are some Python tricks I'm going to show you that are useful, in particular the context manager. You'll realize I'm new to Python — there are probably people in here who know Python much better than I do; come teach me, I'm happy to learn. But the context manager is
great — I'm going to show you an example of that with Marathon. With Marathon, go to the API page. We keep the API page up to date: whenever we add a Shakedown function, we add it to that page. We've been pretty good about maintaining that documentation so far, and it's not too much; we'll take a look at it in a second. Now, oddly for me as a user, whenever you bring down Shakedown you also get the DC/OS CLI. I've started using the CLI's internal Python code — the DC/OS CLI is written in Python as well, so I'm using the modules that the CLI uses inside of my Shakedown tests, and I'll show you some examples of that. That is why I said to import things from dcos: the reason is that I'm going to actually use the marathon module directly from dcos. It's super helpful to me. If you were wondering what the marathon module can do in Python — unfortunately it's not documented; you actually have to look at code. But it's super easy when you look at the code: just go to the DC/OS CLI, which is an open-source project, so you'll be able to find it. The last thing — if you put together all these things, you should have enough information. I realize you probably want an example; I'm going to show you some, and I'm going to tell you where to go, though you may well go beyond those examples. But there's one more additional thing that's useful. Oftentimes what you get back from a function call is a JSON object, but you may not know its structure — what does it look like, what are the things inside of it? Once again, there's very little documentation on that. My favorite approach is to go to Mesos, or go to Marathon, and get the JSON directly from it as the source of documentation — I'll show you that. Also, sometimes you just print out that JSON and go: okay, that's the thing; here's the field I'm looking for. That's the way to get past that. This page is all the resources you need to know to do a lot of
work — this one page. Those are the resources, outside of seeing an example. If we look at Shakedown and what it does today: one, packaging. I mentioned the Universe before; the thing that controls the Universe is something called Cosmos inside of DC/OS. Those two things work together, and you don't have to know much about it — all you need to do is call some Python functions; it's super simple. An example is install_package, and there's another method called install_package_and_wait — waiting until the package is fully installed. This is why I emphasize so much that everything is asynchronous. I personally have written so much wait code working on Marathon over the last few months that I recently put all of the wait-for helpers into Shakedown — that was just pushed a week or two ago. There's always something you're waiting for, so you'll start to see some code cleanup and consolidation. Shakedown will also do remote command execution and file copies; you'll need SSH to do that. Waiting for events — what are the things you'll wait for? I'm waiting for a deployment. I'm waiting for a service endpoint to materialize. I'm waiting for a DNS name to actually come into existence. I'm waiting for a DNS name to go away. I'm waiting for a service endpoint to go away. I'm waiting for a task to be launched. Those are the kinds of things I might be looking to wait for. Or I'm not waiting at all — I'm checking to see if something is the way I thought it should be. Or I could kill it — kill a process. So you can see the combination of things: I'm going to launch something, maybe Marathon on Marathon; I'm going to launch an application on Marathon; I'm going to wait for the task to materialize; when the task is there, I'm going to execute a kill command on the process, then verify that the task died and that it was relaunched — and relaunched with the same constraints that were there before. Maybe I had a constraint to the host, maybe I didn't have a constraint at all. I may just verify that the
task ID changed. Those are useful things to know. By the way, if you're a framework author, this is the place to be — if you're a framework author, never reuse your task IDs. It's a common thing that I see, and there are some problems that come with it; I could go into details, but not right now. Fault injection: killing processes and restarting them — kill the agent, kill the master. By the way, a master recovery takes a very long time; it can be frustrating, so you want to be aware of that, but sometimes it's useful to know what's going to happen. We recently had a bug in Marathon, within the last three to four months, where we had task loss. We had a customer who was very important to us — a very large customer — and they had some real challenges in their data center: they had network partitioning all the time, which was confusing to them, but they had it all the time, and so they would hit this task loss. We fixed that in Marathon, and we wanted a Shakedown test to confirm that it works the way we expected it to, so that's what we spent some time doing. And lastly, I just added some firewall rules as well, where we can actually adjust the firewall rules of a given node. Now, from a test you can't actually physically break a network, right?
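As a sketch of what "adjusting firewall rules" amounts to: the helper name here is mine, and the exact iptables invocation is my assumption — the point is simply that you build a shell command and run it on the node, e.g. over Shakedown's SSH support.

```python
ZK_CLIENT_PORT = 2181  # ZooKeeper's client port

def port_rule_cmd(port, insert=True):
    """Build an iptables command that drops (or stops dropping) inbound TCP
    traffic on `port`. insert=True adds the DROP rule; insert=False deletes
    the same rule, restoring connectivity in your teardown."""
    action = "-I" if insert else "-D"
    return "sudo iptables %s INPUT -p tcp --dport %d -j DROP" % (action, port)

# "Kill" the ZooKeeper connection, then restore it:
block = port_rule_cmd(ZK_CLIENT_PORT)           # run on the node to break ZK
unblock = port_rule_cmd(ZK_CLIENT_PORT, False)  # run in teardown to heal it
print(block)  # sudo iptables -I INPUT -p tcp --dport 2181 -j DROP
```

Always pair the block with the unblock in a teardown, or the node stays partitioned after the test ends.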
So we do that virtually: I literally go in there and block ports. When I say I'm going to kill my connection to ZooKeeper, I literally go change the firewall rules so that port 2181 is not accessible — boom, no ZooKeeper. Now, this is the reason, by the way, that I referred earlier to root Marathon — that's the Marathon on DC/OS — and to MoM, Marathon on Marathon. One of the reasons I will create a MoM environment is that when I kill the ZooKeeper connection, MoM will always land on an agent, not on the master. It's really, really hard to break the master's connection, since root Marathon talks over localhost — so you want to be aware of that. I'll also point out things you'll see in Shakedown, like the ability to pull the IP address that MoM landed on: I go through all the other agents and find an IP address that MoM is not on, so that I can constrain a test application that MoM launches such that it won't land on the same node, right? Those are the kinds of things I tend to do in code. I can see that I'm running somewhat long, so we'll be somewhat brief here — I actually want to look at code. So: if you want to hide a file from discovery, just leave off the test_ prefix; here we have a non-discoverable pytest file, and it only runs when you're explicit. I'm going through a few tips and tricks here for working with Marathon. I mentioned before that I actually use the dcos Python module — marathon.create_client(). This is JSON right here that I pulled; you can see I read it straight out of a file. Sometimes I'll create the JSON right in Python — it varies with the size; sometimes the JSON is so long that it looks terrible in Python, so I'd rather have it in a JSON file. Here you can see I've pulled an app, and I'm asserting that the user is None — that there's no user defined on that application — and then I run it. And notice here I'm doing a kill; that's the long-winded way of killing. Eventually
we'll make some improvements there. I mentioned that it's useful to use the context manager, especially if you're in a MoM situation. Here in code I'm just checking that this will run with a good user — in other words, a user that's defined on the node. So I have a good-user test and a bad-user test: what happens if an app has a user it's supposed to run as, and that user doesn't even exist in the cluster? I have both tests here. I say `with marathon_on_marathon():` — so for everything within this indented block, the client created right here is on MoM, not on root. If this code were out here, it would be on root; but since it's inside of this context, it will be on MoM. That's the kind of thing I do: I stand up a Marathon on Marathon and then work within its context, and all the work in here is being done against the Marathon on Marathon. I mentioned some of the things you might be waiting for. If you want to create your own wait point, the spinner that I created is fairly abstract: all you need to do is create a predicate. Here you can see a task-CPU predicate — some code that checks and verifies something. Here's the predicate, and here's me using it: I pass in a lambda. The whole point is that I don't want this function called right here; I want it called only within the context of the spinner function. So this is somewhat on the advanced side — passing a function into a function to be executed within that function, hence the lambda. Just be aware of that; some of these can be somewhat complex, but again, most of the things you probably want to do are already in place. If you're going to create your own context manager — this is how marathon_on_marathon looks. We create the context manager; you can see it up here: all we do is a bunch of setup work and then we yield, and the code after the yield is the cleanup that recovers things. So that's
a fairly simple example, and then this is us using it — you already saw an example, just two slides ago, of using marathon_on_marathon. All right. So, what do I have — I've got like 5 minutes, maybe? Tell me. I have 3 minutes. Stare at the screen for a second. Marathon — you're going to see this code in the test directory. Here we're checking the default user, and all of this is a confirmation that the user is the root user and nothing else. In this particular case it's a little bit tricky: I'm returning an exit code of 1 if it's not root, so there's a little magic at the bottom, but not a big deal. If we actually wanted to run that, we literally would say: shakedown, run this — you can see test_root_marathon — and this would run. A couple of things are useful: it will tell you the version of the cluster you're on, and it will tell you the version of pytest you're running. Here's the abbreviated version of the output, showing Shakedown running the test system in this particular example. On setup, because I'm on a Marathon-on-Marathon environment, I actually print the version of Marathon, because I don't always know what's in the Universe and I want to make sure that what I'm testing is the exact version. So, we are pretty much out of time. I was going to give you a little bit more demoing — if you want, just meet me at the booth; I'll go straight out there from here and can dive into some more details. Am I missing anything? If you have more questions, please come by. Thanks.
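To close the loop on the spinner-and-predicate idea from a couple of slides back, here is the general shape of that pattern as a standalone sketch (the function name and signature are mine, not Shakedown's exact API):

```python
import time

def wait_for(predicate, timeout_seconds=120, sleep_seconds=1):
    """Poll `predicate` (a zero-argument callable) until it returns truthy,
    or raise once `timeout_seconds` have elapsed. Passing a lambda defers
    evaluation, so the check runs inside this loop, not at the call site."""
    deadline = time.time() + timeout_seconds
    while True:
        if predicate():
            return True
        if time.time() >= deadline:
            raise TimeoutError("condition not met in %ss" % timeout_seconds)
        time.sleep(sleep_seconds)

# Example: pretend a deployment count drains to zero over a few polls.
pending_deployments = [2, 1, 0]
wait_for(lambda: pending_deployments.pop(0) == 0,
         timeout_seconds=5, sleep_seconds=0)
```

All of the wait-for helpers mentioned in the talk (deployment, service endpoint, DNS name, task launch) reduce to this loop with a different predicate plugged in.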