It's kind of interesting given the title of your book, because as he was describing this test framework they built years and years ago, it sounded bizarrely similar to Ansible. They built something that looks a lot like Ansible as an integrated test framework, before Ansible even existed: it describes all these hosts on the network, and you have these YAML files that say run these jobs on these hosts. We just keep reinventing the wheel.

Ladies and gentlemen, thank you for coming to the final session at DevConf 2018. We have with us Micah Abbott from Project Atomic at Red Hat, and he's going to be talking about using Ansible as an integration test framework. Welcome, Micah.

Thank you. We made it, everyone: last day of DevConf, last talk. Thank you all for being here; I know most of you probably want to get home or go off to do other things. My name is Micah. I work at Red Hat as a senior quality engineer. I've been working on Red Hat Atomic Host, and now Red Hat CoreOS, and this talk is about how we were misusing Ansible to test Atomic Host. It's going to be a retrospective on what we were doing, because as we move to support CoreOS we're actually taking a different tack, and I'll get to that towards the end of the talk. I'll start with how we got to the point of using Ansible, then cover the problems we faced with this method, and the solutions, sometimes just workarounds, to the problems we encountered.

So how did we get here? Back in December of 2014 I was hired as the first QE person to work on Red Hat Atomic Host. We had little to no test process and little to no automation at the time. I got tasked with the normal activities: come up with a test plan, automation, and since we were really starting to invest in continuous integration and delivery, I was also asked to start contributing in that area.

The goals we had for testing Atomic Host, once we got settled, were to test the integration of the many parts rather than each part separately. Atomic Host is an OS: there are a bunch of different packages and components that go into it. We didn't want to be responsible for testing each individual piece; we wanted to leverage the testing that had already happened, but make sure that when it was all put together, it worked correctly. We also wanted the automation to be easy to develop, easy to use, and easy to integrate with the continuous integration system we were working with.

And we had to overcome the challenges of working with an immutable host like Atomic Host. For those of you who aren't familiar with it, it's based on OSTree and rpm-ostree, developed by the one and only Colin Walters back there. Sorry, Colin. It presents challenges you wouldn't normally hit when working with a traditional Fedora or RHEL system: most of the filesystem is immutable, so you can't write to the places you're normally accustomed to writing, and there's no concept of YUM or DNF, so you can't install packages. We were also trying to learn container best practices: how do we put software and applications into a container, and apply that to how we test as well?
Since we didn't really have anything specifically geared towards testing Atomic Host, I started asking around among the people who were working with it, and I found someone who had started on a makeshift framework called the UAT framework. It was a combination of Python and Behave, and it used the Ansible API to talk to the host. Behave is a behavior-driven development tool built on top of Python, so I was encouraged that maybe we could leverage it and get more test cases contributed from people outside the QE organization. In Behave in particular, you write a test case in natural language, "I want to upgrade Atomic Host," and then it's up to the person implementing the test to write the necessary code in the back end to translate that. I was hopeful we would just get a bunch of test cases in natural language and we would implement them, but that didn't really pan out.

This framework had a lot of layers of abstraction, so we were really slowed down as we tried to pick apart the pieces and figure out where things broke across the different parts of the stack and the different use cases. Additionally, the Ansible API at that point, we started on Ansible 1.9, was really simple, simple in terms of how to use it, but difficult to make do the things you wanted it to do. A weird juxtaposition. We spent a lot of time unwinding errors, digging through the abstraction to figure out where the problems were. The API was also poorly documented; I was basically sitting in the Python shell calling help() on the different functions, trying to figure out what they did. Then Ansible 2.0 came out with what was essentially a whole new API and a lot of breaking changes, and suddenly we were at a decision point: do we keep trying to push this framework forward, or do we go in a different direction?

So we decided to evaluate our options. We needed something that was easier to develop and use. We realized that most of our tests were basically just running commands over SSH on the remote host. And we wanted to be able to run the tests across the different variants of Atomic Host: we had multiple Fedora versions, the Red Hat version, and the CentOS version. If we had a test that could run across all three platforms with little modification, that would be awesome. Additionally, I'm kind of a lazy engineer and, quite frankly, a terrible programmer, so I didn't want to invest the time to build up an entirely new test framework to achieve these goals. Since we were already using Ansible under the covers, why not just do this with an Ansible playbook?

So we started doing that. We took the existing sanity set of tests that we had for Atomic Host and turned it into a playbook as a proof of concept, around February 2016. Being completely new to using Ansible via playbooks, I didn't have any roles; it was just includes of files full of tasks. It looked something like this, this is actually from the first commit I made to the repo: a couple of Ansible file modules, a couple of shell callouts, and a bunch of includes resolved relative to the test itself.
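For readers following along without the slide, here's a minimal sketch of that early, role-free style. The modules (file, shell) and the rpm-ostree command are real; the file names and task details are hypothetical, not the actual contents of that first commit:

```yaml
---
# A flat playbook: no roles, just file modules, shell callouts,
# and task-file includes resolved relative to the test itself.
- hosts: all
  become: yes
  tasks:
    - name: create a scratch directory in writable space
      file:
        path: /var/tmp/sanity-test
        state: directory

    - name: shell callout to inspect the deployed tree
      shell: rpm-ostree status
      register: host_status

    # pre-2.4 style includes, paths relative to this playbook
    - include: ../common/atomic_upgrade.yml
    - include: ../common/atomic_rollback.yml
```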
We started to get some success with it, so I convinced the powers that be to host this set of tests, called atomic-host-tests, awesome name, right?, under the Project Atomic organization on GitHub. We added Vagrant support to the framework so that you could do a vagrant up, get an Atomic Host, and it would automatically run the sanity playbook, or any other playbook, for you. That didn't really take off in terms of people using it; it was a nice "look what we can do with Vagrant," but again it didn't really pan out.

I had been doing a lot of this work by myself, with some help from the development team and others, but we finally got a second quality engineer to work on these tests with me, and that's when we really started to get better at it. We were learning more about Ansible and the proper way to use it, starting to incorporate roles, things like that. We were testing across multiple versions of Atomic Host. We were hitting a lot of our goals, so we were feeling really good about it.

But there was a lot of pain. Number one, Ansible is terrible at handling reboots. Ansible is a descriptive language: you're describing the state of the system, and it's hard to describe a system as "rebooting." It's either up or down, either doing some action or not; a reboot doesn't really fall into that model. And the default output from Ansible playbooks is god-awful, it's just JSON without any line breaks, really difficult to parse, debug, and triage, so we fought with that whenever we got errors. Wait, where'd the slide go? It'll come back later; I'm sorry, I forgot the order of my slides.

When we chose Ansible, my thinking was that if we fail as early as possible, it will prevent us from shipping an Atomic Host that has problems. The problem with that is, if you fail early in your test run, you're missing a lot of coverage. Yes, exactly. The other problem is that once a playbook fails, Ansible basically terminates all your SSH connections, and it's hard to go back to the host and grab debug information, logs or whatnot, using the normal Ansible model. We also had some difficulty selecting which tests from a playbook to run, because, as I said, we were still Ansible novices and hadn't solved all the problems yet. And above all: Ansible is not a programming language, and we were completely misusing it as one. We were trying to do conditional operations and complex operations that would be pretty easy in any other language, but in Ansible you're trying to do it with YAML and its DSL. That's when we realized we had made a huge mistake. But we were already too far into it, so we had to keep pushing forward.

So we started to attack the problems and figure out solutions, or sometimes just workarounds. A lot of these solutions didn't come from me; they came from the other engineer on our team, Mike Nguyen, who's not here today but has been very helpful, and Jonathan over there helped us as well. First, improving the reboot handling. One of the problems with just issuing a shutdown, or shutdown -r, command from Ansible is that the SSH connection can get terminated before Ansible knows the command was actually issued successfully.
The common way to solve this now, and at the time we came up with this, Ansible didn't really have a good way of telling its users how to handle reboots; it wasn't until about Ansible 2.0 that they actually put out an article saying "well, this is how you do reboots," which makes perfect sense, right?, is to make it an asynchronous action: you add a sleep before the shutdown in a single, logically combined shell command. That actually worked pretty well; we were able to reboot pretty consistently. But we still had other problems in that space. If the reboot command doesn't succeed, the host never goes down, and Ansible doesn't know it hasn't gone down. When it reaches for the host again in a later task, it just assumes the reboot happened, and if you're expecting something to have changed because of the reboot, your tests are going to fail. So what we came up with was comparing the boot ID from the proc filesystem. I'm not going to show you the code, because it's just taking the boot ID before you reboot and comparing it to the new one when the host comes back.
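The talk doesn't show the code, but a minimal sketch of the combined pattern, asynchronous shutdown plus boot-ID comparison, might look like the following. The modules and the /proc path are standard; the task wording and variable names are illustrative:

```yaml
- name: record the boot ID before rebooting
  command: cat /proc/sys/kernel/random/boot_id
  register: pre_boot_id

- name: reboot asynchronously so the dropped SSH session is not an error
  shell: sleep 2 && shutdown -r now
  async: 1
  poll: 0
  ignore_errors: true

- name: wait for SSH to come back (Ansible >= 2.3 also has wait_for_connection)
  local_action: wait_for host={{ ansible_host | default(inventory_hostname) }} port=22 state=started delay=30 timeout=300

- name: record the boot ID after the reboot
  command: cat /proc/sys/kernel/random/boot_id
  register: post_boot_id

- name: fail if the boot ID is unchanged, i.e. the host never actually went down
  fail:
    msg: "Boot ID did not change; the reboot never happened"
  when: pre_boot_id.stdout == post_boot_id.stdout
```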
Here, this is the slide I wanted earlier, this is the part I wanted to show you. The default output from Ansible is awful. Here I've run a command on the Atomic Host, atomic host status. Run interactively as a user, it gives you a nice, pretty summary of the state of the system: the version you're running and the commit ID. But if you tried to parse what Ansible prints, you'd be pulling your hair out, because it's a mess. So Mike wrote a callback plugin to make it pretty. It takes the result object from Ansible and breaks it into its different parts, standard out, standard error, the message, the return code, and whatnot, and it actually honors the line breaks that come through on standard out. The same command now looks like this in our logs. When we have an error in a test, we can go into the logs and easily pick out that the return code is non-zero, for example, not in this example, but if there were a failure, and we can look at the standard out and even the standard error. This was a huge one; it made our lives so much easier once we could actually look at the output and understand it.

Then there's the problem of capturing information from the host after a failure. Mike came up with a kind of hacky way to use a handler to go onto the host and pull out the journal: a role that calls a handler on failure, which then sets up some names and grabs the journal. You shouldn't have to do this much work just to tell a test framework "give me the logs from a failed system," but we were bought into this, so we had to do it, and it actually works. It's not pretty, but we were able to get logs and journals and whatnot.

Then the failing-fast problem: how do we test the rest of the system that we missed after an early failure? Mike came up with the idea of a meta-playbook. Mike should really be giving this talk, but he's in Hawaii. Tough life, right? Once we switched to Ansible 2.2 or 2.4, I can't remember which, we were able to use block and rescue in a way better suited to our use case. If you can see this on the screen, we've got a block and rescue definition that includes some tasks and sets a fact based on the result of the included tasks. At the very end, it rolls all of those facts up into one and prints out a nice log of which parts of the test passed and which failed. But again, you shouldn't have to do this in a test framework; it should happen more naturally.
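A stripped-down sketch of that meta-playbook idea, assuming hypothetical test paths; block/rescue and set_fact are standard Ansible, and the rollup at the end is the part described above:

```yaml
- hosts: all
  tasks:
    - block:
        - include_tasks: tests/sanity/main.yml   # plain 'include:' on older Ansible
        - set_fact:
            sanity_result: passed
      rescue:
        - set_fact:
            sanity_result: failed

    - block:
        - include_tasks: tests/upgrade/main.yml
        - set_fact:
            upgrade_result: passed
      rescue:
        - set_fact:
            upgrade_result: failed

    # roll the per-test facts up into one summary, then fail at the very end
    - debug:
        msg: "sanity: {{ sanity_result }}, upgrade: {{ upgrade_result }}"

    - fail:
        msg: "at least one test failed"
      when: "'failed' in [sanity_result, upgrade_result]"
```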
As for the problem of Ansible not being a programming language, there's nothing you can really do to get around that, but you can work around some of it by using inline Python. When I sent my slides out for review, I was told I should be prepared to defend this, because I guess there are people better at Ansible than I am who would just write a bunch of YAML to do this instead. But you can do these kinds of hacks: splitting strings and capturing the pieces, using one variable only if another variable is set, otherwise falling back to a different one. Not fun to read by any measure, but these are the kinds of hacks and tricks you can do with Python expressions in your Ansible playbooks if you care to.

Selecting a subset of tests was a pretty easy one: we just started using tags. We tagged everything into functional groups, this is the upgrade set of tasks, this is the reboot set of tasks, and that made development a lot easier, because we could say "run just this set of tests" when we were trying to debug which one failed. It still wasn't great, though: if you forgot to tag a piece of the functionality, you'd be left wondering why it didn't run correctly.
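A sketch of both tricks together; the "inline Python" is really Jinja2 expressions, and the names here (requested_version, the registered host_status variable, the tag names) are illustrative rather than taken from the actual repo:

```yaml
# String surgery in Jinja2: use an override variable if it is set,
# otherwise fall back to splitting a previously registered command output.
- set_fact:
    target_version: "{{ requested_version if requested_version is defined else host_status.stdout.split()[-1] }}"
  tags:
    - upgrade

- name: upgrade the host
  command: rpm-ostree upgrade
  tags:
    - upgrade

- name: roll back to the previous deployment
  command: rpm-ostree rollback
  tags:
    - rollback
```

With tasks grouped like this, something like `ansible-playbook tests/main.yml --tags upgrade` runs only the upgrade group, which is the selection mechanism described above.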
The way our repo is set up right now: at the top level we have a roles directory, a callback plugins directory, and a tests directory. The tests directory has all the playbooks, which require the roles at the top level. You could do a relative path to the roles, but that didn't look pretty to us, so we just symlinked everything. Again, not the best or prettiest solution, but it got us around the problem, and it's been working so far. Symlinks are symlinks.

So, the conclusion. It was a lot of pain, not just some pain, a lot of pain, but it worked for our use case, and our use case was pretty well defined. We had this immutable host, and, the other thing I should mention here, these were all basically single-host test cases: one playbook run against a single host, just testing functionality, and that's it. We never got into the larger scenario of running a Kubernetes cluster across six nodes, that type of thing. It was relatively easy to develop, maintain, and execute these playbooks. It didn't require any specialized software on the host, which was big for an immutable host: all we needed was SSH, which every Linux system has, and Python, which most Linux systems have. It was easy for us to run these from a Jenkins job. And the other big gain was the ability to write a test once and, with a little bit of sugar, have it work across multiple variants.

We ended up with this: our ad hoc dashboard, basically, in the repo itself on GitHub. Here we're testing eleven different variants of Atomic Host on our internal Jenkins. We publish the results to an S3 bucket so you can get the logs, you can still get the logs. You can't see the jobs themselves publicly, but you can see the results publicly, which was important to us for the community distributions like Fedora and CentOS. And we got a nice badge generated from badges.io, so it kind of looks like we know what we're doing. It's mostly green, so we must be doing something right. That was our bar for success: we're testing eleven-odd variants, plus the RHEL variants that are internal. We were doing pretty well.

However, and this is the swerve at the end of the talk: don't do this. Just don't use Ansible as a test framework. It's not suited for it; you saw all the problems we had, you saw the workarounds. Use something else, a Bash script would probably be better at this point. You could use Ansible to prototype some tests that fall into the model of "I need to run some commands on a host and make sure they ran correctly"; that's a fine use case, I think, because Ansible has the underlying functionality to do that. But again, it's not a programming language. There's actually a blog post out there with a title along the lines of "Ansible is not a programming language," and it instructs you not to do exactly what I just did. We had to use hacks, workarounds, and general abuse of Ansible. If we were starting over, we would use something like pytest, or Avocado, or openQA, anything other than Ansible, I'm sad to say.

And as I said at the beginning, now that we're working on Red Hat CoreOS, we got really lucky: the CoreOS folks have a really slick test framework called kola, written in Go. It provisions across all the different clouds they support, it captures the logs and the errors automatically, and you can write native Go functions and run them on the host. It really expands what you can do in terms of testing an immutable host, and I'm actually looking forward to using it as we test CoreOS going forward. That's it, thanks for listening. Any questions? Yes, sir, Adam. Oh, wait, microphone, sorry.

I'm just curious. A whole thing got written specifically for Fedora Atomic Host, called Two Week Atomic, and that has its own set of tests, which run in a thing called Autocloud. I didn't know until now that there was this whole other set of tests, also being run on Fedora Atomic Host, which presumably isn't looped into the release process in any way. How did that happen, and are there plans to change it?

Communication. I don't have a good reason other than poor communication, and, like much of the open source community, we had a limited amount of time to get the things done we wanted to get done. I remember having discussions with Kushal Das; I think he actually got some of this working in Autocloud, but we never got it fully plumbed in.

Yeah, this stuff came first, to be clear: you were first, then the Two Week Atomic stuff came after. I just checked the timelines. 2014 or something?

Well, that's when I got hired. We didn't start doing this until about 2016. I think there was a little bit of overlap. Anything else? Go ahead, Mike.

In my workplace, we're using Ansible for patching: homogeneous workstations and dozens of servers running very different applications. Usually our patching verification is "make sure the service is running, make sure it's listening on a port." Even if we were patching them manually, it seems like this would be a good way to do that basic system verification.

Yeah, I think that's a valid use case for Ansible: describing the state of the system. Ansible does a very good job, once you describe the state of the system, of doing everything it can to make sure it is in that state. So if you're saying "apply this patch and make sure this service is started," it will make damn sure, as best it can, that the service is started, and that's your verification check.

Right, but what about making sure it actually started successfully?

Yeah, I believe the service module for Ansible has ways to configure how it decides success; you could look for a particular port being open, like you mentioned, or a particular string that gets output in the logs. Don't quote me on that, because I haven't actually used those options.

Right. A simple example is: the service is enabled, but it doesn't start at boot, and you have to manually start it five minutes later. That's the kind of thing we want to detect, and I feel like this would be a good way to solve it, to be honest.

Yeah, Ansible has its use cases, for sure. As a test framework, though, Jonathan and I had this discussion and came to the conclusion that we got the mileage we needed out of it, but it's time to move on to different things. Use Ansible for configuration management, for large-scale deployments, that kind of thing. Don't use it to test software.
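The patch-verification pattern from that exchange is easy to sketch; service and wait_for are real Ansible modules, while the group, service name, and port are just examples:

```yaml
- hosts: patched_servers
  become: yes
  tasks:
    - name: ensure the service is enabled and running after patching
      service:
        name: httpd
        state: started
        enabled: yes

    - name: verify it is actually answering, not merely "started"
      wait_for:
        port: 80
        state: started
        timeout: 30
```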
Anything else? I'm sorry if I disappointed anyone by showing up and telling you that it doesn't work, if you were expecting to find some magical unicorn of software, but them's the breaks, okay.

Is there a plan to try to be more unified around testing in the future? Because the whole problem here, basically, is that no one's maintaining these things. On the Fedora side, the Autocloud stuff is kind of dead; it just sits there and runs, but no one's maintaining it. Is anyone planning to move forward with kola, or anything else, and try to integrate so everything is tested using the same thing?

I'll say yes. There was somebody from Fedora QE, remember his name, Jonathan? Colin? He was commenting on the Fedora tracker looking for collaboration with Fedora QA. Kalev? Not Kalev. Camille, right? Camille, yes. So Camille has reached out to us, and I was actually encouraged that he's getting involved as early as possible. We're not quite there yet; on the Fedora side especially, Fedora CoreOS is still very much being defined, so we're not at the point of saying "hey, here's our tests", we don't even have an image for anyone to test against. We've got to crawl before we can walk. But the short answer is yes, we're having those conversations.

Yeah, and I've been asked to try to help with that as well, to make sure it all gets hooked up.

That's it. Have a good DevConf, everybody. Thank you very much, Micah.

So that was the final talk of DevConf. Right now, at 3 p.m., we are distributing a lot of fantastic prizes in a trivia game at the closing ceremony, which starts in the keynote room, that's the Metcalfe Large. See you there.