All right, hello, can you hear me? Good. My name is Matthew Stone, and I'm a senior software engineer on the StackStorm project. We're an open source event-driven automation platform, which I'll get into a little later. This talk is effectively a story, and like all good stories, there's a villain and there's a hero. The villain in this case is software releases, and releasing StackStorm in particular; if you've ever been a software developer, you know how much of a villain a software release can be. The hero, I wish I could say, was exclusively StackStorm, but it's a little more like The Avengers: many different tools come together to save the day and solve some of our release problems. So I'm going to talk first about our release process and why it was such a pain, and then go into how we used StackStorm and some other tools to make releasing a little better for us.

Our release process was a little like the Wild Wild West, and you'll notice this whole presentation has a Wild West theme. You're welcome. The release process starts with us dividing up work. We look at what we want to accomplish for the release: the big features we want to add, the refactors we want to push in, any bug fixes we need, and we divide that work up between team members. We also get plenty of community contributions during the course of a release, as well as newly reported bugs, which we try to fix during the release cycle and get into the release. So by the time we get to release time, the delta between the previous version and the version we're trying to release can be pretty big.

We use some standard things to guard against breaking changes in our code base: unit testing and integration testing, things you'll all be familiar with. One of the cool things we do with StackStorm, which predates even the work I'm talking about today, is end-to-end testing, which is basically functional testing, but a little more beefed up. We have a CI node inside our infrastructure, and every time new code is pushed into master, it pulls that code down and runs end-to-end tests across all of our supported platforms. Those end-to-end tests end up being things like running actual StackStorm commands and running different actions, to make sure the software in master is actually functioning correctly.

But there are a couple of problems with that. Number one, it obviously doesn't cover all cases; there are gaps in what gets tested. The second problem is that it doesn't last very long. We stand up an instance, run our software, test the things we're going to test, and then turn it off, so anything that only exposes itself over time, like a memory leak, doesn't get caught. We needed a way to solve these problems, because what we were seeing is that at the end of release time, right before we actually ship the new software, we do QA.
So as a team it's all hands on deck: everyone gets on board, downloads the newest version, runs it, and sees what problems crop up. And we were finding a lot of bugs in that process, which inevitably delays the release of StackStorm. If we have a new release coming out and we find several bugs, we can't just fix them quickly, throw it out there, and say, whatever, we fixed it. We have to go through QA again and test to make sure the new fixes actually work. So we needed a way of catching breaking changes earlier in the process.

What we needed was a solution. We had a meeting right after one of our releases, and it basically consisted of us brainstorming about what we could do to close some of these gaps and find these bugs earlier. One of the ideas thrown out there was an RC release, a release candidate: earlier in the process we would put out a release candidate and have it available for people to download and test. We actually still might do that at some point, but the problem we had with it is that, not being as broadly consumed a product as, say, Linux, we have access to far fewer people who could actually download and test it, and we were afraid the feedback wouldn't be as good as it would be for a much more widely used product. Another idea was to leverage a demo server we have in our infrastructure: any time new code was committed to master, we would rebuild the demo server. The reason we didn't go with that is that the server is used relatively infrequently compared to some of our other instances, so if there were a bug, we would find it really late in the process and have a similar problem to the one we had before.

So we went a bit nuclear with it. We were inspired a little by the canary process, the way SaaS offerings do canary releases: if you've got a thousand instances of your application running, make a subset of them run the new code and see what errors come back. If there are breaking changes, that's okay; roll back those instances, fix the problems, and release again. We wanted to do something similar, but we knew we'd have to modify it a bit, because we're not a SaaS operation at that scale; we have one instance that runs and does our CI for us. So we decided to rebuild our CI server every day. It's the most heavily used server in our infrastructure; it runs StackStorm to run those end-to-end tests I was talking about earlier. Because we use it so frequently, we knew it would expose a lot of the problems we were hitting at the end of QA: if it were running the latest version of StackStorm, we would see any bugs that cropped up immediately. We knew that would have some pretty interesting implications, so we laid out some assumptions before embarking on this. Number one, bugs are going to happen, so we need all of this defined in code and easily recoverable.
So if we actually have a problem, we need to be able to recover from it very quickly, because it's our CI server and it can't be down for long. We also knew it wasn't going to be fun at the beginning: there was going to be a pretty big learning curve and some real growing pains in implementing and using this, so we went into it knowing it would be painful up front. The other thing we were hoping for is that it would raise the stakes for anything merged back into master. If people know that the CI server can break when the change they're pushing is inadequate or breaks something, it puts more emphasis on the person trying to merge it to be diligent in their testing and their own investigation, to make sure the change solves a problem without breaking other things. And it puts more on us as reviewers to actually pull the code down before approving it and make sure it runs properly.

That presented some interesting technical problems, especially the first one: being able to recover very quickly if we had an issue. What we used to solve this is three-fold, and these are the Avengers: StackStorm, Ansible, and Terraform. I'm not sure which one is the Hulk, though.

Terraform, if you're not familiar with it, is a DSL for describing your low-level infrastructure: things like instances, storage, and load balancers. It keeps our low-level infrastructure defined in code so we can reuse it, and whenever we have to recover from something, we can rebuild it very easily. On top of that we have Ansible, which we use for configuration management of the CI node. It's a non-starter to try to automate this without a clean way of rebuilding the server over and over again; a bash script that tried to do this, or anything else without real configuration management, would just break constantly. If you've ever managed an infrastructure without configuration management, you know what I'm talking about. Using Ansible also gives us options, because we have an Ansible pack inside StackStorm, so we can run Ansible one of two ways. We can run it manually through StackStorm, because the pack lets us say, hey, run this playbook against this server. Or we can run it with ansible-pull, which is the primary way we get new configuration onto servers in our infrastructure.
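To make that side of it a little more concrete, here is a minimal, purely illustrative sketch of the kind of playbook an ansible-pull run on the CI node could apply. The role names and layout are my own assumptions for this example, not our actual configuration.

```yaml
# local.yml -- illustrative only; ansible-pull clones a git repo on the
# node itself and, by convention, falls back to a playbook named local.yml.
- name: Configure the StackStorm CI node (hypothetical roles)
  hosts: localhost
  connection: local
  become: true
  roles:
    - common          # base packages, users, monitoring agents
    - stackstorm      # install the freshly built st2 packages
    - end_to_end_ci   # drop in the end-to-end test harness
```

The pull model is what makes this workable for a node that gets destroyed and rebuilt constantly: the freshly built server reaches out for its own configuration rather than waiting for something to push configuration to it.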
And then on top of all that, finally, we have StackStorm. With StackStorm, we pull all of those components together using a workflow. If you're not familiar with StackStorm, I'll give you a brief overview. It consists of actions, rules, and sensors, and it's a generic automation platform, so it's really anything in, anything out. I realize that's a broad statement and hard to conceptualize, but hopefully by the time I'm done describing it, it'll make a little more sense.

The sensor end is where we pull in events, and an event can come from anything. It can come from an OpenStack cluster, from Sensu, from Nagios. It can even come from Twitter; I actually have a demo at our booth that uses our Twitter sensor to fire whenever we see certain matched keywords. On the other end we have actions, which are either an atomic action, hey, go do this one thing, like send a REST call or run a shell command on a server, or a complex workflow. We leverage the OpenStack Mistral project as our workflow engine, which lets you create simple or complex chains of those actions, taking variables into account at each stage. If you want a workflow that mimics what you do when you provision a data center or a certain application, you can write all of that out in Mistral and have StackStorm run it. If you want to auto-remediate something, say an application failure where you want to log in and fix the problem, or even stand up new instances to fix it, you can lay all of that out in Mistral too. We tie sensors and actions together with rules, and rules are where we get our if-this-then-that comparison: if a certain event meets this criteria, then run this action or workflow. That, at a high level, is what StackStorm is.

So we leveraged those pieces to automate the build process I was telling you about earlier. Our sensor in this case watches CircleCI, which we use to handle our events from GitHub. We start the process of building the RPM and DEB packages there, and once that finishes and passes all of our CI and integration tests, we kick it over to our actual StackStorm instance. A sensor on the CircleCI side brings in that event; once we see it, a rule ties it to a workflow that automates building the new instance of the CI node I've been telling you about. At a high level, the workflow builds the Terraform definition, plans it to make sure there are no syntax errors, and applies it to stand up the new server. ansible-pull then pulls in the configuration for the CI server and configures it, while the workflow sits there waiting for that configuration to finish. Once the node is configured, the workflow tests it; we have a test script that runs to make sure the node is actually operating properly, because we don't want to transfer all of the load over to this server unless it's functioning correctly. Once we see that it's running correctly, we transfer the load balancing rules to point to the new server, and once that's done, we destroy the old node. And now we have a brand new CI server running on the latest version of StackStorm.
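To give a feel for how the rule and the workflow fit together, here is a heavily trimmed, hypothetical sketch rather than our actual pack: the trigger reference, payload fields, and action names are assumptions made up for illustration, and the two YAML documents below would normally live in separate files.

```yaml
---
# Hypothetical StackStorm rule: when a passing master build arrives from
# the CI-side sensor, kick off the rebuild workflow.
name: rebuild_ci_on_green_master
pack: ci
enabled: true
trigger:
  type: circle_ci.build_passed        # hypothetical trigger reference
criteria:
  trigger.branch:                      # hypothetical payload field
    type: equals
    pattern: master
action:
  ref: ci.rebuild_ci_node              # the Mistral workflow sketched below
  parameters:
    revision: "{{ trigger.revision }}" # hypothetical payload field
---
# Hypothetical Mistral (v2) workflow skeleton for the rebuild itself.
version: '2.0'
ci.rebuild_ci_node:
  input:
    - revision
  tasks:
    plan_infrastructure:
      action: core.local               # assumes Terraform is set up locally
      input:
        cmd: terraform plan -out=ci.plan
      on-success:
        - apply_infrastructure
    apply_infrastructure:
      action: core.local
      input:
        cmd: terraform apply ci.plan
      on-success:
        - wait_for_configuration
    wait_for_configuration:
      action: ci.wait_for_node_ready   # hypothetical polling action
      on-success:
        - run_node_tests
    run_node_tests:
      action: ci.test_node             # hypothetical test-script action
      on-success:
        - switch_load_balancer
    switch_load_balancer:
      action: ci.update_lb_rules       # hypothetical load-balancer action
      on-success:
        - destroy_old_node
    destroy_old_node:
      action: ci.destroy_previous_node # hypothetical teardown action
```

The point of laying it out this way is that each step is an explicit task with an on-success transition, so a failed plan, configuration, or node test stops the workflow before the load balancer ever points at a broken node or the old node gets destroyed.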
So the next logical question, now that we've implemented this on our production CI server, is: did it actually work? Was it worth it? Continuing with the Western theme, I have the good, the bad, and the ugly of what happened.

The good: in StackStorm 2.3, which is coming out pretty soon, we have a pretty big refactor of our API. We were scared going into it that the refactor was going to cause a lot of issues, and we didn't want to catch those at release time; we wanted to catch them early in the release cycle. It turns out that once we merged it and actually started using it on our CI server, we didn't have any issues with it at all, which is a testament to the guy who built it. But what we did find, and if you've done any software development this is kind of humorous and not surprising, is that a lot of other issues cropped up. It wasn't the API refactor we thought would be the problem; it was other stuff. We ended up fixing those pretty quickly. In fact, one of the problems was creating six gigs of log files every 12 hours, which is exactly the kind of thing we wouldn't have found from our end-to-end testing. So the good is that we found those issues early in the process.

The bad is that it wasn't perfect. There are still gaps; there's still not coverage of everything, and there are still issues we've found over time through various other channels, especially QA, that weren't caught on the node. There were other bad things too, but generally it was all right.

The ugly is a few UX things. One problem is that, because we completely destroy the old node and rebuild it, we don't keep any of the historical data. If there was a failure during an end-to-end test, we don't actually get to see it unless the node is still up; if a rebuild happened in the meantime and you didn't catch it, you can't go back and look at the old failure. That isn't a huge problem, because we put most of that information into Slack, but it is a little inconvenient when you're troubleshooting. The other ugly thing is what happens when multiple CI nodes are up at the same time. When we stand up new instances, we check CNAMEs to make sure there isn't a collision, and we can get false positives when two CNAMEs of the same type exist. So if we have two CI servers, say we're in the middle of building one and haven't destroyed the other yet, and they both try to run the same end-to-end test, we'll get false positives saying a build didn't work when in fact it did.

In conclusion, the biggest question we had was: was it worth it? Did the good outweigh the bad? Ultimately we found that yes, it has been. We found several bugs that we wouldn't have found any other way. The goal with software is to minimize the gaps you miss, and we've definitely closed some. In the future we want to improve this in several ways, one of which is deploying this through an HA version of StackStorm, where we have several different instances.
Some of those instances will be completely rebuilt every time there's a new merge to master, and some will just be upgraded in place, because we support upgrades from the previous version of StackStorm. That lets us test both paths, and it also helps prevent some of the other issues I was talking about, like two different CI nodes trying to run the same thing at once. So we want to improve this in several ways, but ultimately closing these gaps has been a positive thing for us, and we're already seeing it as we head into the 2.3 release next week: the QA process is going to be much smoother.

So that's it. If you have any questions, or you want to talk to me a little more about how we're doing this, or see a demo, we have one set up at our booth with a robot arm and a Twitter sensor, so you can see some of the ways we leverage StackStorm. Come talk to us at the booth; we'd love to see you there. Thanks.