All right, so we don't have slides up, but I'm at pleia2 on Twitter, like Princess Leia, pleia2. I'll tweet out the slides after my talk if you want to grab them. So I'm here to talk to you about open source tools for distributed systems administration. First, I'll define what that means. By distributed, I mean the team is distributed: I work with systems administrators from multiple companies all over the world. Our team is made up of people from Australia and Russia, several in the US, and several throughout Europe, so we need ways to collaborate in that sort of environment.

My role on the team is as one of the systems administrators who has core root access on all the boxes. There are nine of us who have this, and then we're supported by a bunch of other people who work on the team. Our job is to make sure the OpenStack project infrastructure works, and that's why we're made up of people from various companies. We're all from companies that are contributing to OpenStack, so I'm from Hewlett Packard Enterprise, and we have contributors from IBM, Rackspace, and various other players inside the OpenStack community. That's also why we all work from home: we don't have an office to go into to work together, so we all pretty much telecommute. So that's pretty much our team.

Requests for changes to the OpenStack infrastructure come to us as change requests in our code review system, rather than as tickets and the things you might see in traditional open source projects that run infrastructure. So we don't typically have bug reports, and people don't submit things to our mailing list. They'll come and bother us on IRC and say, hey, this thing is broken, and we'll help them write a patch against our infrastructure, which I'll talk about. But we don't use a traditional bug system very much for people to report things to us; we expect them to help us solve problems with our team. This also means that the priority of requests that come into our systems administration team is not really determined by us. It's determined by the people who want to do the work to help our infrastructure, because it's all open source and anyone can submit patches.

So we have all of our system stuff up in Git. If you go to git.openstack.org/openstack-infra, it has a whole listing of, I think, over 100 projects now inside of our infrastructure that help everything run. We have instructions available for anyone to submit patches against that, so we have very detailed documentation: this is how you download the Git repository, this is how you submit changes to our code review system, and then the rest of us will pitch in and help you review that code and get it into our infrastructure.

As far as tooling goes, since we're an open source project and we work with a lot of companies inside of OpenStack, we only use open source tools, which is the other part of the title of the talk: doing all of our work using only open source tools. So we don't have a Slack channel; we use IRC, and we use Mailman mailing lists. And then on the side of what we actually run in the infrastructure, we run the full continuous integration system that OpenStack uses, so that's Gerrit for code review, a series of tooling to connect that up to Jenkins, which runs all the automated tests for OpenStack, and then pretty much everything else a developer will interact with: all the wikis, all the IRC bots that run in OpenStack channels, the paste bin, the Planet blog aggregator. Everything in OpenStack that people interact with is pretty much run by the infrastructure team.
So we built this continuous integration system for OpenStack, and we had sort of been running our own systems traditionally, where people would submit bug reports, we would collect them, and then we'd work on them. But we realized we had this whole awesome continuous integration system, and we could start testing our system changes as well as all the OpenStack changes. So the first tool that I want to talk about is our continuous integration system. OpenStack is pretty much written in Python, so we already had a lot of Python scripts in place, and on the infrastructure team we pretty much decided early on that we'd write our system scripts in Python rather than bash, so that OpenStack contributors could help us with the infrastructure, and we could also run all of the automated tests we were already using for OpenStack on all of our system stuff. We also made sure that all of these tests were completely automated, so it really was not much extra work for us to run the infrastructure tests alongside the actual OpenStack project tests.

So in our workflow, a developer or a systems administrator will download the Git repository of whatever they want to change and make the change. It gets uploaded to Gerrit, which is the code review system, and then it goes to a thing called Zuul, which is our gatekeeper; if anyone's watched Ghostbusters recently. Zuul will queue everything in place and make sure dependencies are tested against each other. This then goes off to a job server called Gearman, which sends it off to our fleet of, I think, eight Jenkins servers now. They're all set up as masters and they don't know about each other, but Gearman will distribute the jobs across the Jenkins masters, depending on the criteria. Again, this is used for all of OpenStack, and it's also used for our infrastructure.

Some of the things we test: I mentioned Python, but we're also using a lot of Puppet in our infrastructure. So we do puppet-lint tests against our code to make sure the Puppet files look nice and are formatted nicely, because again, we're a team that's distributed all over the world across a bunch of companies, and we have random people submitting Puppet changes to our code all of the time. So we wanted to make sure it always looks nice and syntactically clean, and that was kind of the lowest barrier for our Puppet tests. We then added puppet parser validate, which does, I don't know if you'd call it unit testing, but a little bit more checking to make sure the code is actually syntactically valid; it will make sure that brackets are closed and your commas are there and things like that. And then we also use beaker-rspec to do full-on "this thing can actually deploy" tests. That's a relatively new thing that we're still working through, but those tests are going really well so far.

So we're pretty sure about the code once it runs through these tests, which it does as soon as the code is uploaded, before anyone has merged it. It runs all these tests, then people come and, you know, humanly review all of it, and then it goes through another series of tests to make sure nothing that merged in the meantime breaks it. And then it finally lands in our infrastructure. We've also thrown in a few other tests over the years. We know the syntax of some of our XML files, so we'll do checks against the XML syntax to make sure that's correct.
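Just to make that concrete, here's a minimal sketch of the kind of XML well-formedness check I'm describing. This isn't our actual job; the script and the file list in it are made up for illustration:

    #!/usr/bin/env python
    # Illustrative sketch of an XML well-formedness gate, similar in
    # spirit to the checks described above. The file list is hypothetical.
    import sys
    import xml.etree.ElementTree as ET

    FILES_TO_CHECK = ["jobs/example-project.xml"]  # made-up path

    def main():
        failed = False
        for path in FILES_TO_CHECK:
            try:
                ET.parse(path)  # raises ParseError if the XML is malformed
            except ET.ParseError as err:
                print("%s: %s" % (path, err))
                failed = True
        # A non-zero exit code is what makes the CI job report a failure.
        sys.exit(1 if failed else 0)

    if __name__ == "__main__":
        main()

If a patch introduces a stray bracket or an unclosed tag, a job like this exits non-zero and the change gets flagged in review before any human has to look at it.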
We also want to keep a lot of our project files alphabetized, and it turns out humans are really bad at the alphabet, so we have an automated test to make sure those files stay alphabetized. We also used to do a lot of manual work when people were adding IRC channels to the OpenStack project: we wanted certain permissions set up on them so we wouldn't lose control of them or have to bug the freenode staff. So we now have an automated bot that will log on to freenode and check that the channel has the right permissions, and then come back and say, yes, that test passed. This is a really cool one, because I was the one who was always checking to make sure things were set correctly in ChanServ, and I was like, why am I doing this? We've been able to just throw whatever tests we wanted at this system, and it's actually been really cool for us.

It also means, since we're using code review, that we're having our peers review our code. So we don't have a change management system where people who don't know what they're doing are reviewing our stuff and taking forever; it's just peers on the OpenStack infrastructure team reviewing things. And since we're doing that before we merge the code and make it go live, it prevents some funny things from happening. I had a funny slide of a patch that I wrote that was really stupid. It was like a double negative, "if not this thing," and one of the guys I work with was like, Liz, you could just put "equals." You know how it happens: you're writing something, you change your mind halfway through, you rewrite it, and then you have this crazy thing in there. It would have worked, and I could have just merged that into the infrastructure and it would have been fine, but because we have peer review, we can pick up on silly things like that.

It also means that anyone, as I said, can submit changes to our infrastructure. So we have people from throughout the OpenStack project who want to add a test, or want to make a change, or in one case, some of the guys wanted to add an Asterisk server. All the core people on the team were like, we're not running that, because it would be in the cloud and we don't know about Asterisk and PBX cards. So some volunteers in the community came along and said, actually, we know a lot about this. They launched the Asterisk server, and they have been the caretakers of it ever since. So it really empowered them and the community, and the company they were working for was able to put an allocation of resources toward that. And they didn't have to block on us or anything; we reviewed the code and it was tested, and that was fun, but we didn't actually have to do the work.

So, as I mentioned, a change gets merged into the repository after it's been tested and all the code review has been done, and then we use Puppet and Ansible to actually go and deploy those changes as soon as they merge. We also use a VCS module in Puppet, so for some of the projects we don't actually do releases on, we'll just watch the master branch, and every time an update is made to that project, we'll just go live with it and hope nothing breaks. But we did a bunch of tests, so it totally won't break. This means we don't log into our servers very often; everything's pretty much done through code review. When we do need to log into servers, it's usually to check on logs or restart something that stopped for some reason.
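To give you an idea of how simple some of these extra checks can be, here's a rough sketch, not our actual code, of an alphabetization test like the one I mentioned above; the file name is made up for the example:

    # Illustrative sketch of an "is this list alphabetized?" check.
    # The file name is hypothetical. Comparing the list against a
    # sorted copy of itself is enough to catch the mistakes people
    # reliably make when adding entries by hand.
    import sys

    def check_alphabetized(path):
        with open(path) as f:
            names = [line.strip() for line in f if line.strip()]
        if names != sorted(names, key=str.lower):
            print("%s is not in alphabetical order" % path)
            return False
        return True

    if __name__ == "__main__":
        sys.exit(0 if check_alphabetized("projects.txt") else 1)

Like the rest of these checks, the real tests run automatically on every proposed change, so nobody has to remember to run them.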
But all of this also means that a lot of people in the community don't have a very good view into our systems, so we want them to have one, and we run a public Cacti instance. cacti.openstack.org has information on all of our servers. One of the really good things about this is that if someone wants to replicate part of our infrastructure, because we really like our CI system and we want other people to use it, and they ask, well, how big is your Gerrit server? We can say, I don't know, go look at Cacti, it'll tell you how much RAM it has. It also allows other people to debug problems for us. Sometimes a disk just fills up and we're not answering very quickly, so if something happens and everyone's asking what's going on, anyone can just go look at Cacti and say, well, a log filled up your server, and so everything stopped. So it allows people to be proactive about helping us debug things, even if they don't have a login to the server. There's a picture of Cacti; you know what it looks like.

We also use a tool called Puppetboard, which is a dashboard for Puppet. It's typically not something you'd run publicly, but if you go to puppetboard.openstack.org, we do. We've had to turn off a few things in Puppetboard to make it safe to run in public, but it does allow you to see when changes are being applied. So if you submit a patch and it gets merged, you're able to see when it merged and what exactly ran. If your patch merges and it fails, like it didn't do the change you wanted it to do, you can see in Puppetboard whether it made that change or not, and then you can write a new patch to follow up and fix what broke without having to ask one of us to log into the machine and look at the logs to find out what happened. This is probably my favorite part of being able to not log in, because I spent two years on the team before I could log into the servers, and I didn't want to be bothering my coworkers all day being like, I broke a thing again, why did it break? I could just go to the dashboard.

We also have a lot of documentation. If you go to ci.openstack.org, we have links to a bunch of our system documentation, which covers everything from how to submit a change to our Puppet code and test it, so you can do testing on your side beforehand, including pulling down all the modules and everything that we use. It also gives you specific documentation as far as all of our servers go. So if you want to add a server to Cacti, we'll tell you which file to edit and which repository to submit that change to. If you want to add a server to our infrastructure, we also have instructions for that. So the team that worked on the Asterisk server could just look at our documentation and say, okay, we're going to read through this, we're going to propose the changes, and then we're going to talk to the infrastructure team about what we're going to do here. And so they had all the information they needed to actually launch the server at that point.

So automation is great, and not logging into servers is really nice. My SSH key lives at home, and that's far away. But we do have to do some things on servers, and we've managed this a bit with tooling and a bit with social constructs in our project. We have come to the conclusion that complicated migrations and upgrades can't really be done through automation alone. When we upgrade Gerrit, we don't just change the version number and then let Puppet go to town, because that would break things. So we have manual processes in place.
So in that case, if we're doing a manual migration or some sort of upgrade that needs attention, we'll typically get together in an IRC meeting beforehand, and we'll work on an etherpad to come up with a plan of attack. The etherpad will typically have every single command that we plan on running during the migration or upgrade, all written out, even if it's easy, even if it's obvious, because when you get in the zone doing an upgrade, you forget how to run a MySQL command. So we make sure it's all spelled out, and also, since Etherpad is a collaborative tool, we can review it: everyone can make comments and edits, and when we review it, we can make sure that we're all on the same page and all the commands are correct. Plus, we're already used to reviewing each other's work, because every single patch that comes into our infrastructure is already reviewed, so culturally it's not a big leap for us to do that.

We also have this issue where, when we initially launch a server, we're still doing that manually. We haven't quite figured out how to do that in a way that's automatic for us, because we have to make some human decisions when we do a deployment. And honestly, even though we test all of our public code, sometimes we bring up a server and it still just doesn't work: we forgot some line in our Apache config, or we didn't add the thing in Puppet to actually start the daemon after we installed it. So we do still have a manual process for launching servers. But we do have tooling on some of our root servers that can actually do the deployments. We have secret Git repositories that hold all of our passwords, keys, credentials, and things like that; the root admins have access to those, and there are instructions on the machine for how to do some of the manual processes. And of course, as I said, passwords are not open source. We need to put those somewhere private, but they are still in Git, so they're pretty safe, there's history, and we can look back and see when changes happened.

As far as day-to-day work goes, we don't really use phone or video, because we don't like it. So we're all on IRC all day. All of our channels are completely public and they're on freenode. We have an #openstack-infra channel, which a lot of people hang out in, because whenever anything goes wrong with the development workflow, there'll be a lot of people in infra talking about it. I think we've got about 400 people in our channel right now, just sort of watching and making sure Gerrit hasn't broken or Jenkins didn't fall over. So that's sort of our home base, #openstack-infra. When something goes horribly wrong, a bunch of us will pop over to #openstack-infra-incident so we can focus. People will be joining the infra channel saying, what's wrong, the world is on fire, but we'll be able to be in our incident channel focusing on actually fixing the problem. We do have some people who will run interference in the main channel and say, yes, everything's broken, they're fixing it right now, but we'll be able to be in our own space and work on it. We also have an #openstack-sprint channel. This was sort of born out of the fact that a lot of projects have in-person sprints, but apparently, as much as we don't like phones and voice and video, we also don't like seeing each other in person. So we decided to do virtual sprints.
We don't need to get together in person. We have the sprint channel, so if we want to work on something, say our upgrade from Puppet 3 to Puppet 4, for instance, we'll probably all work in the sprint channel, get all of our patches in line, and make sure everything is lined up and reviewed. We'll typically leave the infra channel that day and all of us will just join the sprint channel so we can focus. All the logs are public: we have eavesdrop.openstack.org, where you can see not only the infrastructure channel logs but all the OpenStack project channel logs. We also have weekly meetings, and those are always logged, with minutes posted in the same place. The weekly meetings are kind of a check-in with everyone on the team: we review our priorities and make sure we're all on the same page about what projects we should be working on. We use our paste bin a lot, you know, for sharing output. If someone has launched a service on one of our servers and they're like, hey, can you look at this log and find out what it's doing, I can just dump the output into the paste bin without giving them a login.

I joke about in-person stuff, but we actually do get together in person every six months for an OpenStack developer summit, so we actually get to see each other and talk in real time, in person, about our issues. And this is something that's really important to us. It's fine to have a team that's distributed all over the world, but we find that the culture tends to decay over time if you don't see each other in person. It's something I was surprised by, honestly, because I really like my IRC and my cat, but it turns out we have to go and see each other. It really revitalizes the team, and it helps us connect and helps us work together. Because someone could be a real curmudgeon on IRC, but as soon as we go out and have drinks, we realize they're okay.

One of the things we have struggled with, which I know a lot of teams struggle with, is handling time zones. We are, again, spread all over the world, and I think our guy in Australia hates all of us. A lot of us are in the US, so we're signing off as he's sort of starting his day, and it's really hard. The only thing we've been able to do to improve the situation is add more people in that time zone, so they're not alone. So he often works with us during our evening, and then he'll work with the people in Europe during his evening and their morning, and that's when he'll make all of the major changes. He'll merge patches and restart services, usually when someone else is around. And then when he's alone, he usually does reviews, maybe holding off on some of the major changes he might make, because if no one else is around, he doesn't want to break everything and have to fix it by himself. Because that's no fun for anyone. We also manage a lot of servers in our deployment, so we have one guy who's the expert in Elasticsearch, and I pretty much run the translation server, so people aren't going to touch those things unless one of us who knows more about them is around. So really, just adding more people in the time zone is the only thing we've done to help with that. We've also worked on improving our handoff between shifts. If there are incidents, we don't have a formal way of doing it really, but we'll sort of let the next people know, like, hey, this thing has been going on, or we just restarted the code review server because it keeps eating all the RAM again.
We'll sort of let them know what's been going on during the day, and I think there's definitely room for improvement there. The time zones also make onboarding slower. Since I was in the US, where most of the other root admins were, I was able to just pitch in: they'd be working on something and I'd be like, hey, I haven't learned that yet, can you show me how to do that? But the people in the other time zones don't really get that opportunity quite as much, so they have to do more formalized meetings with us to make sure they understand all the new components. So onboarding is definitely slower for folks who are distributed further away from the rest of us. But honestly, the work, even done in a distributed fashion, works pretty well for us using all of these tools. They're all open source, and we all pretty much love our jobs. And it's an open source project, so if anyone's bored or wants to get experience with ops, you're welcome to come hang out, learn about what we do, and get some experience. I've found that people who do that often end up getting hired, because we realize we now can't live without you, and I go to my boss and say, can you hire someone? So if you're looking for opportunities and experience, and maybe even a job at some point, you're welcome to come join us on OpenStack infrastructure. And that's it. I have a couple of minutes for questions, I think. Any questions?