for doing our interview with us. Appreciate it. Thank you. [pre-talk setup and mic check] All right, so after a while, to cozy up and make some new friends, we have here today Alex Juarez, a principal engineer at Rackspace. And I wasn't paid to say this, he probably doesn't know I'm going to say it, but we are users of Rackspace at our company, cloud sites and servers, and you guys have pretty awesome chat support. So good job. He's been with the company for eight years, he's an RHCA and RHCI, he enjoys all things Linux, and he crafts a killer cocktail, so find him at the bar later. I forgot it said that. So I'll turn it over to you, Alex. Woo. All right. That is not needed at all, but thank you. Thank you. All right, so the slides are already online and ready to go if you bring up the URL there: slides.unsupported.io/explore-strace. That way, when you're looking down at your laptops, checking your email, things like that, I can always think you're looking at my slides, so it's helpful. If anybody wants to get them, we'll leave it up there for a few seconds more.
All right. If you do bring them up, the cool part is, as I go along, you all can follow along with me, okay? So, like I said: your system calls and you, a brief exploration of strace. I don't like standing up there very much. There's demos at the end, I'll come back up, but usually I'm down here walking around. Just makes it easier for me to kind of go through this. All right, so first off, hello. And I think a lot of people don't start off with at least a hello, right? So hello. My name is Alex Juarez. I wasn't expecting an introduction, so that's why this slide is up there. I wasn't sure how the audience would sympathize with Adele or Lionel Richie, so they're both up there, right? So again, hello. And also, thank you for coming out. There are a lot of great talks in this time slot, some that I wanted to go to. It would just be weird if I went to them and didn't go to this one instead. So thank you for coming to this talk. Thank you for spending your time with me. I really do appreciate it. My goal here really is to entertain you long enough for you to learn a little bit of something. Hopefully you walk out with a little bit more knowledge. Hopefully you don't walk out with less knowledge; that would be weird as well. So how are we going to get there, right? Our agenda for today, kind of the topics we're going to talk about. First, we're going to talk about why this topic. Why spend 45 or 50 minutes talking about a very simple, very simple tool, a very basic topic. Then we'll talk about what strace is. The experience level in the room can probably vary from 10-, 12-, 15-plus-year admins to two- or three-year Linux users. So I want to get us all on the same page on what strace is, the same lingo, the same jargon. And we'll talk about system calls, which are kind of the core of everything going on here and the meat of it. After that, we'll do some demos, because what's a presentation without demos? A chance for things to go horribly wrong. It's always fun. So far so good.
Wi-Fi works. This is working. So that's kind of cool. Then we'll do continued learning. I can't show you everything about strace, or really anything, in 45 or 50 minutes. It would be impossible to really cover everything. So what I'd like to do as part of my role here is to get you enticed just enough to go out and learn a little bit more, and give you the tools and the opportunities to do that in an easy fashion. Make it easy for you to learn more. Make it easy for you to go out and just pick up more information. Lastly, we'll do some Q&A. And by then it should be lunch. So here we go. So why this topic? Again, why spend 45 minutes talking about strace? For me, it's two things. The first thing is about solving problems. At Rackspace, one of the unique things that we do in a managed hosting environment is that we get to see thousands upon thousands of configs every single day. While most people have environments where they run 10, 20, 100-plus servers, most of them are the same. Most of the configs are the same. They know what's running on them. At Rackspace, in a managed hosting environment, we have thousands of configs every day, running everything from your normal LAMP stack, a very basic stock install, to very custom apps, to the Java app that hasn't been updated in five years because the developers are nowhere to be found. And so we have all those bits of information, bits of things going on. What that does is present us with lots of opportunities, we'll say, opportunities to learn and to figure things out. A lot of troubleshooting, okay. And also this talk is about going back to, phone's dead, there we go, going back to basics. Because we have so many configurations on so many different servers, we can't guarantee we're gonna have the latest and greatest monitoring or application software, right.
We're not sure we're gonna have New Relic, right, or AppDynamics or any of those things installed on there, any type of monitoring. So this is a talk about a tool whose last really big push was in '93, right. I'm talking about a tool that started in the 90s, ported over from Sun. This is not the latest and greatest, right. But this is a tool that we can use to troubleshoot anything on almost any server anywhere, okay. That's why I call it a back-to-basics talk. That's why I find it super useful for the admins that I work with every day, so they know how to use strace in a very effective manner. Because no matter what's on that server, they can understand at the very, very base core what the system calls are doing to troubleshoot what's going on, okay. So that's kind of why this topic, right. So what is strace, okay. We'll cover two things. What does strace do, and how to use strace, okay. First off, very simply, strace can intercept a process, trapping the system calls and the return codes, and give them to you on the screen, okay. Everything that the application is doing and how it communicates with the kernel; it's letting you know what those system calls are doing, okay. So right, what are system calls, right. I've already introduced a new term, a new topic. System calls are those functions that the kernel exposes for you to use from your application. Most applications don't call system calls directly. They call libraries, right, things that are implemented by glibc or different C libraries, to then talk to that hardware, okay. So your system call is that kind of interface between what you're writing in code and what actually happens on the system, right, with the kernel, okay. There we go, there we go, yeah. So very basic, right, jumping right into it. We use strace, first off, by doing strace and then the application we want to trace. Here, simply strace w.
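As a rough illustration of what running strace around a command means, here's a small Python sketch (the helper name is mine, not part of strace; it assumes strace is installed and gives up quietly if it's missing or can't attach, as ptrace is sometimes restricted in containers):

```python
import shutil
import subprocess

def strace_lines(cmd):
    """Run strace on cmd and return the trace lines.

    strace writes its trace to standard error; returns None if strace
    is missing or exits with an error (e.g. ptrace not permitted)."""
    if shutil.which("strace") is None:
        return None
    proc = subprocess.run(["strace"] + cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        return None
    return proc.stderr.splitlines()

lines = strace_lines(["true"])
if lines:
    print(lines[0])   # the first line is the execve() call
    print(lines[-1])  # the last line shows how the process exited
```

The same idea as `strace w` at the shell, just captured programmatically so the trace can be inspected line by line.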
We'll do demos of all the examples you see up here. So don't worry about what the output is in the next couple of screens; we'll do demos of it and go through it step by step in the demo section. But what you might get is something like this, all right. You'll get the system calls that you see up here, okay. Hopefully, if you can't see it up here on the screen, it's a little small I think, you can see it on your devices or your laptops. [aside about adjusting the room lighting] Like any good admin, you don't know he's there; he's doing his job well. Oh wow, okay. Wait till something screws up in a second. Okay, so that's an example of the output you might get, right. We can build upon that command and then, oh, phone, thank you, phone. We can build upon that command with a dash v to make it a bit more verbose and get all the gory details. If you can't see the screen, notice up here this first line, the execve line. What's happening here is it's combining all of your environment variables into a small structure and not showing them to you. But with a dash v, it goes out and gives you all those gory details, okay? You know, you may hear sometimes that cron doesn't run with some environment variables, right? You can verify that by strace-ing it and seeing what environment variables it does have, what environment variables do get passed to it, right? So it can get really deep down into all the little gory details, okay? You see all your environment, you see your path up there, your home directory, right? Your login name, all those little details, okay? We can also trace processes that are already running. This is useful, again, for that Java app that no one wants to restart on the server that hasn't been restarted in three years.
We see those; somebody has one of those. Oh, those are fun. So we can trace a process that's already running, right? Also useful, in an example we'll see later on, for tracing daemons like Apache. You can have a web server, right, that's already running, that you don't wanna kill off. We can do that, okay? Now, in doing this talk a couple of times, this example always comes up, right? Big old warning. When you trace a process, strace sends a SIGTRAP signal to the process. Some processes can detect that and just shut down or act unexpectedly. Usually these are proprietary installers, think Plesk, right? Something they don't want you to kind of reverse engineer, okay? Something that involves a serial number or something like that. This happens all the time with us. When we install Plesk and it doesn't go quite right, we wanna figure out what's going on, so we'll strace it. And then we'll remember that we can't, because it detects that, bails out, and says we're being traced, we're stopping. So literally your strace process sends a SIGTRAP and your application is like, whoop, hold on. It's a trap, hold on. Now it may continue on and continue to run and give you what you want back, or it may just die, get a SIGSTOP, a SIGKILL, right? It depends on how the application might be coded. So keep that in mind. If it's something that you really can't have die off and it's acting just the way you want it to, maybe look at other options, okay? But again, it's very, very rare that this happens. Again, mostly Plesk, right? That's kind of the big one we've seen all the time. Small applications can give you quite a bit of output, even just small ones like w or passwd or things like that. So we can save the output to a file, simply like this: strace -o file and the command or the PID we want, okay? Ooh, wait, let's figure out what happened here. That's good, thank you. All right, no big deal. Where is caffeine? Where'd you go, caffeine? Is it up there anywhere?
All right, put it on, okay. Let me turn off the displays real quick. Well, I know what's going on here. Let's see here. Turn off mirroring, and there is caffeine. Awesome. Display, arrangement, mirroring. For people who do presentations, caffeine is a great little tool, again one of those things that just continues to, where'd you go, Firefox? There we go. All right, caffeine just continues to move your mouse for you, quote unquote. All right, so again, we can direct the output to a file if we want to. Normally it goes to standard error. Just keep that in mind, not standard out, but standard error. If you try to pipe it and get the first 10 lines or something like that, you just need to redirect standard error to standard out. All right, so next. We can also follow child processes. Again, we'll see this explained a bit more in one of our last demos, where we strace Apache, okay? For those who are familiar with Apache, you may know that the root process is kind of the main PID, but it spins off all the child processes that handle all of the connections. We don't want to follow the root process; we want to follow those child processes more than anything else, okay? So we can do that as well, as you see up here. And these commands are just building on top of each other, right? What we're seeing here is the dash ff to follow the children, dash o to put it out to a file, and then a PID for whatever, maybe the Apache process, maybe it's something else, okay? And the output's gonna look something like that. What you'll get from that command is all the files individually labeled with whatever name you gave it, as well as the PID number appended for each one, okay? So useful stuff. We've seen this used multiple times in the scenario where Apache is segfaulting over and over and over, and we think it's a PHP file somewhere in there.
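A sketch of that `-ff` behavior in Python (the helper name is mine; it assumes strace is installed and returns None otherwise): every traced process lands in its own `PREFIX.<pid>` file.

```python
import glob
import os
import shutil
import subprocess
import tempfile

def trace_children(cmd):
    """Run `strace -ff -o PREFIX cmd` and return the per-PID output files."""
    if shutil.which("strace") is None:
        return None  # strace not installed; nothing to show
    prefix = os.path.join(tempfile.mkdtemp(), "trace")
    proc = subprocess.run(["strace", "-ff", "-o", prefix] + cmd,
                          capture_output=True, text=True)
    if proc.returncode != 0:
        return None  # e.g. ptrace not permitted in this environment
    # One trace.<pid> file per process that ran under the trace.
    return sorted(glob.glob(prefix + ".*"))

files = trace_children(["sh", "-c", "/bin/true"])
if files:
    print(len(files), "trace file(s)")
```

With an Apache parent PID instead of a throwaway command, you'd get one file per worker, which is exactly what makes the segfault hunt below possible.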
What we can do is do that strace, follow the children, and attach to the Apache process. What we'll get is we'll see the processes all attach and all die off really quickly. And the last line of every single one of those files will be an open of such-and-such PHP file and then a segfault, okay? So super useful and super helpful when you wanna show a customer: look, it's this file; all these processes die from the same file. Did you change anything? No, right? Except that one file, right? I've been on the phone with customers for 45 minutes to an hour. You changed nothing? No, no. Show them that. Oh, we did change that one thing, but it shouldn't affect it, no, it's part of the framework. So, all right. You can also use strace to not only get the output, but get stats: a summary of the system calls that were made, any errors, and how much time it spent on each system call. This is useful when, for example, I'm not sure where to start looking at a process, right? We've already seen here that the output can be quite big, and we'll see it again in the demos. So it's helpful to see, okay, where is it spending most of its time? Is it spending most of its time mallocing memory? Is it spending most of its time opening files, checking for other files? Right, we'll see that here. And the output looks something like, look at that, okay? You'll see here that you get a percent time that's spent, and it's percent time in system calls, okay? From the man page, it tries to measure that from the time difference between going into the system call and coming out of the system call, okay? You'll get the number of seconds spent; I believe that's wall time, if I remember correctly. And then the number of calls it does, and the number of errors; the errors are, if I remember correctly here, the number of negative-one return codes, okay? Think: the file wasn't there, right?
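A toy version of that `-c` summary, just to show the idea: tally syscall names and negative-one returns from raw trace lines. The sample lines below are made up for illustration; real input would come from a file written with `strace -o`.

```python
import re
from collections import Counter

# Made-up sample of strace output lines for illustration.
SAMPLE = """\
open("/etc/ld.so.cache", O_RDONLY) = 3
open("/etc/passwd", O_RDONLY) = 4
open("/missing", O_RDONLY) = -1 ENOENT (No such file or directory)
read(4, "root:x:0:0"..., 4096) = 832
close(4) = 0
"""

calls, errors = Counter(), Counter()
for line in SAMPLE.splitlines():
    m = re.match(r"^(\w+)\(.*\)\s+=\s+(-?\d+)", line)
    if not m:
        continue
    name, ret = m.group(1), int(m.group(2))
    calls[name] += 1
    if ret < 0:            # strace -c counts -1 returns as errors
        errors[name] += 1

print(calls["open"], errors["open"])
```

Same shape as the `-c` table: per-syscall call counts and error counts, minus the timing columns.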
Maybe something that's not going to be detrimental, but again, maybe the file wasn't there, maybe you didn't have permissions; any of those things would cause an error on something like an open, like that right there, okay? All right, let's see here. And then lastly here, we have how a process can exit, right? We'll get that at the very end of that strace file. Useful to see if the process did exit out correctly, right, or if it was killed by a signal or with a different return code, okay? Now, we can get the return code from a bash variable just as easily, but we can save this and look at it later on, okay? So that's kind of a quick rundown of some of the flags I find useful and how to use strace, okay? Our next thing here is going to be system calls, okay? This slide came up a little earlier, right? I want to make sure when I introduce a topic or a term, I define it very early on. And it's the same one here, right? A system call is simply that: a function provided by the kernel that the application uses to make hardware calls, right? To write a file, to get memory pointers, things like that, okay? Now, there are about 441 different system calls. I'll show you how I got that number later on. Some of them are synonyms for others, so not quite that many, but there are quite a number of them. We'll go over a few of them today that I find useful, kind of the first ones I want you all to know, the ones that most programs are doing, right? Opening a file, reading a file, checking the stats of a file. We'll look at those very briefly. And also, this is one of those sections where I want to talk about finding out more information and learning a bit more, okay? The man pages: usually we hit man and then a command, and we get a man page up, right? Now, which section of the man page we get can vary, okay? The man pages are built out of different sections. Section number two is for system calls.
Now, fine and dandy, but there are some commands, or, as you see here, a bash built-in, that have the same name as those system calls. For example, here, read, okay? If you do just man read, you'll get the bash built-in man page. Not what we're looking for when we're trying to figure out what a system call is doing. So, simply, man 2 read, and then we get, as you see up here, the system call man page, okay? And this works generally: you can do man 2 and then any system call and get the information back, okay? Keep in mind some may bring up the system call page first, some may bring up another one first, right? stat is a great example of that. You run man stat, you're gonna get the man page for the command; you're not gonna get the system call page for it, okay? And we'll see that later on as well. Oh, let's see here. I really regret throwing a 30-second timer on my phone now. All right, a note on system calls. The good thing about system calls is that they all have a similar structure, okay? As we see up here, okay, you have the system call. I'm gonna do this real quick here. You have a system call, you have a read, then you have any arguments, right, that you might have for the system call, and then a return code, okay? And the beauty is that all of these system calls have the same structure. They'll have the command, any arguments, and the return value, okay? And from the man page, you can look at what the return value should be, what it means, right? And what those structures are that you're using inside those system calls, okay? All right, so let's cover a few of them that I find useful, especially when someone's just learning about strace, kind of the ones you're gonna see almost every single program doing. So let's start with this. Open it, whoa, there we go, there we go. Opening a file, right, opening or possibly creating a file. Most applications are gonna want to open a file, create a file, things like that.
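That name(arguments) = return-value structure is regular enough to pull apart mechanically. A small Python sketch (the regex is a simplification; real strace output also has unfinished calls, signals, and other edge cases it doesn't handle):

```python
import re

# name(args) = retval  -- the common shape of one line of strace output
LINE = re.compile(r"^(\w+)\((.*)\)\s+=\s+(-?\w+)")

def parse(line):
    """Split one strace line into (syscall name, argument text, return value)."""
    m = LINE.match(line)
    return m.groups() if m else None

print(parse('read(3, "root:x:0:0"..., 4096) = 832'))
```

Having the three pieces separated is what makes filters and summaries like `-c` possible in the first place.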
What you'll see up there is you'll give it a path name. You'll see any flags you wanna use with it: read-only, append, any of those things, right? So you give the path name, and open returns a file descriptor, right? A file descriptor is simply just a small number that the program will use to reference that file later on, okay? And one place you'd reference that file descriptor is exactly the read system call. Notice here, what it's expecting is a file descriptor, okay? Coming from the return code of that open, if you will, okay? On success, it returns the number of bytes read, okay? If you see that you're requesting to read 800 bytes and you got 400, well, maybe you had an error. Maybe you read to the end of the file, right? So, you can look at your system call and see, hey, am I reading as much as I expect to be reading? Yes, no, why or why not, okay? And then fstat or stat. Again, this is another one of those where if you just run man stat or man fstat, you'll get the application man page, I believe, but if you do a man 2 stat or fstat, you'll get the, well, syscall man page, there you go, okay? And this returns the information about a file. This is what the stat command shows you, right? It returns the inode of the file in a structure, and you can begin to read those variables out, right? For example, the inode number, the permissions for it, things like that, okay? And then lastly, yeah, lastly, we have the syscalls for mapping memory, right? To map memory, to unmap memory, get a pointer back, right? We've seen it where applications just go through and map a ton of memory, and we can show they're doing that by looking at system calls, right? So yeah, those are some of the most basic ones, or at least the most introductory ones, right? The first four you're gonna see when you start looking at different applications. All right, so enough of me talking, let's do some demos, it's gonna be fun.
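You can watch that open/read/fstat chain from Python, too, since `os.open`, `os.read`, and `os.fstat` are thin wrappers over the open(2), read(2), and fstat(2) syscalls. A minimal sketch using a throwaway temp file (running this script itself under strace would show the same calls directly):

```python
import os
import tempfile

# Create a small file to play with.
fd_tmp, path = tempfile.mkstemp()
os.write(fd_tmp, b"hello syscalls\n")
os.close(fd_tmp)

fd = os.open(path, os.O_RDONLY)  # open(2): returns a small integer file descriptor
data = os.read(fd, 4096)         # read(2): returns the bytes actually read
info = os.fstat(fd)              # fstat(2): stat structure (inode, permissions, ...)
os.close(fd)                     # close(2)
os.unlink(path)

print(len(data), info.st_ino > 0)
```

The file descriptor from open is exactly what read and fstat take as their first argument, which is the thread you follow when reading a trace.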
You get a chance to see me type horribly, okay? Real quick here: on the slides, if you don't see a quick video up here, just refresh your page; it comes up sometimes the second time. I have these videos up here with the idea that you take these slides later on, and whatever demos I'm doing on the screen, you'll see up here later on, right? Obviously the demos aren't being recorded, so I try to provide these for you all as a kind of base for what I did up here on the screen, because usually that's hard to read. And we'll go ahead and look at, for example, that's not the right one at all. There we go, what's that? Oh, anybody got some? No, that, my friend, is the 50-year; I do not have that. A $36,000 bottle. There were two in the US three years ago, and there's one that came to the US, well, Canada, last year, so they're very, very rare. That one's in San Antonio, so kind of a small little plug for San Antonio. That's after the presentation. All right, real quick, I'll show you what the demo looks like here, and then we'll switch to the terminal so that we can get a better view of it, okay? When you hit play here, what you'll see is just, I think it's quite small and we'll do it better, but again, the idea of what's going on here. It's a thing called asciinema; you can copy and paste anything here, right? Play it back, kind of step through it, all fun stuff as part of taking it back and using it later on. But for here on the screen, right? So boom, strace w, right? Again, the first example here: how to use strace? Very simply, we can run a program and run strace around that, trace the program from the get-go, and we see here, right? Notice what we see here, also up here, is this first line, this is the execve line, where'd it go? I probably missed it already. Let's do this, this is easier: strace w 2>&1, there we go, ha ha. Again, everything goes to standard error, so switch it to standard out.
Again, the first part of almost any strace you're gonna do is this first execve line, okay? Notice, as we saw earlier, that if we just do strace and w, we get very small, condensed output, okay? All right, yeah? All right, cool. Thank you all, appreciate that. Okay, so what you see here, right, is this first line, very, very simple, condensed, right? So building upon that, as we saw in our usage, we're just gonna add a dash v to that. Again, real quick, in case anybody's unfamiliar with what I'm doing in this part right here: it's not part of strace itself. I'm redirecting standard error to standard out so I can just grab the first 10 lines of that with the head command, okay? Again, not part of strace itself, just a tool I'm using. So, right? So that being said, we see now our first line here, a bit more, right? All of our environment variables, everything's expanded out for us, okay? We kinda wanna get the nitty-gritty of it. Okay, say you have a program not running quite right, and you think maybe it's not getting a variable or a path or something like that exported properly to it. This can help you out, okay? And again, lastly, we see here our return code, okay? All right, so let's switch back, make sure I'm going with the slides here, right? So, the next one here: strace -c and w. I'm not gonna run it this time; I'm just gonna switch back to the screen here. strace -c and then w, right? And what we get this time is the percentage of the time spent on the calls, where it's spending all its time, okay? So, if we notice here, 245 calls to open, all right? Stepping back here real quick, take off the head here, all right? We'll notice that a simple program like w spends quite a bit of time opening, oh, let's see here, quite a few files from /proc, okay? Just to get some stats for itself, okay? So it spends a lot of its time there.
Sometimes a tech comes to me and says, I wanna learn how to get better at programming, get better at X, Y, and Z. A project I've given in the past is: use strace, trace your simple tools, and rewrite them in the language of your choice to learn how to use it, right? To re-implement some things. So, you should be able to look at this and kind of get an idea of what's going on: what files it's reading, where it's getting stats, things like that, okay? All right, so, again, we see here where it's spending all its time. Now, for example, if this were a program I wasn't familiar with, I might start looking at, okay, we'll spend some time looking at all the opens, right? Now, I'll show you a demo of that here in just a second. Let me get back to my slides. All right, we can do something like this here: strace -o passwd.out passwd, okay? Right, and again, this is a local machine, so don't you try and help on that now. It'll take a small amount of time to figure that password out. I've learned my lesson before. So, when I do a dash o, what I get is everything from standard error now goes to a file, and I can interact with the program as I would normally, like you see here, everything going to standard out as it should, okay? And I can look at my file and say less passwd.out, and see my file there, right? I can look through it, see what's going on. Kind of have some fun with that. Okay, now, let's see here. Are these demos going too fast, too slow? Should I slow them down? Good, all right, I got some thumbs up, all right. No one's saying no, awesome. All right, so, here's a fun one. So, we had a scenario one time where, well, let me back up a little bit.
As part of my role at Rackspace, one thing I do is teach Red Hat classes, and part of those classes is to have SELinux enabled for everything. And a question came up: why wouldn't you have, or not strace, I'm sorry, wrong presentation, SELinux enabled for everything. And part of that, I mean, there's an exercise about recovering the root password where we need to relabel all the files. And the question came up: why do we need to relabel? We're just changing our password. And so, what we found out is, part of what's going on, we said, is we do an ls -i /etc/shadow, right? That gives me the inode of that file, okay? And when we did a passwd, let's do a passwd again, passwd, okay, we saw that the inode of /etc/shadow changed. So hence, if we're in recovery mode with SELinux not enabled, it wouldn't get a context. But we wanted to figure out what was going on there. So the first thing we did to troubleshoot that was, as you saw earlier, strace -o passwd.out and the passwd command, okay? We did that, okay? And we looked at all the files there, and we said, that's a lot of lines to look through; I want to look at the main lines, right? But we knew it had something to do with the file. We knew it had something to do with an operation on /etc/shadow in some way. And so the next thing we can do with strace is to limit the types of system calls that we collect, okay? I consider this a bit more of an advanced or a next step; that's why I didn't put it in the first part but wanted to include it in the demo. I think it's super useful, and it's a way to begin to really dig down and filter out what we need. So, let's take a command here, and I'll break it down for you: strace -e trace=file -o passwd-trace.out and the passwd command. What I'm telling strace here is to only grab operations that have to do with a file. Open, close, create, things like that, okay?
And so, let's do that, okay? And now, all of a sudden, our area to look through is 142 lines. Okay, I'm looking at just the output of the file operations. So, passwd-trace.out, right? And we see only all the opens, right, that we get. I'll shortcut a little bit here, because that demo, or that kind of troubleshooting process, took quite a bit of time. But what we found out was that, at some point, right, the passwd command renames a newly created file to /etc/shadow, hence the new inode, okay? Therefore, if you have SELinux enabled, or it's on a running system, you need to relabel everything there, okay? So, that's kind of a way we troubleshoot our own things to figure out what's going on and validate things like that, okay? You'd be surprised by the number of applications that do just that: open a new file and then copy everything back into place after the fact. So, kind of a fun one. Again, that one was just like this: -e trace=file, and you can do that trace on different families, not just file operations. I can say just the opens or just memory operations, okay? I can really narrow it down like that, okay? All right, and then we have another demo here. Again, that video there is the demo I just did, so you have a chance to see it later on as well, in case you want to review it. Again, building out a little bit here. Again, not quite a new demo, but again, the difference here: all we're doing is adding the dash v here to get a bit more output, right? And then, well, again, what we see here is a bit more information, right? And right here is an example of what you might see as an error, as you would count with dash c, right? Again, nothing that's gonna kill the process, nothing that's fatal maybe, but maybe something that wasn't found the way it should be, okay? Sorry, say that again? So, the question is if I can filter out critical versus non-critical errors, and that's not something I've found.
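That write-new-then-rename pattern is easy to reproduce. A minimal Python sketch of it, as a stand-in for what passwd does to /etc/shadow, using a throwaway temp directory rather than the real file:

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "shadow")
with open(target, "w") as f:
    f.write("old contents\n")
inode_before = os.stat(target).st_ino

# Write the replacement to a new file, then rename() it into place:
# atomic, but the target now sits on a brand-new inode (and would need
# its SELinux context relabeled).
newfile = os.path.join(workdir, "shadow.new")
with open(newfile, "w") as f:
    f.write("new contents\n")
os.rename(newfile, target)
inode_after = os.stat(target).st_ino

print(inode_before != inode_after)
```

The rename is why the inode changes: the target path now points at the replacement file's inode, not the original's.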
When you get that dash c part, let's see here, strace -c w, right, it's just counting whatever comes back as a negative one. So, not that I've found. There might be, but not that I've found. Maybe it's out there; there's a whole lot out there. All right, let's see here, okay? And then one last demo here. This is a fun one. So, strace -ff -o, our file, right, and then our PID. You see here the little hint here: .htaccess and file access. So, our scenario: we had a customer who had their Apache document root on an NFS mount, and it had been that way for a long time. Everything had worked just fine. All of a sudden, they brought on new developers, and then things started to take a turn for the worse. Actually, let me sync this up here real quick. There we go. So, they basically got new developers, and then, like I said, their performance took a turn for the worse. And we asked them what they changed. What did they change? Nothing, nothing, right? But something had to have changed. So, this demo here, I don't have recorded. It's kind of a fairly long demo, but I wanted to at least walk you through it on the screen at least once. If you have questions later on, come up to me and ask me, or reach out to me by email or whatever, and we can do it again. But for this demo, on the server I have an Apache server running, and all I have changed here on the server is, I've lowered MaxConnectionsPerChild just so I can see things, so my processes spin up quicker than they would normally, okay? All right. And then, what they had done is, let's see here, where did it go? One second here, there we go. There it is, okay. All right, so let's do this real quick. I'm gonna bring this up here and say while true, actually, on the local box: while true, curl, okay, fair enough, okay? So I'm just creating traffic on my server here, okay?
Now, what I'm going to say here is `systemctl`, yes, I'm running RHEL 7, or CentOS 7, `systemctl status httpd`, right? Because what I care about here is, where'd you go? That right there: I'm getting the PID of the main Apache process, okay? And what I'm going to do here is say `strace -ff`, again to follow those children, `-o` and we'll say /tmp/scale.out, and it'll append the PID number after that, and then `-p` with that PID, okay? Did I get that wrong? That's weird, okay, hold on. There we go, Main PID, that's what I was looking for earlier. There we go, okay. The thing here is that because my MaxRequestsPerChild is so low, it's going to spin up processes quickly, okay? And what we'll see, if I just kill this off real quick and go to /tmp, is all these files here, right? Those are all the Apache processes that were running, okay? Let's look at one of them really quickly here. Okay, so the setup here is that if you enable .htaccess on an Apache server, every directory in a path needs to be checked for a .htaccess file, okay? And this customer had, as they called it, their directory structure "very organized." What that meant to us was very deep and very wide, right? Just a lot of directories, okay? So for example, let's do a curl here, and I'm not even kidding, it was about this deep. Sorry? Oh, never mind. So let's go back here and add the path, there, okay? Let me double-check something on my settings. That one's good. Okay, do I have that correct? I think I do. Let's see what happens here, okay? So `while true`, do that, right? Hello, hello, hello, awesome. And while that's going on, `rm -rf /tmp/scale*`, all right? And then our strace command again, okay? So again, I haven't changed the Apache settings much yet. Now I'll cut it off in the interest of time, and we'll just pick any of these files here.
Now, look at that. Notice over here, it looks for .htaccess just for /var/www; I have it turned on just for that, just so I can find it in here. But you know, it goes through the path doing these stats, and it's not a big deal, right? Let's see here. Now let's turn it on. Again, closing everything off, resetting everything, let's turn on .htaccess here. This is going to be in our /var/www/html directory, right, where all of our path is. We're going to come down here and say `AllowOverride All`, because that's the easiest answer to get it working right now. Not the best, but the easiest. Okay. All right, we're going to save that, `systemctl restart httpd`. Every time I typed that before, it came out "systemclt" instead. All the time, right? Tab-complete, and I'm like, oh, dang it. All right, `systemctl restart httpd`. Okay. That worked. Thank you. Timing, right? All right, so let's go back and do this again. `systemctl status httpd.service`, let's get our main PID again, all right? And then we're going back to our strace of that, okay? Again, as you saw before: attach, attach, attach, great. Now, if this was segfaulting, you'd see attach, detach, detach, detach, and you'd find your error. But that's good here. Now, let's take a look at any of these. `less`, let's see, scale, that, nope. Oh, sorry, scale.out. All right, let's grab one of these, thank you. Now notice here, this is what I'm talking about. What was happening was: check that directory, that directory, that directory, right here, right here, right here. And all of a sudden, you're adding in however many calls to an NFS system that you don't really need, right? Why did they do that? Well, the developers said they needed to change some Apache settings, right? And it wasn't much, but because of that one change, you add all this overhead. But they wouldn't believe us until we showed them something like this, right? How could such a small change have such a big impact?
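The overhead strace surfaces here can be sketched abstractly: with AllowOverride enabled, Apache has to check for a .htaccess file in the docroot and in every subdirectory along the requested path. A hedged Python sketch (the paths are made up, not the customer's layout, and real Apache lookup rules have more nuance) that counts those extra per-request lookups:

```python
def htaccess_lookups(docroot, request_path):
    # One .htaccess check in the docroot, plus one in every
    # subdirectory along the requested path. On an NFS mount,
    # each of these stats is a network round trip.
    parts = [p for p in request_path.strip("/").split("/")[:-1] if p]
    checks = [docroot.rstrip("/") + "/.htaccess"]
    cur = docroot.rstrip("/")
    for part in parts:
        cur += "/" + part
        checks.append(cur + "/.htaccess")
    return checks

# a flat layout: one lookup per request
shallow = htaccess_lookups("/var/www/html", "/index.html")
# a "very organized" (deep) layout: one lookup per directory level
deep = htaccess_lookups("/var/www/html", "/a/b/c/d/e/f/g/h/index.html")
```

Multiply the deep case by requests per second and by the number of web heads, and the sudden NFS load the customer saw follows directly.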
Okay, so you have that many more hits to an NFS directory, that many more calls on it, and your performance is going to suffer quite a bit, okay? And they wouldn't see it, right? And that's just one client; now throw on their 20 web heads, right? All in that same NFS directory. So again, this is my favorite example of how we can use strace to show what's actually going on and where the impact is, okay? So with that, I'm almost ready to close. Continued learning, my last section here. Like I said, there's no way I can show you everything. But what I'd like to do is leave you with some extra, I hate the word homework, but opportunities to learn if you choose to. Up there on GitHub, you can find some very, very simple C code examples that will help illustrate, when I do a simple open of a file or a read of a file, what system calls that translates to, right? So I'll give you a real quick example of that on the system here itself. Let's see here, if I close this and shut that off, because we don't need it anymore, and cd into the strace code examples, okay? What you'll see here is some very, very simple C code. All right, hello world. As with anybody's first program, hello world, right? But what we can do is compile that: `gcc helloworld.c`, okay? Good. And let's run a.out, again the basic standard output it gives there: hello world. But what does that turn into in system calls, right? We mentioned earlier that applications don't, bless you, bless you to all of y'all, talk to the kernel directly; the library calls do. So, strace here and say hello, oh sorry, a.out, right? What do those simple library calls turn into in strace? There we go, right? We see all the things it's doing just for a very simple hello world, okay? Now you can build upon these examples and say, what if I write a file? What if I map to memory, right?
And see what type of system calls you get back from that. So you begin to see, as you tie it all together and look at bigger applications, what they're doing, right? And you get an idea of what's happening in there, okay? So again, just real quick, small code examples: we look at opening a file, then going through and trying to read a file, okay? And then following through with this real quick: `gcc openfile.c`, let's see. That's fine, old code. And if we run a.out on a file, there we go, right? But if we do this now with things we learned earlier today, `strace -o myfile.out`, and run it again. Now we get to see what those small opens and reads look like at the system level, okay? And that's a way y'all can go and learn a bit more about that, right? Small code; find other code online and do the same thing, okay? It helps you visualize what's going on code-wise versus what's actually happening on the system itself. So, all right, let's see here. Toward the end, I want to jump back to the beginning here, because like any good talk, right: show you what I'm going to talk about, talk about it, and then tell you what I told you. So, in summary: we talked about why I find this topic super important for techs to understand and why I wanted to share it. Hopefully you all found it useful, and can find it useful walking out of here. We talked about what strace is, talked about how to use it, right? And then we talked a little more about system calls and what those mean, okay? We did some demos up there, we did some horrible typing, or at least I did. Things worked for the most part, which is kind of cool; it's very rare that happens. And then we talked lastly about continued learning, right?
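The same exercise works without C, if that's more convenient: Python's os module functions map almost one-to-one onto the syscalls strace shows. A minimal sketch along the lines of the talk's open-file example:

```python
import os

# create a small file to read back
with open("demo.txt", "w") as f:
    f.write("hello world\n")

# os.open / os.read / os.close correspond closely to the
# open(2)/openat(2), read(2), and close(2) calls strace would show
fd = os.open("demo.txt", os.O_RDONLY)
data = os.read(fd, 4096)
os.close(fd)
```

Running this under strace (e.g. `strace -e trace=file python3 demo.py`) shows those calls, mixed in with the interpreter's own startup noise, which is itself instructive.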
How can y'all take this from this presentation out into your next troubleshooting adventure, I'll say, okay? So with that, again, because I'm not used to having an introduction, which is very nice, thank you very much. Real quick, contact information. My name is Alex Juarez, I'm a principal engineer at Rackspace Hosting. My role there is a lot of training and mentoring of our front-line support techs, so doing things like this for them daily, kind of one-on-one, or on a weekly basis with brown bags; that's what I do there at Rackspace, okay? You can find me online, and my full bio was up there earlier as well. So real quick, that's all. And then lastly, Q&A. Oh, and we talked about whiskey. Let me walk up here real quick so I can see y'all if you have questions. Questions? Correct? Yeah, so the question was whether strace is included in most distributions. Yes, strace was ported over from Sun in '91, and then kind of merged together for a 2.5 version in '93. So it's been around for a long time, and it's been on all the systems I've seen. Yeah, question right there first, and then I'll come back here. The question is, how can you strace a cron job? For me, it wouldn't be the cron job itself that I'm stracing; I would strace the script it's going to run, or strace the cron daemon and follow the children it spins up. That's how I'd attempt to do it, yeah. Or just strace the script itself and see the environment it's running in. Question? "I want to strace something like Java, but it's really, it's called UniBasic, by Dynamic Concepts. It's a language that stores its source as P-code, sort of like Pascal P-code. And myself and the developers are really trying to figure out what in God's creation this thing does half the time."
"And so would strace be an appropriate tool for that? When I said Java, you kind of laughed, and I understand, when you look at the Java JVM argument list." Yeah, so the question is, could I use strace on something like Java, or kind of reverse engineer some code, right? And it really depends on how it's implemented. Like I said, something with an installer like Plesk might just bail out. And really, it's not so much to reverse engineer the code itself, but to see what's happening on the system. Maybe what files it's trying to access, what memory it's trying to malloc, right? So it's not necessarily to reverse engineer that code, but more to figure out what it's ultimately doing on the system. Is that good? Yeah, so it's using ncurses to show that. Did I answer that? I'm not sure. I haven't come across that in my experience, at least in what we've troubleshot in the past. Usually what we're troubleshooting is going to be, for example, what's it called, Coyote, right? A web backend or something like that we're trying to troubleshoot. We might look at that a little more, but I haven't seen it used with ncurses, or had that experience before. Yeah, another question? The question was whether I work with DTrace. I do not work with DTrace much. I know it's out there; it's just something I haven't used in my day-to-day. Any other questions? Questions in the back? Okay, that's a great question. So the question is: I'm stracing a process, and I see file descriptors already out there. How do I know what they are? I'm glad that question came up. This is fun. If you have the PID already, go to /proc, let's say, cd /proc, right? All the numbered directories in there are PIDs, right? So within that, and actually this is a talk I submitted that didn't get accepted, so talk to SCALE about that.
If I go into any of those directories, those PIDs if you will, let's do one where you're pretty sure what it is, right, and I go in there and look at the directory fd, for file descriptors, it will give me the file descriptor mapped to the file, okay? So that's how you can tie those together, okay? Any other questions? You're very welcome. All right, any other questions? No? Cool. Thank you. Can you guys hear me in the back? Yeah, thanks. Okay, so this is actually a talk that I gave internally at Adobe last fall, and it was really successful. I wanted to present it to a wider audience, get more feedback, and see how we might improve this tool. Like she said, my name's Christopher Edwards. I am, and that is actually my title at Adobe, Saltmaster. I do all the Salt things for Adobe, which is really fun. And this is about incorporating security and compliance into SaltStack for our security initiatives and compliance standards. So first of all, I want to talk about why we did this. Why am I doing this? Why did we create this tool? I like pain. Primarily, well, I should give credit to my wife; she inspired me to do this. She works in InfoSec, and I work in operations. Not both at Adobe, we don't work at the same company, but after work we'll talk about our frustrations. She's trying to come up with security guidelines to give to the operations people, and they always say, and you probably know how it is: you're not my boss, why are you telling me how to run my box, and how to set up my servers, and how to do all this stuff? And I'm on the other side, and I have the same frustration from the other end. So we were talking one day, and she said, you know, you really should take some time, go talk with the security people, see what their initiatives are, talk to them about your initiatives, see how you guys can collaborate, see how you can make each other's lives easier, right?
So I went and had lunch with some of the security people, and we talked, and I got a good idea of what they were trying to accomplish, and how it aligned with what I was trying to accomplish, or in some cases how it didn't align. And we started to have some meetings and discuss how we could make each other's lives easier. So it kind of went like this: they had bought some tool, and it was supposed to just solve all our problems. You know how that is, right? Here, just run this tool and everything will be great. But from the operations side, I don't have any insight into the tool. I don't know what it's doing. I don't know where it's sending information. I don't have any insight into the results. What we had was: they would run something, it would send data back to them, they would analyze it and make some report, and then we have the whole ticketing-as-a-service thing, right? They'd make me a JIRA ticket, and then I'd fix it, and it was just a lot of going around and around, and it was really inefficient, and I was frustrated with it, and I think they were a little frustrated with it too. So I started asking them: what are your requirements? What do you really need to do with this? And what are you getting with your current tool? And one day I thought, you know, I think I could probably achieve the same thing by leveraging SaltStack as the execution framework, and essentially send these compliance audits to my systems and have them report back. So from my perspective, I wanted to do this so that I could gain greater insight into the compliance level of my systems. I don't really care so much about the security guys; I mean, they do important stuff, but they're my machines, they're my babies, I'm in charge, my neck's on the line if they break. I want to know if they're secure; whether they know is not terribly important to me. So I wanted to move the insight from them to me.
I wanted to have more control over my compliance level and better visibility into it than just having them tell me, oh, go patch this, here's another ticket, here's another ticket. So we went from "just run this tool, it's security, this is what we're using" to something that wasn't so much of a black box, that worked a little better for me and wasn't just magic, right? So I decided, all right, how am I going to do this? How should I approach it? I wanted to leverage our existing tools and existing frameworks, not take new tools and new frameworks and bolt them on. This was also about the same time when, well, I'm sure you're aware Adobe's a big company, we have lots of different departments, and I was getting instructions from a bunch of them: hey, run this agent on your systems, and run this agent, and run this one. Security wanted something, and monitoring wanted something, and this other group wanted something, and it felt like we were just going to pile more and more agents onto my machines to watch this and watch that. No, no, no, that's way too much. Because then I've got to monitor this agent, and I've got to monitor that agent, and then the NOC has to call me to restart this one, and then restart that one. Of course they're not calling InfoSec to restart the security agent, they're calling me, right? And when it's a pain point for me, I want to do something about it. So I thought, why don't we try to consolidate these down? Why don't we take these multiple agents and tie them all into the SaltStack agent?
SaltStack has an agent that runs on a machine, it has a scheduler, and it has custom modules, so we can just say: all right, Salt, run this; Salt, run that; Salt, audit this; Salt, monitor that. Let's consolidate those down into one agent and leverage that, instead of loading more and more monitors onto our machines. So, based on my talks with InfoSec, I wanted to let operations do the driving but let security navigate. Have them tell us what the requirements are, and then let us implement them, not keep it all on their side and just create tickets for us. Again, like I said, I wanted better insight into the compliance level of my machines, based on their navigation, based on their requirements. So this was one thing I really wanted to leverage. And I wanted insight for both teams, not just one team. I wanted to get rid of the black box and the magic. So the next requirement, of course: comply with all the things. We had this meme pasted up on the walls. Last fall we had this big push for some compliance standards; we wanted to achieve this standard and that standard and so on. And so everybody was heads-down, security, security, security, for a number of months. And of course I have to have some memes in here, right, or it's not going to, anyway. So I came up with a project. I kind of put it together in my head. I don't know if you guys have the same thing: I have a La-Z-Boy that I call my thinking chair. I kick back in it and I think, and then I usually fall asleep. But it's still percolating in there while I'm sleeping, right? And then I wake up and I'm like, ah, I've got it, right? So I came up with some ideas in my head, and I thought, we should do this. And I had to come up with a name for it. I've been accused in the past of coming up with boring project names. So I came up with Project Argus.
If you guys remember your old mythology; maybe that's another dumb project name, but I like it, so deal with it. Yes? Never mind. You're the worst, Corey. So this project is broken up into three different pieces, based on the requirements from the security team. We needed to have a baseline set of security requirements. We also needed to do file integrity monitoring: watch for bad checksums of files, watch for changes in binaries, that kind of thing. And then we also wanted to add additional insights, so we could look for things like what open ports are on a system, bound to which binary. Is somebody running some rogue thing that's sending stuff to Russia or whatever? I don't know, right? So we broke it up into three pieces. The first piece is the CIS benchmarks. Is anybody familiar with CIS, the Center for Internet Security? Is anybody affiliated with CIS, before I bad-mouth them a little bit? You have to comply with it, but you didn't write it. Okay. I like CIS in that they define a list of checks and benchmarks: you should comply with these things. What I don't like is that I think they should hire a technical editor and have even a little bit of standardization between their documents. In the CentOS 6 benchmark, rule 2.1 is one check, but in the CentOS 7 benchmark, rule 2.1 is a different check, and in the Debian benchmark it's a different check number again. For a standards body, they don't understand standards. It drives me crazy. Maybe it's a little of the OCD, you know, but yes, if anybody knows someone at CIS or something, I don't know, hit them, Corey says, or tell them to hire Larry or some technical editor and fix it. Anyhow, so this is the first piece, and I'll talk a little more about it in a minute. And then the second piece is the file integrity monitoring.
Like I said, watching for changes in binaries: we want to gather checksums, SHA-256 or MD5 or something, of all the binaries on our systems. When a CVE is released, they usually say, here's a known bad checksum for a binary. Being able to look for that, to query our entire infrastructure for known bad checksums, was another thing we wanted to do. And then for the third piece, we decided to go with osquery. Anybody familiar with that? It comes from Facebook, right? It allows you to query your server like it's a SQL database. You can run select statements, select open ports and processes where port equals whatever, and put together query statements that allow you to gather insight into running agents and SUID binaries and open ports and all kinds of stuff. It's a really, really cool tool. And there is a Salt module for osquery. So with Salt, we can go out and query all of our systems at once and bring the data back, just like we have this large distributed database, and then do comparisons. At Adobe, we send most of our data into Splunk, and then we can create reports and all kinds of stuff. But again, this is the third piece. Now, osquery can run as a daemon, but we decided not to run it that way, because again, like I mentioned, I don't want more and more daemons on my systems; I want to consolidate them. I have a Salt agent, and the Salt agent does these queries for me. You can send a query right into the osquery binary and get the results back without running it as a daemon. So we're essentially using the Salt agent as the daemon to pass the query over and then get the results back. Now, to go back to the first component, the CIS execution module: when I first gave this talk back in August, I hadn't quite open-sourced it yet, but it is available on my GitHub now.
Be warned, it currently only supports CentOS 6 and CentOS 7, but I have a new roadmap defined where I want to make it a lot more modular, and I'll talk a little more about this at the end: the roadmap to support Debian and pretty much anything else, any kind of test suite that people want, and even make it flexible enough that it doesn't require SaltStack. I think SaltStack gives us a lot of benefits, but make it supportable by Puppet or Chef or whatever. So I've got some ideas for that, and again, I'll talk about that in a little bit. So let me do a quick demo of the CIS module that I wrote, and hopefully the demo gods will smile on me and it will actually work. If I can get my console to show. Okay. Is that big enough? Can you guys see that all right? Two more from the front row. So for the CIS module: it's a huge list of these security benchmarks. You should have these mount points, you should have noexec on this, you should have these security settings, and so on and so on. It's probably 130 different checks, and I wrote them all into a Salt execution module, and I made it so that I could run an individual check, a group of checks, or the entire suite of checks with one command. So I'll do salt, and I'll target a machine here, and say cis.audit_1_1. So audit_1_1 corresponds with the CIS document, chapter 1, section 1. I don't know if you can even see the results there, but I passed a couple and failed some. I can run it with details. Actually, let me run the whole suite here. It was in black, and I don't know if that was very visible, but I'll try to do better. Okay. So running the entire suite of tests only took about that long; that was 10 seconds. It's not that bad. I was really concerned about the performance; I didn't want to bog down these machines. So on this test VM, based on the CIS benchmarks, I've failed 57 of the tests and passed 55 of them.
Not great, but this is kind of just a default CentOS install; it doesn't have some of these things applied to it. Now, these results are useful, but I need more insight. So I can also run it with details=True, and it will tell me which tests I passed and which I failed, the names of those tests, and which chapters and sections they correspond with. It's wrapping here because of the resolution, but it says I failed "Create separate partition for /tmp." That's a scored item in the benchmark, and I failed it; the default install doesn't set up some of these partitions. "Remove the telnet client." That's one of the benchmarks I've argued back and forth with the security guys about. The telnet client is not a security problem; I use it to test port connections. "But it's unencrypted." Well, I understand removing the telnet server, but not the telnet client. Anyway, that's another thing. And they've left netcat, which can also be unencrypted. So anyhow, those I failed, and these I've passed. Some of these, honestly, I think are dumb, like "remove the talk server." What the hell is the talk server? What's chargen-dgram? I don't even know, but it's in the benchmark. So anyway, those are easy wins: disable all of those. So you see the test name and the test number. If I wanted to rerun test 2.1.15, I could run just that test by doing audit_2_1_15. So I can run individual tests, I can run groupings of tests, and I can run the whole suite. But this gives me a good, quick, like I said about 10 seconds, insight into the compliance level of my box. Now, I just ran this on one machine, but because of the way Salt is designed, it offloads everything to the minion. So I'm not bogging down a central box; I'm executing the audit on each minion and then just reporting it back.
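The shape of such a module can be sketched in a few lines. This is an illustration, not the speaker's actual code from GitHub: the check numbering, names, and the "system facts" inputs are all made up, but it shows the pattern of one function per benchmark check plus a runner that tallies a pass/fail summary.

```python
# Hypothetical, minimal shape of a CIS-style audit module: one
# function per benchmark check, a registry keyed by check number,
# and a run_all() that tallies pass/fail like the demo's summary.

def audit_1_1_1(mounts):
    """1.1.1 (made-up numbering): separate partition for /tmp."""
    return {"name": "Separate partition for /tmp",
            "passed": "/tmp" in mounts}

def audit_2_1_15(installed_pkgs):
    """2.1.15 (made-up numbering): telnet client not installed."""
    return {"name": "Remove telnet client",
            "passed": "telnet" not in installed_pkgs}

CHECKS = {"1_1_1": audit_1_1_1, "2_1_15": audit_2_1_15}

def run_all(system):
    # Run every registered check against a snapshot of system facts
    # and return per-check results plus a pass/fail summary.
    results = {}
    for num, fn in CHECKS.items():
        # crude dispatch for the sketch: chapter 1 checks look at
        # mounts, chapter 2 checks look at installed packages
        arg = system["mounts"] if num.startswith("1_") else system["pkgs"]
        results[num] = fn(arg)
    passed = sum(1 for r in results.values() if r["passed"])
    return {"results": results, "passed": passed,
            "failed": len(results) - passed}

# e.g. a default-ish install: no separate /tmp, telnet present
report = run_all({"mounts": ["/", "/boot"], "pkgs": ["telnet", "bash"]})
```

Because each check is its own function in a registry, running one check, a group, or the whole suite is just a matter of which keys you select, which mirrors the audit_1_1 / audit_2_1_15 / full-suite commands in the demo.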
So I could run this on 10 systems, 100 systems, 1,000 systems, and it would work fine. But what I really like about this is that it gives me, as the operations guy, insight into what the requirements are and where I stand on them. And then, additionally, everybody's familiar with Salt states, like Puppet manifests or Chef recipes, where you lay the configuration down? We created a corresponding set of states that match up to each of these tests. So if I failed on 9.2.1, I could apply the Salt state for 9.2.1, and it changes that setting, and now I'm compliant. So I have a suite of states that goes along with the compliance checks, so I can easily just patch a system, rerun the test, all right, now I pass. Good, great. And then I go on: I can do these one at a time if I want, or apply a group of them, and then easily and quickly patch my systems, then send the report back to InfoSec and say, look, now I'm only failing three of these, and I will argue why that's okay. Yes? Yeah, I mean, a little bit. So each of our product teams has their own Salt installation, and we try to work with them to update their existing states to inject these fixes. So we don't have, you know, a highstate that does 90% of it and then another state that does the rest, and then they're walking all over each other. So yeah, there's a little bit of difficulty there. But, oh yeah, "you're doing it wrong." Oh, okay. Yes, right? Right, so that is something I wanted to add in the next iteration. This was strictly a proof of concept. So I don't want to say, here's the list for everybody, this is the list you have to use. You can define your own lists, your own custom modules, your own custom checks. That's what I want to do in the next builds, because yeah, I don't want to fail every time on "I'm running Apache" when it's running on my web servers, because, duh, right? Oh, let's see.
So, that's another thing that really bugs me about the CIS benchmark standard. There's a whole chapter, which I call the dumbness chapter. It says don't run Apache, don't run Squid, don't run, like, well, it's a web server. Of course I'm going to run Apache, and if it's not a web server, I'm not running Apache. My security team insists that, well, people are dumb and they'll run Apache where they shouldn't. Maybe it's just me, but I don't want to fail my web server audit every time because I'm running Apache. There's a whole section on turning off basically every service. Yeah, and I think this only checks for Apache; it doesn't check for nginx or lighttpd or whatever, so yeah. So custom checks, custom things, that's in the next iteration. So, I think the wins we get with these CIS benchmark checks: it's really fast. This was on a VM and it took about 10 seconds, so if I'm running it on a bare-metal box, it's probably going to be even faster. It's very flexible: I can add and remove checks, and in the next iterations I want to add more and make it more flexible. And it's collaborative: the checks are just a Python module for Salt. So InfoSec can add checks to it, or I can add checks; it's all just in a Git repo, and people can collaborate and add and remove checks and make it more customizable, and they do. So, we created some really pretty pictures in Splunk where we can return this data. Is anybody familiar with the concept in SaltStack of returners? Instead of just sending the data back to the console, you can say, send this to MySQL, or send this to whatever. We've written a Splunk returner, so we can send the data directly back to Splunk. They can parse it, create some pretty dashboards with graphs and whatever, and then the auditors are happy. I don't know if Scott has open-sourced it yet, but it totally will be; there's no reason not to. You want that too? Okay, okay. Oh, all right. So these, I think, are some of the wins.
There have been a couple of questions about potential flaws here and there, and how we address them, and again, in the next iteration I want to improve on and address all of those, because I want to open-source this, I want feedback, and I want contributors. If you have ideas on how to address the difficulty with competing states or something, let's do that. If we want to create prettier output or something, let's do that; let's work together on it. Now, the next piece is the file integrity monitoring execution module, also open-sourced on my personal GitHub. What this does is actually broken up into four steps. There's an execution module, but there's also a Salt orchestration runner that runs all four steps and brings the data together. So first of all, it gathers the data: the checksums of all the binaries. The execution module is flexible, so you can tell it what type of checksum to gather, whether it's SHA-512 or SHA-256 or MD5 or whatever. You can give it a list of targets: if a target is a directory, it'll walk the directory; if it's just a binary, it'll do that binary. And you can configure where it stores the results. One thing we found early on is that gathering all this data, for all the binaries, across all the systems, was a lot of data. We're already running two and a half terabytes a day through our Splunk; we didn't want to make it worse. So what we did was gather the checksums every day and save them on the system. That creates our day-one baseline. Then the next day it runs again and creates a second file, then it diffs the files and only reports the changes to Splunk. So if this changed, or this was removed, or whatever, that's the only thing going to Splunk.
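The gather-then-diff step can be sketched roughly like this. A hedged illustration of the approach (a hashlib walk, then a comparison of two days' snapshots that keeps only the deltas), not the module's actual code:

```python
import hashlib
import os

def gather_checksums(targets, algo="sha256"):
    # Walk each target and return {path: hexdigest}; directories are
    # walked recursively, plain files hashed directly.
    sums = {}
    for target in targets:
        if os.path.isdir(target):
            files = (os.path.join(root, name)
                     for root, _, names in os.walk(target)
                     for name in names)
        else:
            files = [target]
        for path in files:
            h = hashlib.new(algo)
            with open(path, "rb") as f:
                h.update(f.read())
            sums[path] = h.hexdigest()
    return sums

def fim_diff(yesterday, today):
    # Keep only what changed -- the dozen lines that go to Splunk
    # instead of the full checksum dump.
    return {
        "added": sorted(set(today) - set(yesterday)),
        "removed": sorted(set(yesterday) - set(today)),
        "changed": sorted(p for p in set(today) & set(yesterday)
                          if today[p] != yesterday[p]),
    }

# tiny demo of both halves
with open("fimdemo.bin", "wb") as f:
    f.write(b"abc")
demo_sum = gather_checksums(["fimdemo.bin"])["fimdemo.bin"]

day1 = {"/usr/bin/a": "aaa", "/usr/bin/b": "bbb"}
day2 = {"/usr/bin/a": "aaa", "/usr/bin/b": "ccc", "/usr/bin/c": "ddd"}
delta = fim_diff(day1, day2)
```

The design choice the speaker describes falls out of fim_diff: on a quiet day the delta is empty and almost nothing ships to Splunk, while a patch day produces a short, reviewable list of added, removed, and changed binaries.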
So instead of thousands of lines, it's a dozen lines maybe, if we applied a patch that day, right? So in this iteration, it will gather all the checksums, save them on the machine, date them, and then diff them and so on. And then I put together the orchestration runner that goes out and says, all right, everybody do your checksums, save them. And then the next step is the master will go out and say, all right, everybody give me your diffs, and they're all collected back centrally to the master. And this was before we wrote the Splunk returner. So we were running a Splunk forwarder on the master, and it would read in that log, and that's how it would get the daily diffs. But now we just want to do the returner directly from the minion. So it does the diff and then sends it directly to Splunk, and we kind of move the master out of it. This, if you can see this, this is the four steps that it does. This is the salt orchestration runner definition. So it says, all right, everybody do your checksums. All right, then use the cp.push module and send them back to the master. All right, then run the fim.diff execution module and then send it. So it's orchestration as code. Do this, then this, then this, then this, and do it on these targets and gather the data and so on. So it's actually kind of simple. There's just the functions inside of the FIM module, and then it runs those functions in a certain order to gather that data and send it back. Yes. We've run this through the salt scheduler, but yeah, on a daily basis. Yeah, and because they're all coming back to the master, again, that's something we want to improve on and send them directly to Splunk, but we got to use splay with this to say, all right, don't everybody do it all at once, but let's use the salt scheduler to spread it out. I'm gonna try and demo this one, but as is normal with presenting, I was fiddling with it earlier, so it might not work, but I'll give it a try. But for this, I can just do salt-run.
Oops, state.orchestrate. That's not my bug, that's an upstream salt deprecation. Dave, you can fix that. Dave is a salt engineer, if you guys have salt questions. Red Hat has to fix that. Okay, that's Red Hat's bug. Ah, no, we need to smash this. So I broke this one, but I promise it worked yesterday. That's my fault. Anyhow, you kind of saw what it was doing with this, if you can see this, but it would target out everybody: all right, gather all your checksums, save them, and then we can do a diff, and then we can collect it back and we can send it. Yes, Corey, again. Sorry. It really depends on what the targets you define are. Like, if you wanted to scan everything on your box, that's gonna take a long time. We limited it initially to just the paths or the directories that are in your path, so most of the binaries. And actually, this is a demo that I can run. I can't run the whole orchestrate thing, but I can run — so I wanna run fim.checksum on this list of targets. I can define the targets, and it doesn't really take too long. Now, that's not a huge list of targets, but it's gonna walk through all those directories and scan every binary that's in those. But, you know, there you go. By default, it's SHA-256, but it's configurable. You can define whatever checksum you want, and it gathers all of this data: atime, checksum, ctime, GID, group, hostname, inode, mode, mtime, size, target type, UID, user. Our InfoSec team wanted, like, everything. So if the ownership's changed, we wanna know about it. If the mtime changed, we wanna know about it, yeah. It's actually a combination of two existing salt modules. There's file.get_hash, and I forget the other one, but it's just, hey, salt, go run a checksum on this file, and then it parses it and puts it in this little YAML output and stuff. So, that didn't take — is that too long for you, Corey? No, no. That's acceptable. So, there's that piece.
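The gather step demoed above — walk a list of targets, hash every regular file with a configurable algorithm, and collect the stat metadata InfoSec asked for — can be sketched with nothing but the standard library. The function name and output layout here are assumptions, not the real fim module:

```python
# Rough sketch of a fim.checksum-style gather (illustrative only).
import hashlib
import os

def checksum(targets, algo="sha256"):
    """Walk each target (file or directory), hash every regular
    file, and record the stat metadata the audit wants."""
    results = {}
    for target in targets:
        if os.path.isdir(target):
            paths = [
                os.path.join(root, name)
                for root, _dirs, files in os.walk(target)
                for name in files
            ]
        else:
            paths = [target] if os.path.isfile(target) else []
        for path in paths:
            h = hashlib.new(algo)  # algorithm is configurable, as in the module
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            st = os.stat(path)
            results[path] = {
                "checksum": h.hexdigest(),
                "size": st.st_size,
                "inode": st.st_ino,
                "mode": oct(st.st_mode),
                "uid": st.st_uid,
                "gid": st.st_gid,
                "atime": st.st_atime,
                "mtime": st.st_mtime,
                "ctime": st.st_ctime,
            }
    return results
```

Running today's gather against yesterday's saved output is what produces the daily diff.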
Here's some of the stuff that we put together in Splunk, which I don't know if you can see. This is rare targets. So, binaries that we're only seeing a few times across the environment. So we found, like, two instances of ab, or two instances of area check — I don't even know what that is — but based on the information reporting back to Splunk, the security team can then run these tools and say, why do we have this one binary out on all of our systems that seems to be non-standard? And then, because we're gathering the hostname that it's on and all that other detail, they can go track down: all right, this host has this one binary. Let's talk to that team. Let's see why it is on there. Let's remove it if we can. We also gather rare permissions. So, some of these are 0655. Why is it 0655? Or 0510 or whatever. These are rare permissions that we find after gathering all this information and parsing it and generating the reports. So we can go in and look for weird permissions on files and easily track that and see it. And then again, the last piece, the osquery execution module — this was newly added in Salt 2015.8. I believe the original developer of this module for Salt is Gareth, one of the conference organizers. So, awesome for him. I don't think he's in here, but we are loving this at Adobe, so I need to thank him for building this module. I went to build it myself, and I've found this a number of times — Corey has probably seen this too. You go to build a module for Salt, you go check really quick: oh, already done. Oh, already done. What's the quote? Salt reads my mind from the future, right? From your own blog, right? Yeah. Yeah. So, let me demo quickly the osquery module. Again, osquery allows you to send queries to your system, kind of like a select statement. So, I've got some of these in my history here, I'll just go back up.
So, this is gonna query for name, address, port, path, command line, for processes — anything, essentially, that's bound to a network address that's not localhost. So, we wanna look for anything that has external-facing connections. So, we can feed these queries into salt and run it across all of our boxes, and boom. That was really pretty quick too. So, let's see what we have here, a number of these. So, sshd, that's safe enough. The salt master, xinetd. So, it looks like it's grabbing IPv4 and IPv6 and all the information about the path and the port and the PID and everything. So, we've put together kind of a long list of these queries. We stored them in salt's pillar system so we can update them in real time without having to restart daemons and stuff. And so we have an osquery runner which will pull all of the queries that we need out of pillar and then run them. We put that into the salt scheduler and it feeds into Splunk, and we can ask: why does this one box have an outbound connection on some random port where it really shouldn't? We've got a couple examples of queries here. So, that's the one I just did. This is processes. I didn't write some of these queries; I'm not like a DBA, and I don't have to be a DBA to run this, but, you know. So, this one found the Monit daemon, the MD5 of it, the SHA-256, how long it's been running, all that kind of stuff; the salt minion, the salt syndic, that one. So, yeah. So, the osquery module is open source, it's upstream salt, it's not something that we made, but it's something we use. And the queries — I don't know that we've published the list of queries we're using. I don't know if that's of interest to people or if that should be part of the project or not, but it could be. What was the example of the reason? Yeah. So, what's coming, what are the next steps here? Like I said, I demoed this at Adobe last fall.
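The pillar-driven setup described above — named SQL queries stored centrally, fed one by one to the osquery execution module — could be sketched like this. The pillar contents and the `run_query` plumbing are assumptions; the SQL is an approximation of the "anything not bound to localhost" query from the demo, using osquery's real `listening_ports` and `processes` tables:

```python
# Sketch of a pillar-style dict of named osquery SQL queries and a
# tiny runner that hands each query to an execution function
# (e.g. a wrapper around salt's osquery module). Illustrative only.
OSQUERY_PILLAR = {
    "listening_sockets": (
        "SELECT p.name, l.address, l.port, p.path, p.cmdline "
        "FROM listening_ports l JOIN processes p USING (pid) "
        "WHERE l.address NOT IN ('127.0.0.1', '::1');"
    ),
    "running_processes": "SELECT name, path, start_time FROM processes;",
}

def run_all(run_query):
    """run_query: callable(sql) -> rows; run every named query."""
    return {name: run_query(sql) for name, sql in OSQUERY_PILLAR.items()}
```

Because the queries live in data rather than code, updating them is a pillar refresh, not a daemon restart — which is the point the talk makes.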
It got the attention of the security managers and then the security director and then the VP, and it kept going higher and higher, and I had to demo this to the CISO, which was a little nerve-wracking. And his first issue was, well, we already have this tool that we're using. Why do we need another tool? Why do we need to reinvent the wheel? From my opinion, it's, well, we don't have to spend a million dollars on this other tool; we can write our own and just hire somebody to do this, right? And honestly, we get the results faster, we have better insight. The tool that we have does essentially the CIS benchmarks and the file integrity monitoring, but it can take up to a week for us to get the report, which is totally not useful at all. We get these back in 10 seconds, right? And all the functionality of osquery is just icing on the cake. None of that functionality is in the tool that we are currently replacing with this. So they like that and they've given us the go-ahead to keep going with it. The one concern, though, is Adobe has two major business units. We have digital marketing and digital media. Digital marketing will be using this. Digital media is still using the old solution. So one of my challenges is to make this flexible enough so it doesn't require SaltStack. I think SaltStack, again, gives us some wins, but it won't require SaltStack, it will just require Python. If you can run Python, you can run these audits and get these reports. I wanna make it more flexible. Right now, the CIS execution module is about 3,000 lines long. It's got 130 checks in it. It's one Python file. It's too big, it's unwieldy, it's difficult to maintain. I wanna break it all apart into individual checks so you can then just piece together and say I want this little collection of checks or that collection of checks. It's not an all-or-nothing thing.
I wanna add support for Debian and Ubuntu and Arch and whatever, and if contributors wanna support openSUSE or Slackware or whatever, make it flexible so it supports multiple distributions. Right now, the current state, again, only supports CentOS 6 and CentOS 7. Salt Syndic? Oh, yeah, yeah, yeah. We're using that. We have a central master and then multi-tiered masters underneath, and we run this and it currently supports and passes through the syndics and sends the data back up to the top. So that's already in there. Trying to think of some other things, but essentially, we wanna make it more standard, more flexible; I want more people to be able to use it. Although I'm definitely a salt fanboy, I don't want it to be specifically tied to salt, although I may add extra bells and whistles in the salt version. But so if you're interested in this project, you know, hit me up on GitHub or whatever, email me or something. I'd like more collaborators, I'd like more insight and more feedback on how we can improve this and make it a shareable open source project that lots of people can leverage. In that regard, we are hiring somebody to help me build this. Adobe is looking for a Python developer, preferably with some SaltStack experience, to help build this out, mature it, and make it available — and we're open sourcing it, which is huge. I really like that they've said okay about that at work. You would probably have to move to Utah, so a bunch of you probably just checked out. But we do have an office in San Jose and San Francisco; you could probably work out of those offices if you're near one of those. Anyway, if you're interested, you can email me. Beyond that, here's my contact info, here's my GitHub. I've got some old crappy projects on my GitHub — don't pay attention to those, just pay attention to the salt ones. But who doesn't have really old legacy crappy code on their GitHub, right?
I was just trying something out and it's on there, but I should probably delete it because it's embarrassing. Anyway, so with that, is there any questions? Yes. So that everyone can hear. When you were speaking about the file integrity monitoring, it sounded like you were storing the files locally on the machines. Is that actually the case and doesn't that have some substantial security implications? So, maybe I should clarify, because that came up. We gather checksums of the binaries on the local system and then the master gathers those results and brings them back to the salt master. So they're removed from that local box so nobody can tamper with them. If somebody had access to our salt master, we got bigger problems, right? So we immediately grab them, bring them back to the master, do the diffs, process the data and send it to Splunk. So we tried to, yes, that was a concern that we addressed. Cool, I have one other. This is very interesting. How much of it is highly specific to salt and how much of it could be brought into another configuration management system? Right now, it is fairly salt specific. But like I said, in the next iterations, I wanna make it more generic and more flexible. So if anybody wants to help with that, patches are welcome, right? That's what we say. So, all right, thank you. Yes, we got one over here. I have one and a half questions. First, are you allowed to mention the name of the product that the security department was using? I can, but how about I tell you after? Okay, and so the other is, when you put this on GitHub, then all of a sudden the world knows some of the checks that you're using on your servers, did security have a problem with that? We have additional checks that we use internally. The stuff that we've published is based on public standards that anyone has access to. So I suppose it gives some insight into what some of our checks are, but if they're patched, then we should be safe from them. 
So that has come up as well, but I think we're comfortable with it. So that was the argument you had given the security people when that came up — this is publicly known anyway, that these checks are standard. Yeah. Okay, thanks. We have another one over here, I think. Have you looked into salt masterless and daemonless setups, for places where you don't actually wanna run a salt daemon, where you can just run states locally whenever you run a salt-call? We're not really using any of that at Adobe, so I haven't really looked into how it would work in that situation, but I can see that would be a valid use case that we could explore and see what we could do with it. We've looked a little bit into doing things like that, where if people don't actually wanna be connected to the salt master, we can give them all the states and they can run it locally and do their audits locally as well. Yeah, that's something we could look at in the next iterations. Anybody else? No? Okay. Well, thank you for your time. Testing, testing. Does this sound okay? Reasonable level? All right, I won't get much more excited than this. Okay, very good, thank you. Great session, so thank you for being here. Great, thanks. Thanks, welcome everyone. I also wanna thank SCALE for this opportunity. This is a fantastic conference. This is my first time here and it's really just incredible. And there's an added bonus. So I live in Raleigh, North Carolina, where Red Hat's headquartered, and they're iced in right now. It's one of these disastrous southern ice storms where the power lines are coming down, gusty winds. So everybody's miserable. And I escaped, I know. Yeah, unfortunately, next stop is Belgium and then the Czech Republic, where it's gonna be even colder. But hey, I escaped the great snowpocalypse in the south by just a few hours this week. So this is really just fantastic to be here. So let's talk about Pulp.
Here's the problem that we're trying to solve: managing and distributing software is really messy. There are a lot of problems when you try to insert yourself into that process that someone like Debian or Fedora or Red Hat or any number of proprietary vendors go through, of making repositories of software and other types of content available to their users. One of the problems is that there are just a tremendous number of types. So we have, of course, Debian packages and RPMs that have been the classic staples for the last decade-plus of the Linux ecosystem. But then you have things like Puppet packages. And we're talking about how Chef packages their stuff, how Ansible doesn't really package or version their stuff. We have Docker images. We have just other types of system images, ISOs. OpenStack has some image format options. How do you manage and relate all these different types of content to each other? How do you manage them in a sane way without using one, two, or three different tools for each different type of content? And what happens when these repositories change? You get new content. Software engineers are always fixing bugs, releasing new features. How do you make decisions when an update for a package comes out? Do I put that additively into a repository? Do I replace the previous version and yank it, because it had some critical bug, and put the new one in? When is it time to say, here's a clean repository, I'm making a break, and version 3.0 is going in this clean, brand-new repository? Similar to how in a distribution, we get Fedora 22, and then we break compatibility and get all new versions of things in Fedora 23. How do you draw those lines? And how do you have tools that help you draw those lines? And then locality of content is really a big problem. So Red Hat, for example, is the use case that's on my mind the most, for obvious reasons. We distribute a tremendous amount of software via a CDN.
So we use Akamai primarily to distribute software to our customers. But then what do those customers do? Do they have access to the internet all the time when they want to have access to this content? Maybe not. We have a lot of really interesting use cases. What about inside a public cloud? If you're running in Amazon EC2, do you want to be pulling your content from the public internet all the time? Or can we help you get onsite mirrors of content? And then what about your own content within your infrastructure? Do you have a development team that's producing your own packages, and you need some software to help you manage those repositories and deliver those packages to the right parts of your infrastructure? And then with locality comes control of access. And this isn't really DRM-style control, necessarily, of I-want-to-make-sure-only-people-who-have-paid-get-this, but more along the lines of stability and quality control. To say we have the brand new bleeding-edge bits that just came out of development this afternoon, and that is in a repository that our testing environment has access to. And we can promote that to a different set of repositories that our production environment has access to. And we draw a hard line in between and ensure that these different sets of infrastructure only have access to the software that's been through whatever process for quality assurance we want to put in place. So with that, let's talk about what is Pulp? Primarily it's about managing repositories of content. We started, not surprisingly, with RPMs and the yum family of content. So a lot of things in there: RPMs and delta RPMs and source RPMs and various other things. Errata get into an interesting area. We quickly discovered that the concepts of managing a repository are fairly common across types.
So then we looked at Puppet modules and discovered we can have a core set of tools that knows how to take a piece of content and take a repository and associate them with each other. And we can have a lot of pieces of content and make associations with a lot of different repositories and mix and match, move them around, and have a whole REST API and a whole common set of tools and user interface that knows nothing about specific types — like, what is an RPM, or what is a Puppet module? What do you do with this? The stuff in the middle doesn't have to care. So Pulp supports a lot of different content types. We're gonna list a bunch of them soon. But that's one of the primary goals of Pulp: to be this general framework that you can plug support for many different types of content into. We do have this pull-through cache feature. This is brand new. It's in our 2.8 beta right now. We've not released it yet, but the release will be forthcoming in the next few weeks, hopefully. If you're really interested in that, we're gonna talk in detail about it soon. You could help us test it and try it out. That's a very exciting feature that's been missing, especially from the yum and RPM family, for a long time. I've certainly missed it. Pulp's open source, totally open source, GPL. It's on GitHub. You can see everything's developed in the open, just as you'd expect. You can see the pull requests, so you can comment on our pull requests and see how the team reviews each other's code. We do peer review on everything. We always love contributions, even if it's just bug reports or compliments or beers after the conference — anything you think you might like to share. And Pulp is a Python web application. Now, before you get your hopes up: being a web app does not necessarily mean that it has a beautiful graphical web interface. Pulp does not. Pulp has a REST API and a command line interface.
This doesn't rule out the possibility of a beautiful graphical web interface in the future, but right now we have a command line interface that we'll get to know in just a minute. So let's dive into what we do with Pulp. There are two primary things that we're gonna demo. So the first is we wanna create a repository. Now, don't think about this in terms of running the createrepo command-line tool that writes some files on disk and now you have a yum repository of things. This is a more abstract concept of a repository in Pulp. It's really just a record in the database. It's empty. You tell Pulp what type it is so it can get the right plugins plugged into both sides of that repository. Now that you have your repository, you wanna get content in. So there are two primary ways of doing this. One is to synchronize with some remote repository, like Fedora or like Debian. So you would give this repository a URL and say, I want you to synchronize with that remote source, and it'll grab all that content and pull it in. You can do that manually or you can schedule those kinds of operations. And then uploading is the other primary way of getting content in. So Pulp's REST API allows you to provide a file and tell Pulp what type it is. So you'd say, I have an RPM. Here's this RPM. I'd like you to put it into this repository over here, and Pulp will do that for you. And then the third way is, once you've got content in Pulp, you can actually move it around to different repositories and make copies. And an important property of Pulp is that copies are cheap. They're really just a record in the database. So if you have 10,000 RPMs in a CentOS repository and you want to — for example, we were asked about snapshotting — make a snapshot of what this repository looks like at this point in time, you can copy all 10,000 into another repository, lickety-split. They're just a bunch of association records in the database. And now you have a snapshot point in time.
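"Copies are cheap" because a repository doesn't hold the content itself, just associations to content units that are stored once. A toy model of that idea — Pulp's real schema is far more involved, and these class and function names are illustrative only:

```python
# Toy model of repositories as sets of association records.
class Repo:
    def __init__(self, name):
        self.name = name
        self.unit_ids = set()  # association records, not the bits themselves

CONTENT_STORE = {}  # unit_id -> the actual content, stored exactly once

def add_unit(repo, unit_id, payload):
    CONTENT_STORE.setdefault(unit_id, payload)
    repo.unit_ids.add(unit_id)

def snapshot(repo, name):
    """Copy = duplicate the association set; no content is duplicated."""
    snap = Repo(name)
    snap.unit_ids = set(repo.unit_ids)
    return snap
```

Copying 10,000 RPMs under this model means copying 10,000 small records, not 10,000 packages — which is why the snapshot use case is fast.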
So you can change that original repository however you like. Once you have your repository populated with some content, you're ready to publish it. So this is an important distinction in Pulp conceptually: the act of changing a repository doesn't necessarily make those changes visible to the clients that are consuming that content. It's not until you run a publish operation that, for example, Pulp does the equivalent of running createrepo and actually writes some files to disk. And now the content you've put into that repository is available to clients over HTTP, for example. Although that's not necessarily the only way you can distribute content. Pulp also has the ability to write out content repositories to ISOs. It'll generate ISOs; you can tell it how big they should be. We have a community group that has written another plug-in that will actually rsync the contents of a repository to some remote location, which is pretty nice. And for Puppet modules, we actually have a really interesting use case. We had a very strong use case for running Pulp on the same machine as a Puppet master, and wanting the act of publishing to actually install the Puppet modules into a local environment, which is really just a directory on the local file system of the Pulp server. So that's what the publish operation there means. It could mean many other things potentially, like distributing your content with BitTorrent or anything else you could think of. So with that, let's do a quick demo, and we'll see how this goes. This demo does, in fact, require the Wi-Fi to work, but the Wi-Fi has been remarkably performant at this conference. We'll see. Is this readable in the back? Oh yeah, yeah — if we can get the lights down now, that would be the time. Cool, better? Okay, great, keep me posted. Okay, now I can't see my keyboard. It'll be fun.
So Pulp Admin here is the command line interface to Pulp, and Pulp Admin itself is really just a REST API client. It's laid out hierarchically, which makes it easy to plug in new commands for new content types that you add. So in this case, we run Pulp Admin, and the next command we see here is, you can see the cursor, great. RPM, so we wanna do an operation on RPM content, and we wanna look at a repository, do an operation with a repository, and the action we want is to list. So we've listed and we see there are no repositories. So let's create one. So here's a command to create, so we have that same RPM category, we wanna look in a repo, we wanna, the action is create. So now I've given it a repository ID for this thing I want to create, called it the zoo, and I've given it a feed URL. This is a remote, yum repository, it's live on the internet. You can also go and install packages from the zoo if you choose, it's a great testing repository. So I've created it and we're done, success. But we don't have anything yet. We can do this list command again, and we see we have this zoo repository, but there's no content. So as we discussed before, the next step is to do a sync. So we're gonna synchronize, and run here is distinct from schedule. So this is the other option. We could, instead of syncing right now, we could schedule this operation to either run once in the future, or run repeatedly on some specified interval. And I've told it which repository I want to synchronize. This is gonna fly by a little bit, but we're gonna scroll back and parse through a little bit of what the real action is. Okay, we're downloading metadata. Now we're getting RPMs, oh, that was real fast. I got all 32 before we could get a progress update. Published, okay, we're done. So let's look at what just happened real quick, and then our demo will be over. Okay, we started synchronizing. Pulp downloaded some metadata. This is essentially like a manifest. What's in this remote repository? 
It chewed on that for a moment and decided which of these things do I already have, which ones are available, made some decisions about what it needs to download, then it downloaded them. It looked for some other types of content that it didn't find — not important. That sync task completed. Now, by default, Pulp will then kick off a publish operation, which it did here. So it initialized some metadata, started writing to that metadata, and wrote out all 32 RPMs to the appropriate metadata files. Did some other things, and we're done. So that's the general user experience of Pulp. Oh, and now we can run this list command one more time, and we see that we have some content now. We have 32 RPMs, we have some package groups, we have four errata. Errata are the way that RPM-type distributions express, in a standardized way, things such as bug fixes that are available. So that is the end of the demo. Any questions? I kind of tore through that kind of fast. Just raise your hand anytime you have a question, or yell at me if I don't see you. Let's go back to our presentation, and we can get the lights back on if you like. Thank you. Where were we? We're at the demo. Okay, great. Content types. These are the content types that Pulp supports right now. Of course, the RPM family. Docker images are a very important one right now, very popular. A very interesting exercise in keeping up with all the change that's happening there — the change from their V1 API to the V2 API. They've added a lot of really exciting features, but at the same time have changed essentially everything multiple times. Puppet modules are another very important one. Also, similarly, changed a lot over the years. Several different versions of their API, each one better than the next. Python packages, OSTree. If you're not sure what OSTree is, it's worth googling, or whatever your favorite search engine is. It's really fascinating technology.
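Going back to the sync step in the demo: the "which of these do I already have" decision is essentially a set difference over the remote manifest. A toy sketch (not Pulp's actual code — Pulp works with typed content units, not bare strings):

```python
# Sketch of a sync plan: compare the remote metadata manifest
# against the units already stored locally and download only the gap.
def plan_sync(remote_manifest, local_units):
    to_download = sorted(set(remote_manifest) - set(local_units))
    already_have = sorted(set(remote_manifest) & set(local_units))
    return to_download, already_have
```

On the zoo repository's first sync the local side is empty, so all 32 RPMs land in `to_download`; a re-sync of an unchanged repo downloads nothing.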
It's the foundation of things like Project Atomic, which is another thing that's worth knowing about. Regular files — you can just have a repository of whatever old files you wanna stick in there. We have community support for Debian packages, and then we have a community user who's developed a plugin for npm but has not shared it with us yet. So we're eagerly awaiting that contribution. Let's talk a little bit about who uses Pulp. Red Hat Engineering was customer number one. All software that Red Hat distributes to our customers and to our users is managed by Pulp. Internally there's a giant Pulp that has, I don't know how many, hundreds or thousands of repositories available — a tremendous amount of content. Public clouds: if you've ever started up a Red Hat Enterprise Linux AMI in Amazon EC2, for example, and then installed software, you have pulled that software from Pulp. So we have Pulp running under a different name, called Red Hat Update Infrastructure, which is basically Pulp with an extra user interface on top of it to do a very specific workflow. Most of the public clouds have this running inside them, and the images that you would instantiate have the information baked into them that will automatically pull from, for example in Amazon, whatever region you're in. So you get the data faster and, of course, free bandwidth. Katello is the upstream of Red Hat Satellite 6. This is a whole content and systems life-cycle management project and product — we could talk for hours just about that. That's one option if you're actually looking for some more help in terms of having a graphical interface and a set workflow that uses Pulp to manage content and go through a promotion workflow. Katello and Satellite are a fantastic option. Bless you. And of course we have a thriving and growing community that we'd love to add you to as well. So here's a specific use case. This is a real basic one: just to mirror some content.
Python packages are a nice one and a popular one, but it could be any of these content types. So you can synchronize packages directly from the Python Package Index. You can add and remove, get exactly the packages and exactly the versions that you want, and keep them. I've been using Python for a long time. This is not such a big problem anymore, I think, but years ago it was a huge problem that sometimes versions would disappear from the Python Package Index. You'd be using some package like Django, whatever, at some version, and depend on that version and build all your custom stuff around it, and then when version-next comes out, the old one would disappear, which is endlessly frustrating. So Pulp is one way to take control of that and have your own well-curated slices of that upstream content. Exactly the versions that you need, exactly where you need them. So another use case is this promotion idea that I alluded to earlier: development, testing, production. And essentially we accomplish the promotion by using the copy operation. You start with a repository that either your development team uploads directly to, or maybe your Jenkins or whatever continuous integration or build system you have is automatically uploading content to, dumping it into some Pulp repository. And that is attached to maybe just some basic testing framework or testing infrastructure that will run some automated tests and give a thumbs up or a thumbs down, and if the thumb goes up, then all of that can get copied into the next repository that you've set up, which would be some flavor of testing or quality assurance, and maybe some lab has access to that — and you can see where this is going. Finally to production, which could either mean deployment, if you're in that business, or production could be, like in more of Red Hat's example, just a CDN or some public space where you're making that software available to the people who are going to use it and deploy it. Yes, great question.
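Promotion, then, is just a filtered copy between repositories. A sketch of the pick-and-choose copy behind the dev → test → prod flow — the unit fields and filter arguments here are illustrative, not Pulp's actual copy criteria syntax:

```python
# Sketch of promotion as a filtered copy: move only the named
# units (optionally pinned to versions) into the next-stage repo.
def copy_units(src, dst, names=None, versions=None):
    """src/dst: lists of {'name': ..., 'version': ...} unit records."""
    for unit in src:
        if names is not None and unit["name"] not in names:
            continue
        if versions is not None and unit["version"] not in versions:
            continue
        if unit not in dst:  # associations are deduplicated
            dst.append(unit)
    return dst
```

With no filters it's the whole-hog copy; with filters it's the curated promotion the Q&A below discusses.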
The question is, in the promotion workflow, does it handle dependencies? The answer is sometimes. It depends on the content type, and this is one of the messy parts of dealing with all these different types of content: what does a version on an RPM mean versus what does a version on a Python package mean? Different things. The algorithm for taking two versions and comparing them is different for all these different types of content. And the way you express dependencies differs too, so it's a lot of work to keep up with it all. So right now RPMs, I think, are the only content type we have that supports dependency resolution. You absolutely can copy and say, I want you to copy vim from this repo to that repo and pull all of its dependencies with it, and it will do that. More questions? That's certainly one option, yes. So the question is more detail about the curation process: how do you choose what you want to promote or not, essentially, and perhaps even what do you want to back out? There's a whole search interface. When you do this copy, you can either pick and choose, I want these specific names or these specific versions of things, names and versions, here's a list, copy only those things. You can remove things in exactly the same way, or you can just go whole hog and copy the whole thing over. The power's all in your hands with the tools. Let's keep moving. Oh yes, please. Oh, this is a very good question. Is there a way to find the difference between two repositories? This has been a feature that's been requested in the past. We don't have it. We are on the cusp of offering that, so stay tuned. I think shortly after Pulp 3.0 comes out, which I hope will be in the next year or so, maybe six months even, that is going to be a possibility. There are some tools for these specific content types that will let you compare. RPM, I happen to know, I think has at least one tool that will help you compare two different repositories.
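To make that point about version comparison concrete, here's a toy illustration (this is not Pulp's actual comparison code) of why version ordering has to be content-type aware: plain string comparison goes wrong as soon as a numeric segment reaches two digits.

```python
# Compare dotted numeric versions segment by segment, the way many
# content types require; returns -1, 0, or 1 like a classic comparator.

def compare_versions(a, b):
    sa = [int(x) for x in a.split(".")]
    sb = [int(x) for x in b.split(".")]
    # Pad the shorter list with zeros so "2.0" == "2.0.0".
    n = max(len(sa), len(sb))
    sa += [0] * (n - len(sa))
    sb += [0] * (n - len(sb))
    return (sa > sb) - (sa < sb)

# Lexicographic string comparison claims "1.10" is older than "1.9":
assert ("1.10" < "1.9") is True
# Segment-aware comparison gets it right:
assert compare_versions("1.10", "1.9") == 1
assert compare_versions("2.0", "2.0.0") == 0
```

Real RPM version comparison (epoch, release, alphanumeric segments) is considerably more involved than this, which is exactly the speaker's point: every content type brings its own rules.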
But not yet; it's a much-requested feature. So Pulp's a distributed application. We're just going to talk through the pieces a little bit so you can see what you can do with Pulp and how you can scale it. So there's the REST API. There is content served via HTTP. As we discussed, it's not the only way you can serve content, but by default that's what Pulp does with most of these types of content, serve them via HTTP. And then we have these long-running jobs, which kind of throw a wrench in the gears of what you would think of as a normal, simple web application like this. So here we have these smiling users on the top left and they're interacting with a web service. And this web service, Pulp mostly deploys with and is tested with the Apache web server (httpd), but doesn't necessarily have to be tied to that one. Normally you have this web service, it talks to a database, it keeps state there, and that's the end of it. But having these long-running jobs really does throw a wrench in. So imagine one of these users that we have on the left says, see that repository over there with 10,000 or 100,000 RPMs? I want you to publish that now. That's gonna take substantially more time than we can reasonably wait before responding to an HTTP request. So we need some way to deal with that. So we added this extra stuff. We added an AMQP message broker, we support a couple of different ones, and this pool of workers. And there's some magic that happens in those workers. Actually, it's not magic; I actually kind of don't like when people refer to software as magic. It's very well defined and very well thought out. But for these purposes we're gonna think of it as a black box, because I could do a whole presentation on the guts of the algorithms of how that works, in terms of prioritizing and keeping different resources separate from each other. Anyway, I've already said too much.
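Here's a minimal sketch of that broker-and-workers arrangement, with Python's standard library standing in for the real AMQP broker and database (all names here are illustrative, not Pulp's actual code):

```python
import queue
import threading
import uuid

database = {}           # stands in for Pulp's database
broker = queue.Queue()  # stands in for the AMQP message broker

def handle_publish_request(repo_name):
    """Web-process side: record the job, queue a message, return fast."""
    task_id = str(uuid.uuid4())
    database[task_id] = {"repo": repo_name, "state": "waiting"}
    broker.put(task_id)
    return task_id  # the caller polls this ID to track the job over time

def worker():
    """Worker side: notice a message, grab it, and actually do the work."""
    task_id = broker.get()
    database[task_id]["state"] = "running"
    # ... the long-running publish would happen here ...
    database[task_id]["state"] = "finished"
    broker.task_done()

tid = handle_publish_request("big-rpm-repo")
t = threading.Thread(target=worker)
t.start()
t.join()
assert database[tid]["state"] == "finished"
```

The point of the pattern is exactly what's described above: the web process responds immediately with a task ID, and the slow work happens later in whichever worker picks the message up.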
Let's trace real quick through how a request flows and bounces around through here, kind of pinball style. So this user in the top left we have here. Imagine that he wants this publish operation. He's requested: I want to publish this 100,000-RPM repo. That request goes to a process that's handling web requests. That process starts doing web-application-type things and adds a record to the database. It puts a record in the database about this job: some information about what repository this is, what kind of action, you can imagine what kind of things might go in there. And once that information is there, it'll put a message on a queue on this message broker. And from that point the web process is done, and responds back to that user and says, okay, I queued a job, and here's a unique ID you can use to track that job over time. And at some point in the future, one of these workers will notice that message on the queue, it'll grab that message off the queue and go access the database and now get to work and actually do that work. The beauty of this is that you can scale these different pieces independently. So if you have a workflow where you have a tremendous number of client machines that are accessing content very frequently, but your content doesn't change very frequently, then maybe you want to scale up your web processes across a lot of infrastructure. Just throw hardware at it. Conversely, if you have a lot of churn. For example, this is one story I enjoy telling: we have one community user. They have hundreds of thousands of RPMs that used to be all in one repository. It was kind of nuts. We talked them into splitting this apart a little bit. But the general use case is this. They had a whole bunch of small, slightly reusable web applications.
And if you imagine this matrix of web applications and then all these different skins that they put in terms of branding and who knows what else, all these different permutations of packages that need to go together and be built together and be shipped together. And they had these individual, small, very agile teams that worked and made little changes on these things all day, every day. And they would rebuild essentially everything every 10 minutes, all day. So just an incredible amount of churn. And that was actually part of our inspiration and motivation to make these different components of Pulp very scalable. So they're now able to scale out this worker pool and either just scale up a tremendous amount of infrastructure and retain a huge pool of workers, or we can facilitate this cloud-burst concept of: at midnight every night, everything gets rebuilt and retested, and we only need to spin up some new virtual machines in some public cloud or some private cloud for that period of time, when it's cheap or whenever it's opportunistic. So we can just add a bunch of workers, do a bunch of work, and then throw them all away. On to the next topic: Pulp's extensible. Getting back to that concept of there being a general way you can approach managing a repository and associating content with it, and not caring about the details of what is this piece of content and how does it work, how do I use it. Pulp has this central core with all these tools, and then what you as a plugin author provide is, first, how does content get in? What's some kind of funnel that you can use to hand content into that core? And then, when a user asks Pulp to publish something, you provide what we call a distributor; that's the piece where Pulp says, I have all this content, the user asked me to publish it, so here, you know what to do with this, go do that. So how does content come in and how does content go out? Those are the two important concepts that we've separated from the core.
So all we need are three primary things. First, a type definition. So again, Pulp is a Python application. We use MongoEngine right now. We're in the process of migrating toward using Postgres instead of Mongo. As you can imagine, that's a big deal. It's taking a lot of time and we have to be very, very careful. But in any case, you define your type. So for example, Docker blob or Docker manifest, and in this very simple class you tell Pulp what makes this unique. For example, with an RPM, can anybody recite what makes a unique RPM off the top of your head? Anyone? Yes, you get extra stickers at the end. Name, epoch, version, release, and architecture. For a Puppet module, it's just the name and the author and the version, or the author you might call username; they've kind of switched terminology. Anyway, you tell Pulp what makes a unique one, and that's it. Then you provide an importer. This is the thing that implements how you get content in. How do you interrogate a remote source and figure out what's there, what do I need to grab, actually download it, and then stuff it into Pulp? And then there's a distributor. This is what gets you from, I have all this content in Pulp sitting here, to, now I want to write it out to disk as a yum repository. Or now I want to serve it through the Docker V2 API, or one of the several different Puppet APIs, to my Puppet clients, for example. And that's it. Those three concepts are how our plugin writers add support for extra types. And those plugins can be versioned separately from Pulp, which has given us some nice flexibility. So, any questions about this? I'm just gonna move on. Good question. How do you deal with versioning of plain files? You don't; Pulp doesn't give you that. Pulp manages unique files. It establishes them as being unique based on their checksum and file name. That's correct, you cannot assign a version. It would be an interesting idea. It's something that we've thought about.
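That "what makes a unit unique" idea from the type definition can be sketched like this (these are not Pulp's actual model classes, just the concept: each content type declares a unit key, and storage deduplicates on it):

```python
# Each content type declares which fields identify a unique unit.
# For RPM that's name/epoch/version/release/architecture (NEVRA);
# for a Puppet module it's name/author/version.

RPM_UNIT_KEY = ("name", "epoch", "version", "release", "arch")
PUPPET_UNIT_KEY = ("name", "author", "version")

def unit_key(unit, key_fields):
    """Project a unit dict down to its identifying tuple."""
    return tuple(unit[f] for f in key_fields)

store = {}  # unit-key tuple -> unit, so duplicates collapse to one copy

def add_unit(unit, key_fields):
    store.setdefault(unit_key(unit, key_fields), unit)

vim = {"name": "vim", "epoch": "0", "version": "8.0",
       "release": "1", "arch": "x86_64"}
add_unit(vim, RPM_UNIT_KEY)
add_unit(dict(vim), RPM_UNIT_KEY)  # same NEVRA: not stored twice
assert len(store) == 1
```

Everything else the core does, searching, copying between repositories, orphan handling, can then work in terms of these keys without caring what the content actually is.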
And if you have some specific ideas, you could maybe brainstorm afterward about it. That's something that there's a general need for. I would strongly suggest that semantic versioning is the right way to go. If you're not familiar with that, go to semver.org and read and be enlightened. That is largely, in my humble opinion, the right way to do versioning. And it's the way a lot of different types of content do their versioning. Puppet modules, for example. So we could add that; that could be an option in Pulp in the future. The command line interface is also pluggable. This is not very interesting, but suffice it to say that hierarchical layout that I demonstrated is the way that you would then, in addition to RPM content, add a new branch of the tree called Docker or Puppet or ISO or whatever other content you want. In terms of integrating with Pulp, this is where the exciting part happens. If you want an out-of-the-box, ready-to-go experience where you've got promotion workflows, you can group different types of content together and promote them and snapshot them, something like Katello is really the way to go. If you are, for example, Red Hat Release Engineering, you have very specific needs and specific workflows in terms of build infrastructure and test infrastructure and quality assurance and signing infrastructure: being very strict about who gets to sign packages, what's the process of re-injecting them into Pulp, and then what's the gate on releasing. There are embargoes on bug fixes that we're not allowed to talk about, and everybody agrees that we're gonna release a bug fix for some security thing at noon on this date or whatever it may be. If you have a lot of process and custom workflows like that, then you wanna integrate with Pulp. So this REST API is well documented, and it's generic, so you can manage all the different types of content. One REST API keeps it very simple, at least as simple as it can be.
I guess I shouldn't oversell that point. It's still a messy business, as I said right up front. We publish events to an AMQP topic exchange. If you want to respond to events, for example, after a publish has succeeded, I want to now kick off a test job with Jenkins to consume, maybe install those packages on a machine and run some tests or whatever, you can do that. We also have HTTP callbacks, a similar way. This is what Katello largely uses at this point to figure out what is Pulp doing and when, and when does it need to respond to something. Okay, this is a very exciting feature. Oh yes, question. Yeah, good question. The question is essentially, is there a Python library that is usable as a REST API client to Pulp? Yes, we call it a bindings library. You can absolutely install the Pulp bindings by themselves and use them to call all these operations. There is also a Ruby gem, I guess. I don't speak Ruby, but Katello is written in Ruby, so I have to at least tolerate Ruby, I'll say. I don't have anything against Ruby, I just don't really know it. But they also have a library in Ruby that they use to talk to Pulp. I don't know how easy it is to install by itself, but if anybody has an interest or a need to interact with Pulp from Ruby, my first stop would be to check in with Katello. Or just get on the Pulp email list or the IRC channel and we'll point you in the right direction. Ah, that's with a K, excuse me: K-A-T-E-L-L-O. And again, Katello is the upstream of Red Hat Satellite 6. Katello brings together, oh, who's familiar with Foreman? Cool. So Katello brings together Foreman and Pulp and this other thing called Candlepin that's really only important to Red Hat, but it's the entitlement management part of Satellite. Katello brings those three things together into this unified workflow and user experience. Okay, pull-through cache. This is the exciting, bleeding-edge, brand-new feature.
I used to be a Debian and Ubuntu user for a long time, until I started working at Red Hat. I still sometimes use Debian and Ubuntu at home. I enjoy it; I don't like getting into distribution wars or anything like that. There are a lot of really interesting merits on a lot of different sides. I like keeping my brain in different areas and seeing what other people are doing. What are the best practices over there, and what can we learn? Anyway, one of the things I really liked about my time with Ubuntu and Debian was this thing called apt-cacher, which is a simple pull-through cache where, if you had a lot of machines sitting in your private network and you wanted them to not each download an entire copy of the base install of Debian, you could use apt-cacher as your proxy and it would do deduplication, do the right things. Sometimes it worked better than others, but it was there, it was useful, it was very, very useful. That's never really existed for RPM. There've been a couple of attempts; there was an attempt to add RPM support to apt-cacher, and it seems like that didn't go very well. So Pulp is on the brink of releasing this feature. It's first supporting the whole yum family of content, and then once we get that out and proven and tested and okay, we've done this sanely, because it's a huge, complex feature, then we'll expand that out to all the different, well, most of the content types we support. There are a couple that for technical reasons really are not a good fit for that model. But this is currently available in our 2.8.0 beta. You can go to the Pulp website and find the beta repository and install it and play with it. What it basically does is, you still create your repository. Actually, you know what? I'm just gonna pull this up and show this real quick. I don't think we need to worry about the lights. I'll increase the font size a little bit. Wait, did that work? That did not.
Okay, we're gonna look at the help text for the RPM repo create command, and we have all these knobs we can turn. So sorry about the line wrapping. Okay, the download policy at the top here is how you tell Pulp how you want to make your content available. Immediate does what Pulp has always done. Background will finish your sync very quickly, add all these records to the database, and then start a second job later to go download the files in the background. And in the meantime, you can go ahead and publish your repository. You can start copying things around. All that content is usable through the Pulp API while it's being downloaded in the background. The third option is on-demand. Pulp does not download any of the actual files until a client requests them. So we'll look at exactly how that works here. This did not stay on the slide I was on. Very good question. Am I in approximately the right place? Here we are, okay. The question is, how does background downloading scale if you want to background download a whole bunch of repositories at once? To some extent it's up to you to manage that. So for one, you can tell Pulp how much bandwidth it's allowed to use for each download operation. You can also control how many workers are available to Pulp, and it'll only do one job per worker. And those jobs include syncs and publishes and these download operations, all these things. So limiting the number of workers is another way. You can also limit the number of concurrent download operations that happen within any given worker. By default, one worker will do five downloads at a time. But you can dial that down to one if you like, or dial it back out. So essentially, as a distributed application, we've avoided getting into the usual futile attempts to do smarter things that almost never really work out and cover all the bases of trying to understand what your user really wants to do right now. So here's how that, this is what that looks like.
So starting from the left, we have a yum client where some user typed yum install vim. And yum talks to Pulp and says, Pulp, I want this file. Pulp first looks at its local file system, and if I have the file locally, great, I'm just gonna respond and serve that file. If not, it goes through this new workflow where the first stop is Squid. There's nothing specific about Squid that we've done, so you could use something like Varnish or any number of other similar types of proxies. Squid just happens to be very widely available, so it was a natural choice for us at least. Squid's in reverse proxy mode, and it talks to this streamer thing, a microservice. It's also part of Pulp. We've made this Pulp sandwich with Squid in the middle. So all of this conversation happens locally, either on one machine or at least on your local network. So we don't have to worry about a man in the middle in your SSL connection, if you want to do that. So you can still stay secure. And the streamer is responsible for knowing, okay, I know what file you want. I know that that file is available in these eight different repositories. So I'm gonna pick one and go access it. And if it's not there, then I'm gonna try the next one and go access it and start streaming the bits back. That's why I called it the streamer. So there are actually four unique HTTP requests happening here, and bits get streamed all the way back the other direction to the yum client on demand. Does that all make sense? Any questions about that? It's fairly straightforward. There are a lot of edge cases that we needed to cover to deal with this, but it's a very exciting feature. I think it's gonna be very useful to a lot of people. Okay, we have this idea of consumer tracking. We've in the past done quite a bit more with support for this concept of consumers, certainly managing what is installed on various machines in your infrastructure.
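Stepping back to the pull-through flow just described, here's a toy model of the lazy-loading idea: serve locally when possible, otherwise find the file in one of the repositories that carry it, fetch it once, and cache it. The in-memory "upstreams" and filenames here are fake stand-ins; this is the concept, not the real streamer.

```python
local_store = {}  # path -> bytes, stands in for Pulp's local storage
upstreams = [     # fake remote repositories (contents are made up)
    {"vim.rpm": b"vim-bytes"},
    {"emacs.rpm": b"emacs-bytes", "vim.rpm": b"vim-bytes"},
]

fetches = []  # track how many upstream fetches actually happen

def serve(path):
    """Serve from local storage, lazily fetching from an upstream on miss."""
    if path in local_store:          # cache hit: no upstream traffic
        return local_store[path]
    for repo in upstreams:           # try each repo known to carry the file
        if path in repo:
            fetches.append(path)
            local_store[path] = repo[path]   # cache for the next request
            return local_store[path]
    raise FileNotFoundError(path)

assert serve("vim.rpm") == b"vim-bytes"   # first request goes upstream
assert serve("vim.rpm") == b"vim-bytes"   # second is served locally
assert fetches == ["vim.rpm"]             # only one upstream fetch total
```

The real deployment adds the Squid reverse proxy and the streamer microservice between the miss and the upstream fetch, but the on-demand shape is the same.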
There's a Pulp agent you can run on each of these machines, and Pulp will talk to it and tell it to install and update and whatnot. We're getting out of that business, so I don't want to talk too much about that. Probably as of Pulp 3.0, most of that's gonna go away. Certainly the agent is gonna go away. What's not gonna go away is the ability for Pulp to help you keep track of what is installed on which machines in your infrastructure, and especially what updates each of those machines needs as they become available. This is another feature that Katello and Satellite do a really good job of displaying in a graphical way, putting that whole workflow together to identify: Heartbleed just happened. And there's actually a really interesting blog post, and there's a presentation, I think the video is not available, but a really good blog post from last summer about IKEA, how they responded to Heartbleed using Red Hat Satellite 6, which uses this feature in Pulp under the hood, to identify these updates are available, these are all the places that need that update, push the button, it gets the update out to all those places, and you're done. So Pulp can still help you with all that reporting of who needs what. We have some documentation, as most projects do. pulpproject.org/docs is the central place to start. And from there we have links to docs for specific content types. And each one is separated into both user docs and then integrator docs and developer docs. If you want to write some software that integrates with the REST API, those dev docs would be the ones for you. This slide is more a reminder for me than something for you, but I have these fantastic Pulp stickers up here. They're very nice stickers too. They have this vinyl finish on top. They look very stylish. I would love to hand some out. I also have a limited number of Red Hat Satellite stickers and I think three or four Foreman stickers.
So if you like stickers, come up and I'll hook you up with some stickers. And with that, that's really all I have to talk about. So what questions do we have? Thank you, thank you. The question was about replicating to, say, private clouds, getting a whole copy of my Pulp server to another location. This is a good question: is there a built-in way to handle replication from one Pulp server to another? The answer is sort of yes. We have a first cut of that that we call the nodes feature. It basically does exactly that. It replicates both repositories, with all of their configuration, and the content in them from a parent Pulp server to a child Pulp server. Having now put a lot of mileage on that feature and rethought some of what we're trying to accomplish, and having come to better understand, with much help from our users, the performance implications of how that works, we're gonna redo it all. So what we're gonna do, for example with Katello, is they're gonna handle this by doing just a normal sync. Say you have a hub-and-spoke model: you have your parent Pulp server and then others that are located in offices, or we have customers that put these things on trains and boats and things that only have connectivity for short periods of time. You would just set up those remote locations to do a normal sync, the normal way you would from any other upstream content source, and perhaps schedule them or whatever you need. A push model? That's it. Yeah, I see. So the suggestion is to have a way to actually push all that content down the spokes, where the endpoints don't have access to contact the hub. Gotcha, okay. All right, yeah, interesting use case. All right. I have a question. Yes. So you mentioned not just syncing other content but uploading your own content. Yes. If you're, say, uploading your own RPMs, does Pulp have a way to create the repo metadata and also sign packages? Yes to the first, and not the second.
Pulp absolutely does create the repository metadata. It handles all of that seamlessly, for any content type as well: same for Docker or Puppet. It has the APIs implemented that are required for those other content types. So yeah, you can upload your own content. Pulp does not have any features for helping you sign it. In our experience, most people already have a carefully controlled infrastructure for doing that. But it'd be very interesting to think about a way that we could better support that. There seem to be a lot of homebrew solutions, you know, as these things go. But this may be the next opportunity for us to help out. Is there a way to clean up the repositories and maintain control of the size of the repository, to keep it from growing too big? Oh, very good question. Yes, there are ways to help control the size. There are a few different knobs you can turn. So one example is, Pulp does natural deduplication, for starters. So if you have the same RPM, for example, or the same Docker image, coming from several different locations, Pulp will only store one copy of it; it recognizes it's the same one and just stores references in the database to each repository it belongs in. So that's one way that helps. On the sync operation, you can tell Pulp things like, for example, I only want you to keep two copies of any particular RPM. So as new versions come out, it'll keep two old copies plus the current one and remove the others. Or you can say, any piece of content that disappears from the upstream repository, I want it to disappear locally also; I have no interest in that anymore. And then beyond that, you can manually delete repositories and remove content yourself. Yes, you can have separate rules for the different types, and in fact even have separate rules for each repository you have. Sorry, I've got the mic over here. Yeah.
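That checksum-based deduplication can be sketched like this (a simplification, not Pulp's storage code: identical bytes arriving via different repositories are stored once, keyed by digest, with each repository holding only a reference):

```python
import hashlib

blobs = {}   # sha256 hex digest -> bytes (one physical copy)
repos = {"repo-a": [], "repo-b": []}

def add_file(repo, data):
    """Store the bytes once, and give the repository a reference."""
    digest = hashlib.sha256(data).hexdigest()
    blobs.setdefault(digest, data)   # second arrival is a no-op
    repos[repo].append(digest)       # the repo just tracks the digest

same_rpm = b"identical package bytes"
add_file("repo-a", same_rpm)
add_file("repo-b", same_rpm)

assert len(blobs) == 1                       # one physical copy on disk
assert repos["repo-a"] == repos["repo-b"]    # both repos reference it
```

The retention knobs mentioned above (keep N old versions, mirror upstream removals) then only manipulate the references; orphan cleanup is what finally deletes blobs nothing references anymore.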
Do you have, like, are there public examples of how the promotion, like if you say, hey, these packages failed, whatever, don't promote them? Because I'm actually writing much the same thing, but are there good examples of that that you can share, or that are in the community? I don't know of any public examples. Again, that all tends to be very homebrew: Jenkins jobs that orchestrate that kind of workflow. I've not seen, for example, a Jenkins plugin provided that handles that for you, but that's something that, I know a lot of people have done it, and maybe if you get on the email list and ask, somebody would be willing to share. I'd certainly like to see it as well. Is there a question on this side here? Do you have the mic? Just to follow up to the other one, first of all, do you still have to remove orphans periodically to remove the old packages? But beyond that, for the Docker repositories, how does that integrate with something like Docker Registry? Is there interoperability with that, or is this something that's meant to be completely standalone from that? Okay, yeah, two very good questions. You can certainly kick off this orphan removal process yourself. So there's an orphan removal action where Pulp will go through and identify all the pieces of content that are no longer associated with any repository; therefore they're orphans. It will then clean them up. Running that on a regular basis is a very normal thing to do, but you can kick it off manually as well. Now, the Docker question is more interesting, probably. Pulp can be your Docker registry locally, and it can pull content from the upstream Docker registry. Well, for Docker, there's Docker V1 content and there's Docker V2 content. They're almost completely different animals, even though they kind of kept some of the naming on top similar, and the look and feel is the same. Very, very different content, very different APIs.
With V2 content, you can no longer save your images to a file. So with V1 you could say docker save this repository and it would spit out a tarball that had all the contents; in V2 that's gone. I'd love to have it back; help me ask them to bring it back. So far we've been losing that argument. So you can't upload V2 content, because you can't get your hands on V2 content, but you can synchronize it from any Docker registry, including Docker Hub. And then Pulp has a separate little app called Crane that can be deployed independently. For example, if you docker pull an image from Red Hat, that is Crane: this little read-only web application serving the Docker API, serving those requests. So you can deploy Pulp the same way yourself to be that registry. Correct, yeah, it implements the pull part of the API. We don't accept a push; you cannot docker push to Pulp. That just explodes the matrix of API endpoints, but more so permissions: Pulp has a whole permission and user system I didn't wanna get into, because it's a boring user permission system, but it's there, and trying to map Docker credentials onto that is a nightmare. So we don't do that, but you can absolutely pull from Pulp, and Docker has no idea it's not a normal Docker registry. Other questions? Going once? Okay, fantastic. Come up and get some stickers if you like. See you around, thanks. Test. All right, thank you all for being here in the SysAdmin track. The track is sponsored by StackIQ, developer of Stacki, an open source bare-metal Linux installer. So thank you all for being here. We have here today John Merritt, director of managed service with Stack8 Technologies, doing a talk on network device configuration standardization with Trigger, as you can see. So thank you very much. Thank you.
So I'm talking about a subject that's very interesting and important to me, and to lead things off: how would you like to be able to reconfigure a thousand sites or devices in an hour instead of over 20 days of manual work? That's the type of challenge that I was facing that led to me developing the tooling that I'm presenting here today. So I was working on a migration project for a customer. They were moving between two different DMVPN infrastructures, and they engaged us to help them move through that process. And they provided us with the scripts they had prepared to do it, which needed to be customized for individual devices, and you needed to validate a bunch of operational state. So say that it took you 10 minutes to make that change on a device: to check and make sure everything was good, and to paste in the substantial volume of new configuration required at a rate that didn't cause IOS to start ignoring half of the commands that you typed. What I did then was I built out a script that validated some show commands, made sure that everything worked well, and made the change. And we were able to automate that process and bring it down to a couple of seconds per device after connection. And I've developed those same techniques through our different practices and projects with customers, and in managed service, where we're managing environments with large numbers of remote CPE devices. And the stuff I'm gonna show you today is really the culmination of the work that we've done on that, in terms of going beyond even individual changes to devices and into a structured configuration management approach for network devices. So, let's bring this over here. The agenda for the day: first I'll tell you a bit about myself. I'll talk a bit about my feelings on the state of the networking industry. We'll look at the approach that I propose for network device configuration management.
And then we'll run through the structure of the code that I've created that's now integrated into the Trigger project. It's something that may be a little less accessible for someone with more network background and less Python background, but you'll see in the subsequent section on adapting it to your environment that, once that framework is built, extending from it is a lot easier. I'll talk about a few planned improvements that need to be made to the tool, and finally we can take a look at the questions that the audience will hopefully have about the subject. So, I myself studied programming when I was in college, and when I entered the workforce I started doing system administration. It's now been, distressingly, about 20 years that I've been working. And in that timeframe I worked more and more on system stuff, automation, and then I got more into networking, and when I changed employers a few years back I got into much more managed service and a lot more large-scale network device automation. As for my responsibilities at Stack8, where I work: I'm the director of managed service, so I'm responsible for working with our environment to help our customers manage their devices. In some cases it's professional services engagements; in others we're providing a completely managed service, like a telco would for remote site connectivity. I work in both network and security stuff. We do large managed service environments, and presently we're managing about 2,600 remote CPEs in a variety of different environments. The tooling that I'm presenting today is fairly Cisco focused, at least in the code that I've written. However, the principles behind it are things that would apply to really any platform, and if you're fortunate enough to be using another platform that gives you XML interfaces and more advanced methods of retrieving configuration and operational state, the same stuff can be carried into those environments as well. On the state of our industry: it sucks.
The networking industry is very much where system administration was 10 or 15 years ago, when people were very happy to manually craft each server and hone each individual element of its configuration. That's really changed a lot, and even in more recent times DevOps has brought tight integration between different groups. All this stuff has really changed, and what I've found is that in the network industry those changes haven't really happened so much, and haven't happened in a structured way. There's a lot of manual process involved. I know that there's a very major telco up in Canada, where I work, where the engineering teams actually build the configuration for devices using Notepad, because the engineering teams don't have access to the network devices. Field technicians then paste those configurations into the devices in the field, but they don't actually know how to work on network devices, and hopefully they report back if errors occur during the device configuration process. Now, hopefully no one's situation here is that bad, but there's certainly a lot of room for improvement. When I look at the things that people are doing to improve the situation, you have a lot of vendor-proprietary tooling, each of the different vendors has their own suite for managing, or attempting to manage, their devices, and you also have commercial tools that are quite expensive, and often, suffice to say, I've seen some very expensive implementations of SSH in a for loop. So I feel that there's a lot of room for evolution in this space, and I think that the open switch stuff we're starting to see from the Open Compute Project, Cumulus Networks, things like that, is, I'd like to think, the beginning of a sea change, like the transition from traditional Unix environments to Linux was, and hopefully we'll be able to really improve things the way other parts of the IT industry have changed.
In the configuration management space, and I imagine this probably applies to most any part of configuration management, not just network devices, I first am looking to identify specific issues. So we might see VPN flapping, where specific devices are having a problem and seem to be disconnecting and reconnecting to the network. It doesn't really seem to have an operational impact, but at some point it will, and you wanna understand how big of an issue this is, how common of an issue this is. So I'm identifying either issues in the environment or new projects that need to be delivered for a customer, and then we're generating reports that allow us to examine the information about the systems and see how serious a given problem is, which lets us prioritize the specific changes that need to be made, what's more and less important or pressing. Finally, the tooling that I've developed allows us to then make those changes to the devices, and in my environment, with large numbers of remote devices in different field sites, I don't have all of my devices operational at any given time. Something's being serviced, there's an outage in a given area, so it's important that the process we use to make these changes can handle a device not being at the current reference: examining the state of the device and determining what changes need to be made to bring it to the desired configuration. Some changes as well are gonna be dependent on operational state. For instance, if you wanna make a change to the cellular internet connection on a device, but that's the active internet connection and the primary internet connection isn't present, that's not a change you wanna make at that time, especially if the consequence is losing the device. To achieve these goals, I've used a number of different approaches. Today I'm speaking about the Trigger-based approach that I'm using now, but I've used a bunch of other tools, and probably some of you have as well.
So RANCID, which I suspect is familiar to a lot of people as a tool to collect network device configuration and report on changes in it, also has the ability to run custom scripts using clogin, its tool for connecting to devices. So I started on that first migration project I spoke about before, writing scripts in Tcl and pushing them out using clogin. Tcl is not the greatest language ever. Two little pet peeves I discovered during that project: if you put a comment that includes a parenthesis, that will still be interpreted by the parser, though it does at least give you a warning. And if you do an if-elseif-else construct but you typo the elseif, and don't spell it in the specific way Tcl likes, which is different from Perl's or Bash's, it'll tell you that else is an invalid command, because I think it interpreted the typoed elseif as an anonymous function or something. Anyway, it's really bad. I transitioned from that to using Perl, and I moved all the logic out of the Tcl script, so I didn't need to do logic in Tcl. I'd get state from a device, I'd parse that state, I'd decide on a method of changing its configuration, and I'd generate a Tcl script that I'd then run with clogin. So it got rid of a lot of the pain of Tcl. It was a big improvement, but there was still much more ground to be gained. I've since cut back on abusing Perl and switched over to Python, which has a much more structured approach. I know it's possible to write clean Perl, but if I can't even write clean Python, it's not gonna happen with Perl. And then most recently I discovered Trigger, which is a framework for interacting with network devices. It provides a fairly elegant interface to connect to devices, but it is very much a framework. It's something that I've built tooling on, and that other tooling can be built on, either directly on Trigger or on the stuff I'll show you here today.
And that has really improved the methods that I'm using. So, in terms of the steps we have in the process of making a change to a network configuration: first, I'm gonna talk a bit about the approach that I use within the code that I've got, which is generalized whether you're doing reporting or making changes to the configuration. Then I'll talk a bit about the specific implementation of normalization. It goes a bit into detail on the Python code, but I think it's important to have an understanding of how that framework has been built. Then reporting, which leans heavily on the normalization approach, and then we'll talk about customization. I've built out in the presentation a concrete example of a new type of reporting that I did just for this talk, to show how easy it is to take the work I've done in understanding how these things work and carry it into customization for your own environment, which is a little more straightforward than the groundwork that I've done. And finally, we'll talk a bit about some improvements that I'd like to make to the tool. I left out something important about Trigger before. One of its big advantages compared to clogin is that it's able to connect to multiple devices simultaneously, and because of that, if you wanna process stuff for a thousand devices, it's a lot faster when you're doing them 10 or 20 at a time as opposed to connecting to them serially, especially when some of them are over cellular connections, or are not online and you're waiting for them to time out because they're not responding. Specifically, the approach that I've taken is to first collect the configuration and device state.
So we're looking at what the running config is on the device and what its operational parameters are, then parsing and structuring that information so that we have a clean representation, using objects and dictionaries, of the device state, which we can then analyze to determine if there are changes that need to be made to the config. If changes are required for a device, then you make those changes. I also store all of the information I collect from the device into a repository of device information. That's useful when you're doing reporting, because you can do reporting without connecting to the devices: you can reference that stored state information if you wanna see how many devices have a specific config in place. You have the information locally and you don't need to go out to the devices again. Inside the example code, we have four critical files. We have router.py, which has the core logic of this system. There's normalize.py, a script that will actually perform a normalization on a device. There's report.py, a reporting script. And I have a device list with three routers with highly traditional names. So if we look at how that... is that readable? Okay, good. In a previous presentation I saw in this room a few hours ago, the text was illegible. I think this should be pretty okay, but if there is a problem, please let me know. So I ran normalize. It prompted to ask, do you wanna run all sites, because you can select a specific subset of sites. I went forward with all of them, so I'm processing routers one, two, and three. One of them is down, couldn't be pinged, wasn't processed. And one of them didn't have the trigger ACL present on the device. The ACL that we're validating is example code: permit 1.1.1.1. A lot of the stuff that I've developed is very specific to the device configurations in my environment, so what I have here is a very generic example.
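The state repository described here can be sketched in a few lines. This is a hedged reconstruction, not the talk's actual code: the function names, the JSON file layout, and the per-device record keys are all illustrative assumptions.

```python
import json
import os
import tempfile

# Sketch of the device-state repository described above: after each run,
# everything learned about the devices is dumped to JSON so reports can be
# generated later without reconnecting to the field. All names here
# (save_state, load_state, the record keys) are illustrative assumptions.

def save_state(routers, path):
    """Persist per-device state (last contact, version, ACLs) as JSON."""
    with open(path, "w") as f:
        json.dump(routers, f, indent=2, sort_keys=True)

def load_state(path):
    """Reload stored state for offline reporting."""
    with open(path) as f:
        return json.load(f)

routers = {
    "router1": {"last_seen": "2016-01-23T14:02:00",
                "version": "15.4(3)M2", "acls": ["trigger-test-1"]},
    "router2": {"last_seen": "2016-01-21T09:30:00",
                "version": "15.2(4)M6", "acls": []},
}

path = os.path.join(tempfile.mkdtemp(), "device_state.json")
save_state(routers, path)
print(load_state(path) == routers)  # round-trips cleanly: True
```

A flat JSON file like this is exactly the "works at my scale" trade-off the talk mentions: trivially inspectable and diffable, at the cost of rewriting the whole file each run.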
I'll come back to that a bit when we talk about the improvements, though, because I think there are more common things that different people in this room might share and benefit from, things that could be done in a community way, even if a lot of things are gonna be specific to your own environment. So at the heart of the system, with Trigger, which is based on the Twisted framework, device interaction is handled with callback processing. The way the callbacks work is: we take the list of devices and we get the details about them. The returned information from that is then processed by a function that validates the device's state. And then finally we initiate the normalization of the devices. So a device that needs a change is then going to build out the change to its configuration. Okay, so I've called out the critical elements of the function here. In normalize.py we have get router details, which is an instance of a Trigger Commando object. It's got a list of commands that will be executed, and each device is checked to see if it's available (it tries to ping it); if it's available, it's returned for processing. Within the common router.py code we have the actual show commands that will be run on the device. So we're doing a show run include ip access-list, and we're getting a show ver, showing us the version of the device. When we take that information back to parse it, each device that was processed, along with the result set that was returned, is run through the validate function. The validate function first updates information on when the device was last contacted, which is used in the reporting.
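The collect, validate, normalize flow just described can be shown as a synchronous toy. In the real code, Trigger's Commando class (built on Twisted) runs these stages as asynchronous callbacks across many devices in parallel; here each stage is a plain function so the data flow is visible. Device names, the command string, and the structure are illustrative assumptions, not the talk's actual code.

```python
# Toy, synchronous version of the three callback stages described above.
# Everything here is an illustrative assumption standing in for Trigger's
# async Commando pipeline.

def get_router_details(devices, reachable):
    """Stage 1: skip unreachable devices, 'run' show commands on the rest."""
    results = {}
    for dev in devices:
        if not reachable.get(dev, False):   # stands in for the ping check
            continue
        acl_output = ("ip access-list standard trigger-test-1"
                      if dev != "router3" else "")
        results[dev] = {"show run | include ip access-list": acl_output}
    return results

def validate(results):
    """Stage 2: parse raw command output into structured per-device state."""
    state = {}
    for dev, output in results.items():
        lines = output["show run | include ip access-list"].splitlines()
        state[dev] = {"acls": [l.split()[-1] for l in lines if l.strip()]}
    return state

def normalize(state):
    """Stage 3: decide, per device, whether a config change is needed."""
    changes = {}
    for dev, info in state.items():
        if "trigger-test-1" not in info["acls"]:
            changes[dev] = ["ip access-list standard trigger-test-1",
                            "permit 1.1.1.1"]
    return changes

reachable = {"router1": True, "router2": False, "router3": True}
details = get_router_details(["router1", "router2", "router3"], reachable)
changes = normalize(validate(details))
print(changes)  # only router3 needs the ACL added; router2 was down
```

The point of the shape, chaining each stage's return value into the next, is that it maps directly onto Twisted deferred callbacks when you scale it up.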
Then we call the validate ACL function, which goes through the results of the show run ip access-list, splits that out, and uses a simple regex to determine if it's a standard or extended access list and to get the name of the access list, which is added to a list of ACL objects known on the device. We then have the functionality that normalizes the configuration. So for each device we call this normalize function, which is down at the bottom in router.py. We look to see if trigger-test-1 is present in the list of ACLs that were defined on the device, and if it isn't present, we set a variable indicating that the device does need to have its configuration changed, and append to the list of commands the changes that need to be made to the configuration. So you have a bunch of different checks, each of those checks decides whether a change is required on the device or not, and all those things are built together into a list of commands to be executed on the device. When we actually go to make the changes, we just go through the list of devices returned from that function, and if they do need to have their configuration changed, the commands that were built for that device are pulled out and used as the commands to be executed on the device. And then finally, we validate the return from that to make sure that the write mem that's executed at the end succeeded. On that thousand-device migration project, we ran into a few machines that, when you saved their config, reported that the flash was bad, which is always interesting, and I wonder how long that had been the case. Most of them were fixed with reseats, but not all of them. Finally, we store the state of the device; in my case, I'm just doing it with a JSON object. I take the list of objects and I represent it as a JSON object.
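The validate-ACL step just described boils down to one regex. This is a hedged reconstruction; the exact pattern and return shape in the talk's code may differ.

```python
import re

# Sketch of the validate-ACL parsing described above: a simple regex pulls
# the type (standard/extended) and the name out of each "ip access-list"
# line from the show run output. Pattern and dict keys are assumptions.

ACL_RE = re.compile(r"^ip access-list (standard|extended) (\S+)")

def parse_acls(show_run_output):
    """Return one {'type', 'name'} dict per ACL definition line."""
    acls = []
    for line in show_run_output.splitlines():
        m = ACL_RE.match(line.strip())
        if m:
            acls.append({"type": m.group(1), "name": m.group(2)})
    return acls

sample = ("ip access-list standard trigger-test-1\n"
          "ip access-list extended INTERNET-IN\n")
print(parse_acls(sample))
```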
If you had a larger number of devices, or you're doing something more advanced, a better storage mechanism might be required, but it works quite well for the scale of devices I work on, and would extend up to tens of thousands of devices fairly easily, admittedly perhaps not with the smallest file. For reporting, the output is fairly simple. You run report, it automatically uses all devices, and this report just shows the device, when the device was contacted, and the firmware version running on the device. Then I've run it a second time, where I've selected a specific device to be accessed, and you see in the report that the access time for that device has changed. So when you've got a large field of devices, you're not going to reach all of them every time, and knowing when information is from is extremely important in understanding its value or deciding how you want to act on it. This slide should have come before the previous one; it talks about that reporting process. Okay, so this is the overview of how that code actually works. We load the state information that's present right now, we identify the devices that require updating, we connect to the devices, we get updated information about them, and we generate the report. And you'll see that there's a lot of commonality between the normalization process that I reviewed before and the reporting process now. We're using most of the same functions that were present in the core router.py code, we just don't do the normalization. So we're doing the same validations, using the same validation code and the same commands to collect information, but we don't need to make any changes to the device. We have code to output a CSV, which is pretty straightforward: it goes through the different routers that are present and writes out when they were accessed, the device name, and the version. Now let's speak about adapting this to your own environment.
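The CSV output just described is only a few lines. This is an illustrative sketch; the column order and the keys in the stored state are assumptions, not the talk's actual code.

```python
import csv
import io

# Minimal sketch of the CSV report writer described above: one row per
# router with its last access time, device name, and firmware version,
# drawn from stored state rather than a live connection. Field and key
# names are illustrative assumptions.

def write_report(routers, fileobj):
    writer = csv.writer(fileobj)
    writer.writerow(["accessed", "device", "version"])
    for name in sorted(routers):
        info = routers[name]
        writer.writerow([info["last_seen"], name, info["version"]])

routers = {
    "router1": {"last_seen": "2016-01-23 14:02", "version": "15.4(3)M2"},
    "router2": {"last_seen": "2016-01-21 09:30", "version": "15.2(4)M6"},
}
buf = io.StringIO()
write_report(routers, buf)
print(buf.getvalue())
```

Writing to a file object rather than a path keeps the same function usable for stdout, a file, or an in-memory buffer as above.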
There's a lot that you can do to take the framework that I've made and implement the specific things you're interested in, things that need to be validated in your own environment. I suggest that you work first on creating small tests that explore individual, smaller elements of what you're interested in doing, and then build from that over time. You can also bring in information from other data sources. I do a lot of stuff where I'm pulling information from monitoring systems about historical device availability. In larger environments, you probably already have decent repositories of information about your network environment, and that can all be pulled together. So what I built, as an example of how relatively easy it is to extend, is something to parse which ACLs are applied to which interfaces on a Cisco or Cisco-like device, and this single slide actually shows the entire change that I made to do that. So it is a lot easier to add things on top of what I built than it was, for me at least, to build out that initial example. I've added in a new show command that looks at the running config and gets the interface sections, and then inside the validate function that we're calling to validate device state, I've added validation of the interfaces, a new function called validate interfaces. Inside there we create a dictionary of interfaces. I clear it at the start because I had some bad experiences with removed interfaces: because I didn't rewrite the entire object each time, it would cheerfully report really old information about interfaces that were no longer on the device. So that is important. Then, in the config, I go through the show run interface sections and pull out the interface names. When you find an interface name, it gets added to the interfaces.
And then, for each interface, if it's got an inbound and an outbound ACL, those are recorded into a dictionary underneath that specific interface. I've only done ACLs here, but obviously this could be extended to all sorts of different stuff: IP address configurations, QoS configurations. This part of the validation is something that a lot of different people have the same needs for, as opposed to how you take this information and validate its relevancy to your own environment. So we have the output from it; I ran it against one device, and we can see that the GigabitEthernet0 interface has an inbound ACL of test applied to it. There is a lot of room for improvement in what I've done. One thing is that I've done a fairly manual job of building this interface parsing code, but a lot of it is stuff that could be applied to any device in any environment, and it's something where more common code could be created. There's also a Python library called ciscoconfparse that will build a representation of a Cisco or Cisco-like device configuration more or less automatically. It doesn't have as much understanding of how the config works as something else might, but it understands hierarchical command configurations, with sections nested under each other, and it's supposed to do a pretty decent job. I haven't really played with it that much, but it is something that would be pretty cool. A much larger thing would be the development of a domain-specific language.
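The interface-parsing extension just described can be reconstructed like this. It's an illustrative sketch of the same idea, not the code from the slide; the regexes and dict shape are assumptions. Note the dictionary is rebuilt from scratch on every run, which is the fix for the stale-interface problem mentioned above.

```python
import re

# Sketch of the interface-parsing extension described above: walk the
# interface sections of a Cisco-style running config and record any
# inbound/outbound ACLs under each interface name. Rebuilding the dict
# from scratch each run means removed interfaces can't linger.

def parse_interfaces(show_run):
    interfaces = {}          # rebuilt every run, on purpose
    current = None
    for line in show_run.splitlines():
        m = re.match(r"^interface (\S+)", line)
        if m:
            current = m.group(1)
            interfaces[current] = {}
            continue
        m = re.match(r"^\s+ip access-group (\S+) (in|out)", line)
        if m and current:
            interfaces[current][m.group(2)] = m.group(1)
    return interfaces

config = ("interface GigabitEthernet0\n"
          " ip access-group test in\n"
          "interface GigabitEthernet1\n"
          " no ip address\n")
print(parse_interfaces(config))
```

This hand-rolled state machine relies on Cisco's convention that sub-commands are indented under their section header, which is exactly what ciscoconfparse generalizes.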
Right now, for all of the stuff that I've done, you need to actually understand Python, and unfortunately, while there's lots of people that know Python and lots of people that know networking, the overlap between them isn't as strong as it could be; there's not as many people that are, we could say, fluent in both. Bringing it to a domain-specific language could make it easier for networking people with less programming ability to work on this type of system and to benefit from it. You also need to avoid creating a domain-specific language more complex than Python, which, well, I think we've all seen examples where that could have gone better. Finally, another big area for improvement would be more community involvement. I was fortunate enough to be able to contribute back open source portions of the work that I've done, and I'd like to think that there's other people interested in this subject that might also be able to work on this, both for their own internal use and outside. There really doesn't seem to be much public work done in the network automation space, especially with open source tooling. I know that there's a lot of people doing stuff behind closed doors, and that there's things that can't be shared with the outside, but I think there's also a lot of room for more improvement and more change, and I hope that this talk will maybe get a few people interested in it and that we can start to build more of a community around this and other related projects. I have many people to thank for being here today. First and foremost, my wife, my kids, my family, who support me and put up with me. Jathan, who created the Trigger framework and also made it an open source project available to other people; without people doing open source work, there's nothing for other people to build upon.
Stack8, who's given me a great place to work and to develop my own skills and abilities. My good friend Henrik, who helped me with Python and actually didn't want credit here because of the quality of my Python, but I try, I try. Charlene, who really turned this from a much more bare-bones presentation into what it is now. The Trigger community on Freenode, and my friends in the Linux channel. And with that, we can go to questions, if you'll wait for the microphone. Why use Trigger and not use something like Ansible? Sorry? Ansible? Ansible. At the time that I did it, I don't think that Ansible had the support for the Cisco stuff. Also, I found it very, very hard to find anything on the subject. So when I was doing the clogin stuff and hating it, I was looking for people doing this kind of work, and I eventually found Jathan's presentation from three SCALEs ago that spoke about Trigger. I recently did learn about the Ansible Cisco integration stuff, like the bare SSH support. I'm interested in checking it out a bit more, and really, while I'm happy with the work that I've done with this, I'm also open to other approaches. I'm looking for the best possible ways to do this, the best ways to scale it, things that are more cross-platform, and the more stuff that I can do with it, the better it is. I heard about some interesting stuff involving Salt and network device automation as well, and I'm interested in checking out all that stuff. So while I'm presenting on Trigger today, and I'm very happy with what I've done with it, that doesn't mean we couldn't be talking about something else in a few years. Hi, this is not a question, but I did some stuff with Trigger where it would log into devices and parse gigantic config files. And a problem in general with some of these devices is that there's no grammar, and you don't necessarily want to write a grammar; there's no codified grammar for things.
So a lot of these things will actually dump XML, and if that's available, I would definitely recommend using that versus trying to parse. I know that Juniper devices do that. They have fairly strong XML support, as I understand it. Unfortunately, I haven't had much opportunity to work with them. I'm just saying, if it's available. So obviously something that gives you a structured form of the configuration of the device would be preferred. I know that Cisco worked on some stuff with onePK, which was supposed to be this kind of platform for automating their devices. I went to a presentation in their offices in Montreal, and many people were there for the lunch; I was there because I was interested in what they actually had to say, somewhat to the astonishment of the presenters. What I found, however, was that device support was very limited, and on a lot of non-Nexus devices it was also necessary for you to run T-train releases. Cisco subsequently... Oh yeah, no, I'm not saying, you know, that everybody does it. I'm just saying if it is available, it can save you a lot. And sometimes when you're parsing you can go line by line, but in some cases the configs are involved enough that you actually have to parse them into a parse tree and try to suss out things like this ACL is attached to this interface, stuff that is difficult to do without doing a full parse. And Jay, I think, can speak to this too, but they've done parsers that ended up being sort of Frankenparsers for multiple devices, and it's difficult to say, you know, is it correct for any particular device, right? In Trigger's method, when you define the device, and I didn't really show the details of the CSV file, you also state what platform it is, or what family of device it is, and that can be used to drive how you connect to the device, for devices that have non-SSH support.
And I'd imagine you could feed that information back in other ways as well. Yeah, yeah, no, I mean, there is parser support. It's just getting to the point where it was like, oh, this is a little different, oh, that's a little different. We'd actually talked about splitting it into very manufacturer-specific parsers. Anyway. You mentioned, and I apologize, I missed the first part of the talk, so maybe you covered this, but you mentioned that you were looking at other open source options, or you didn't find any. I was wondering if you looked at any of the prior work, such as NETCONF and YANG, which are standards-based protocols for manipulating configs on devices, or even OpenConfig, which some of my colleagues are working on, an evolving standard right now that is actually documented and available for driving device configuration in a vendor-agnostic manner. So, unfortunately, I did not hear that question very well at all, and I think the sound faces the audience better than the presenter, but I did hear you mention NETCONF. Yeah, so I mentioned NETCONF and YANG, which are prior art from about 10 years ago on programmatically driving device configuration, but I also mentioned OpenConfig, which is a new standards-based protocol being developed for device configuration in a vendor-agnostic way. I hadn't heard about those before, but I'd be very interested to check them out, and maybe if you want to approach me after, you could tell me a bit more about them. I'd welcome it. Any other questions? I have a bunch of references, starting with this presentation, which should be online on the website shortly, and I've got a relatively short URL if you want to grab it here. I've got a direct link that leads to the example code, which is in the GitHub Trigger repo, the documentation for Trigger, as well as the Trigger GitHub project page.
For people that don't know Python but are interested in this subject, I pulled out one well-known Python guide, Learn Python the Hard Way, which apparently is actually the easy way to learn Python. It's supposed to be a very good approach to Python, especially for someone who doesn't necessarily come from a programming background. I think that if you do have programming ability, Python should be fairly easy to pick up if you don't already know it. I've got my website and my email address and my employer, Stack8. If this stuff sounds interesting to you but implementing it in your own environment sounds hard, we could also talk about that. And finally, I got all the graphics for this presentation from a website called flaticon.com, which has a lot of really nice, clean little glyphs, great for presentations and other stuff. Here's the licensing on the different files. Thank you very much for your time, and if you're interested in talking more, now or at any other time, I really am interested in learning more about this, and hopefully other people have things to share too. Testing, testing. Hello, hello. How's the volume? Can you hear me okay at this level? Okay. All right. Cool. My name is Dave Boucher, from SaltStack, and I'm really excited about this talk. It's a fun one for me. It's kind of crazy stuff, things you'd never do in production or in real life, but it gives you an idea, a vision, of the possibilities of things you can do with Salt. Who here has used Salt, either in a testing or production environment? Okay, so about 80% of us, and that guy probably shaking his head. Yeah, he does. Okay, so I'm going to talk a little about some of the very basics, some of the general use cases that people use Salt for. And then I'm gonna talk about some of the other pieces of Salt that people often miss when they think of config management.
They don't realize there's so much more to Salt than just that. And I have a couple of demos that will be interesting, and you'll get to use your cell phone. If you have a cell phone down here, everybody will get a chance to participate. So: beyond config management. All right, Salt basics. Generally you would use Salt by running a Salt Master that controls other servers, which run an agent called the Salt Minion. If you have a situation where you don't need a constant connection to your servers, you can use Salt SSH to have Salt work across SSH, in a one-time way, not keeping that constant connection all the time, but just going over SSH when needed. So this is what it typically looks like: you have your Salt Master, you have a couple of Minions here connecting through ZeroMQ, and then the Salt Master also reaching out to one server through SSH. If you'll notice, the arrows point from the Minions up to the Master, because when you're using ZeroMQ, Salt uses a pub/sub interface and the Minions actually connect to the Master. So the Master needs to have two ports open for the Minions to connect to, and the Minions listen on that publish port. Whereas when you're using SSH, the Salt Master does reach out to your Minions and does things. Using ZeroMQ and this pub/sub interface allows Salt to scale into the tens of thousands of servers, because the Master doesn't have to specifically manage every server it's connecting to; that's why it's done like that. You can also use the Salt Minion by itself, with no Master at all, so everything's just done locally. Sometimes I'll do that on my laptop: sometimes I'll have a Master and a Minion running on my laptop, but other times I'll just use the Minion by itself to configure my laptop. People do that in production as well.
If for some reason they can't have or don't need a Salt Master, they'll just use the Salt Minion by itself and do all the configuration locally; they'll pull it down from S3 or other locations. A syndic allows you to have hierarchies of Masters. So you can have a top-level Master that can see everything in your entire infrastructure, but then maybe you have a lower-level Master in different data centers, or maybe different business groups have their own Master down below, and that gives you a high-level view from the top Master but lower-level control down below. It also helps with scaling. Minions can also talk to multiple Masters. So you can have a multi-Master situation where they'll accept commands from several Masters. Some people use that as a kind of poor man's failover for the Master, or for different people needing control or visibility into your servers. So, advanced Salt features. One of the things that really differentiates Salt from others in this space is that fast, encrypted, dynamic connection between the Master and the Minion. We have an event bus that runs locally on the Minions and also runs from the Minions up to your Master. So events, things that happen in your Salt infrastructure, get passed up to the Master for various purposes. Salt itself sends lots of events: when a Minion starts up, when a new Minion tries to connect to the Master, there'll be an event. There's a whole variety of things that will cause an event to be sent up to the Master. You can also send your own events, so you can have your own application send an event to the Salt Master, and there's a whole variety of things you can do with that. One of those things is a beacon. A beacon is a small service that runs on your Minion and monitors something. We have beacons that'll monitor your file system, that'll monitor your SSH logins, a whole variety of things.
And we'll be talking about a Twilio beacon that monitors a Twilio text message queue and sends an event on the event bus. Next, reactors and engines. The reactor sits on your Salt Master, listens to that event bus, and allows you to react to those events. So maybe there's that one stupid application you have to manage that needs to be rebooted every time it hits a certain memory level. You can check for that, and instead of having to manually go in and do it yourself, an event gets sent up to the Master saying, hey, this dumb application needs to be restarted. The reactor sees that event, restarts your application, logs it somewhere, and moves on; barely a hiccup. Just a simple example. Engines are similar to that, but they're a Python interface, so you can create your own little programs that listen to that event bus and do things in Python. There's a little more flexibility there. Tonight we'll also use the Salt API. You can turn on a REST interface that allows you to reach out to Salt from any other application. So if you have an existing CMDB, or an application that manages things on your network, all the possibilities of Salt are now available to your application. We have a Python interface as well as this REST API, and that makes for some really interesting use cases. Again, we'll demo that a little bit today as well. So, just as an overview: beacons and engines run on the Minions; reactors, engines as well, and the API run on your Salt Master. Okay, now we're going to do live demos. I had to rebuild the server last night to make sure everything works smoothly.
Okay, so I have a Master and one Minion that I'm controlling here, and you're probably familiar with this: we have this whole slew of modules that you can run. On my Minions I'm going to run disk.usage, and we get that data back almost instantly. Let me increase the font size a bit; this is how we work at SaltStack anyway, over in the engineering section all the lights are off like this. You can also get that data back in different formats. Say I want it in JSON so I can ingest it with some other tool: there's the same command with output in JSON. You can get it in raw text, or in the raw Python data structures that it comes back in. Everything in Salt is a data structure, and that allows you to do a whole variety of things, including managed output like this. You can also use cmd.run, which allows you to run any arbitrary command on your systems. That was one of the things that got me really excited about Salt when I first found out about it: I felt like I had this power. Finally I could reach out and know what's going on on the servers in my infrastructure. What's the disk space looking like? There's this one server we're constantly checking; we have monitoring, but what's going on right now? You can run these commands, get that data back, and do something with it really, really fast. In fact, the remote execution was built first. When Salt was first written, all it was was remote execution, and that informs the rest of Salt. All the Salt states, the event bus, everything is based on that initial remote execution that Salt provides. This is a brand new server, so let's run a Salt state, just for those who haven't seen one before; let's do an Apache server here. This is a simple Salt state: apache2 is the ID declaration, which will also become the name of the package we want to install.
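The state he's describing would be a two-line SLS file, something like this (the file name `apache_server.sls` is my stand-in; the talk doesn't give one):

```yaml
# apache_server.sls
apache2:          # ID declaration; doubles as the name of the package
  pkg.installed   # only acts if the package is missing
```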
We have a package state, and installed is the function that we're going to call. Notice that it's "installed": if the package is already there, it won't do anything, but if it's not, it'll do the right thing. So let's install this thing. We run state.sls apache_server, and the amount of time it takes here is mostly just apt checking to see if the package is installed and then installing it. I know for a lot of you this is the very basics, but I wanted to give a quick overview of how this works. It's probably doing an apt update as well, as someone mentioned. So that's a simple Salt state. Okay, there we go. It took 32 seconds and installed all the dependencies we can see here. Again, this is all a data structure, so I could have gotten this back in JSON, but you can see we got the new apache2 package, it installed all the dependencies, and it gave us information about that. Now at any time I can run that again, and it will do the same thing: it'll check to see if it's installed, and since it already is, it's not going to do anything. I can also come here on the command line and do pkg.purge. On the command line I'm using the actual execution module, and this is actually going to do it; it's going to remove that package. There we go, we're done. A friend of mine who runs a bunch of the websites for the major newspapers in Salt Lake was on a train heading home when Heartbleed was released a while back. He hopped onto the crappy Wi-Fi on the train, ran a little check using this to see which servers were vulnerable, got that list, ran the command to fix all of them, and the whole process took six seconds to actually do. And seriously, that's one of the things, literally, that got my attention about Salt: the fast control that you had, and the simplicity. Okay, so now I'm going to talk about beacons.
A beacon, like I mentioned before, is a little service you can start on the Minion that allows you to watch something. I'm going to set up a beacon that will watch a specific file for any changes, and when it sees a change in that file, it'll send an event up to the Master saying, hey, this file changed. And I'm going to set up another one that will look at a directory and notify on any changes there. So here we have our beacon definitions. I'm using the inotify beacon; this does require the pyinotify Python module to be installed. For any event, anything that happens to this /srv/blah.txt file, Salt will send a message up to the Master. For the /srv/testing directory, I'm just going to look for the close_write event and send an event up to the Master, and we're going to recurse through any directories that are created and auto-add any new files. The interval says we're going to check every 10 seconds. Okay, down here I'm going to start a little listener so you can see what's happening on the event bus; this is a great way to see what's going on when you're debugging. Let me go into the /srv directory. Here's that blah.txt. Let's open this thing up, add another line, close it. And you can see right there, let me increase the font size a little, we got a notification that the file was modified, the change that happened, and the path to the file. Going to that testing directory, let's create a new file. You can see here in /srv/testing we had our close_write, another write, close. You can also see that the swap file Vim created was modified there. So what could you do with this?
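Roughly what that beacon definition looks like in the minion config (the exact schema has shifted between Salt versions; this follows the list-based form, with the paths from the demo):

```yaml
beacons:
  inotify:
    - files:
        /srv/blah.txt:
          mask:
            - modify          # any modification fires an event
        /srv/testing/:
          mask:
            - close_write     # fire only when a file is written and closed
          recurse: True       # watch subdirectories that get created
          auto_add: True      # start watching new files automatically
    - interval: 10            # check every 10 seconds
```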
Let's say you have some sysadmin or developer who will not stop logging into production servers and editing files, who is always getting in there and modifying things without going through your change management. You could watch your /etc directory, and any time there was a change, an event would be sent to the Master, and you could clobber it and say, nope, redo it. We had one of our clients do the same thing. You can send an alert through PagerDuty, announce it to the world. You can also check for SSH logins and things like that. So if someone logs into production, say a web front end they should never be on, you could log them out and kick them off the server immediately, or you could just notify PagerDuty, notify their boss that someone was misbehaving. Do you have a question? The question was whether you can spam the company when somebody modifies a file they shouldn't. An automated pink slip, I guess? I don't know. Krister, yeah, so all this does is send an event on the event bus. Then you can use an engine or a reactor to do whatever you want to do. It could be anything; it could be adding something to your CMDB. Yeah, exactly. So Krister was talking about his security module for file integrity; this could be part of that. You know that file shouldn't be changing at four in the morning, and you can get a notification. Then you could log that in whatever monitoring system you have, and depending on the file, maybe fire an alert if necessary, just because you have visibility into that. There are a variety of beacon modules. If you come here to the beacons section, we have a disk usage beacon. That one allows you to set thresholds for notifications, so you can say, hey, if you ever get above 85% on these specific file systems, send us a notification, that type of thing. There's the inotify one we saw here, and a journald one for systemd.
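The disk usage beacon he mentions is configured with per-mount thresholds, something like this (the mount points are just examples):

```yaml
beacons:
  diskusage:
    - /: 85%          # fire an event above 85% on the root filesystem
    - /var/log: 90%
```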
Same thing with load: you set your criteria for when you want a message sent. People sometimes use that for load balancing. A certain server gets maxed out, your load gets so high, and that leads you to spin up another VM to help manage the load. There are service and process beacons for watching what's running, a whole variety of things. WTMP is for your logins, that type of thing. And we're going to use the Twilio text message one today. Let's look at the code for these; you'll see that they're extremely simple. If I come down to this Twilio one here, you'll see that we have some documentation, we have some imports, we have a validate function, and then the only requirement is that you have a beacon function. You can see here we have our documentation. And this is the entirety of it: that's 30 lines of code, including logging and nice clean formatting, to check for these text messages, as well as a check to see if they have images attached. So if you had something else, say an application that you wanted to monitor locally on your server, you very easily could create your own beacon to run your checks and send events as needed. Any questions on these beacons? Sure, yeah. The question was whether you could have a beacon watch a log file and, if there's a specific error, notify syslog or something like that. You definitely could do that. So here we have our Twilio account config. I think I exposed my account credentials in a previous version up on the file server, so I'll change those before this gets on the internet. This is live streamed? Okay, well, we'll be changing that quickly. Again, you set the interval. This WTMP one again is for checking logins. Actually, the first time I gave this talk there was a guy from Apple in the audience, I didn't know he was from Apple, and I forgot to take it off my server.
For the next day and a half he kept getting text messages, because whenever somebody logged in it would send a text message, and he texted me back, like, hey, could you take me off that server? So those are beacons. And again, you have reactors and engines that can do things with those events. In the demo I'm doing today, I'm not actually going to use a reactor or an engine; I'm going to have a little webpage that the Salt API exposes, and the JavaScript is going to bind to the event stream, listen for events from my Twilio account, and update the webpage when they happen. Just a quick view of this: my friend and co-worker Seth House actually wrote this in like four minutes for me; he's more of a JavaScript guru than I am. Here we're going to listen to the events on this EventSource, and then we're going to create new elements on the webpage. So let's see what that looks like. Okay, so here's our simple webpage, also promoting SaltConf in April. You've got to come out, it's going to be awesome. A few of you have been to SaltConf, and it's a good time, a lot of learning, and it's really a fun conference. And was that you giggling? Oh, that was Charles in the back. So if you text that number up there, you'll see the text show up here. You can send a picture text as well, but please keep it G-rated; this will be on the internet. So again, what's happening here is my server is running a little beacon that's checking my Twilio account every five seconds, and when it sees a text, it sends an event on the event bus, and there's a multitude of things we could do there. We could have reactors firing, and you can do many things with each event that comes through.
In this example we're just listening to that event stream on the Twilio events, oh, that's a big one, and dumping them to the webpage here. Obviously you wouldn't want to do this in production, but maybe you have a monitoring system, you have PagerDuty, you have a whole variety of things you might use to monitor, and you can listen to these events and do things with them. Any questions so far? Yeah. Good question. All I did was add the config file for it and then restart the Minion service, and it processes all the beacon definitions. Yeah, you easily could have that run a command: you could take the text of that event and use cmd.run from the event loop. Obviously, all of the security precautions you would take when accepting input from external sources, you would want to take here as well. But this kind of thing you could do: if you have a ChatOps type scenario where people want to deploy from Slack or from IRC, you can do that with Salt. You could also set it up so they could text something and deploy from Salt. We had a customer with a Windows application that would frequently lock up, and we worked with their first-line tech support. It was a pretty simple thing to do, but their first-line tech support people weren't technical enough to log in and restart this thing without screwing things up, so they kept escalating it to someone who could do it correctly, and it was fairly frequent; a lot of customers would call in about it. So we worked with them to add just a red button in their tool, and the first-level tech support could log into the customer's account when they said, hey, my thing, whatever, is not working, and hit that button.
So they just hit this button, and the higher-level tech support people never had to bother with that again. It gave them the ability to take care of the customers without having to manually go in and fix their problems. I was going to text from here as well; I usually send it from my phone, but I get no bars in here at all. Any other questions about this? Yeah, go ahead; sure, let's get to the microphone. First of all, the message queuing model that Salt uses seems to be quite different from any of the other configuration management systems that I've seen, and as I understand it, it scales extremely well. What kind of volume can you actually handle, in terms of agents connecting into that message queuing system for simultaneous processing on large fleets of devices? Really good question. It's one of those things that kind of depends, but we have people using up to 18,000, close to 20,000 Minions connected to one Master. There are some difficulties when you get to that scale. Sometimes it's just the flood of data flying back: your NIC can't handle it, your operating system can't handle that much data. But there are ways to ameliorate that using Salt Syndics: you have lower-level Masters that handle all the cryptography and communication with the Minions themselves, then wrap up all the results and send them back up to the high-level Master, that type of thing. Places like LinkedIn, I mean, every time I talk to them they're bigger, but at one point they had like 30,000 servers, and I think like 100,000 servers now. Not necessarily all of them running off one Salt Master, but a couple of different Masters. It's really common for people to run Salt on one, two, three servers, and then we have a lot that are in the five to eight thousand range very comfortably. Cool.
I'm also very interested in network device configuration management; I don't know if there's anything you can speak to about Salt being used in that kind of context. Sure. So we have some modules that will reach out to network devices. There's a somewhat new feature of Salt called a proxy minion, which runs on a regular Minion and lets you manage something like a network switch that you can't install the Salt Minion on. We're working with Adobe, I hope I can mention this, on a module that talks to some server hardware so they can configure the hardware and then bring up the servers, that type of thing. The proxy minion handles the communication with the device and then allows you to use all of Salt's modules as if it were a server, or at least the small subset of things that make sense on that type of hardware. Yeah, you're welcome. In the Twilio demo, emoji were actually breaking it, because the code was trying to cast everything to a string, so I had to go in and hot-patch it. You're getting close to the line, people. Go ahead. When I run a command against all my minions, and I don't have that many, maybe 60 or something, the first time, if it hasn't been run in a while, a couple of the minions fail to respond; but if I hit it again, everything will be fine and happy for the next while as I'm using it. Is that normal? Are they like sleeping minions you have to warm up? Exactly. So what happens is, by default, every 24 hours the Salt Master rotates the encryption key it uses for communication on the publish port.
If the Minions haven't communicated with the Master since then, the next time they connect to the publish port they can't decrypt the payload, so they'll request a key refresh. So if the rotation happens at midnight and your Minions haven't talked to the Master by 8 o'clock in the morning when you log in, the first time you run a command it can throw that off, and you have to run two commands to get them all working. There is an option to have the Master do a test.ping after the key refresh so that it automatically updates all the keys. But that's what's happening there. Also, is there any kind of testing framework for Salt? So Salt tests itself internally, because states inherently test themselves to an extent, but we don't have an official framework yet; we're working on ideas on how to do that well. There is a plugin for Test Kitchen that an employee of Hewlett Packard wrote, so you can use Test Kitchen, which is a Chef tool, to test your states: build up a VM with all your states and then make sure they're behaving appropriately. One of the ideas we have is to be able to specify success parameters right next to the actual states themselves. So as you build up your states, you can specify: I spin up my web server, and when I access port 80 I should receive this type of text, that type of thing. But it's still in the design stage at this point, and I'm not sure if it'll even be gotten to this year, so for the moment that's all we have. Any other questions? Okay, so you get immediate audience feedback; it's awesome. Yes, and UtahDave is my IRC handle. I'm utah_dave on Twitter, because there's a real estate agent that has UtahDave. He actually yelled at me on Twitter once: who's using my copyrighted nickname, or whatever.
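The option he mentions for avoiding that stale-key hiccup is `ping_on_rotate` in the master config:

```yaml
# /etc/salt/master
# After the periodic AES key rotation, ping all minions so they
# re-key immediately instead of failing your next real command.
ping_on_rotate: True
```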
I've been UtahDave longer, yeah, thanks for that. I actually don't have any reactors on this server. So a reactor is a simple YAML file that sits on your Master, and you specify a tag that you want to listen for. You can use Jinja to look at pieces of the data structure of the events coming through and do things with it. Now, when you're creating these, the reactor loop needs to be really, really fast. You do have access to Jinja, which means you have access to the Salt modules and states and all that, so technically you could run Salt commands and various things from there, but you do not want to do that, because you want the reactor matching on these tags really, really fast. When it matches a tag, it spins off another worker process to actually run the commands you want. All right, so here is a reactor that was listening for the Twilio thing. Or actually, no, this one is to send an SMS. In the other demo I was doing, every time somebody logged into the server, it would send a message saying so-and-so logged in; this is what would make that happen. This first thing here is the ID declaration; it's just a name I made up, and it could be anything you want. "local" means I'm going to run a Salt execution module, which is twilio.send_sms. The target is going to be boucher, which was the ID of my server. I had to pass in some arguments: my Twilio account name is just the arbitrary name I'd given the credentials from my Twilio account. And then this here, the information from the event that was sent, is inside this data structure available in Jinja; the double braces indicate we're inside the Jinja environment. So I'm going to get the hostname that was logged into, the ID of the server, and the timestamp. I'm wondering if I have that wrong.
Seems like it, yeah. I haven't used this in like a year, so I'm not sure if that should work, but essentially here I'm creating the text that we're going to send. I think this could also just be "name"; it depends on the data structure, so I would have to look at the data structure that gets sent when someone logs in. That's essentially the text we're sending. Then this is the "to", which is my cell phone number, and this is the "from", which is my Twilio number. So what that means is this gets called whenever a specific tag is matched. In your master config, and actually maybe I do have one, you would have something like this. "reactor" is the config section, and then you have a tag you're going to match on. In this example, the tag is salt/minion/*/start, where the star is an actual wildcard; in reality, when a minion starts up, you would see something like minion01 where that asterisk is, and then "start". Every time the Master sees that tag, it will call first this start.sls and then this monitor.sls. In our example here, we would have a reactor that looks for the tag that came up from that login beacon. You can also send your own events. Here's another example: when you get an event with a specific key, you would take that key and delete it, you know, with this execution module. So really, you have the full power of Salt to do whatever you want: you can run states, you can do a whole variety of things. People use reactors to fix problems that have known solutions and happen regularly; the application developer just won't fix that thing, so you can have Salt fix it for you. You can send events, so if something happens, you can have it fixed. They fire in order.
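The master-config mapping he describes would look roughly like this (the SLS paths are illustrative):

```yaml
# /etc/salt/master
reactor:
  - 'salt/minion/*/start':        # tag to match; * wildcards over minion IDs
      - /srv/reactor/start.sls    # fired first
      - /srv/reactor/monitor.sls  # then this one, in order
```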
They start at the top and work their way down. You can also do things with engines. I haven't actually had a chance to use engines yet, but they give you a Python interface; it's like your own little program, so you can do whatever you want within a little Python program. There was something else, but I forgot what it was. Okay, anyway, that's pretty much it. Yeah. Well, they're more complementary than either/or. A beacon is the thing that's actually checking your file system, and it generates an event. It's dumb in the sense that it doesn't know what's going to happen next; you don't configure a beacon to do anything. It just checks this thing, checks it, checks it, and sends an event when the conditions are met. Then you use a reactor or an engine to listen for those events and do something. That way you get a separation of concerns, because you may not want that logic running on that server; it could be in a semi-hostile environment, or you may not want the server to even know what's going on. So it sends an event, and the Master takes care of doing something. Yeah, a Master always runs the reactors. Actually, you know what, I think you can run reactors on minions in standalone mode now as well, but in the general case the Master runs the reactors. Great question. Well, that's pretty much the end of my talk; I'm going to stay and answer questions. Yes. So the question was, when would you use a reactor versus an engine? I think you'd use an engine when you need more programmatic control of everything that's going on. When you're using a reactor, it's simple in that the reactor listens for a tag and then executes this list of SLS files, and those SLS files are still essentially YAML with some Jinja.
For more straightforward things, it's pretty easy going. Sometimes it can be a little obtuse trying to figure out the data structure of the event that came in, and that can be a little tough. Whereas when you write an engine, you're just in Python, so you get to decide the whole flow of what's going on, and you get more programmatic control. My experience has been that I like to keep Jinja templating to a minimum; if you're doing a whole bunch of it, you're going to get into a big mess and it'll give you a big headache. So if you were doing complex things, I would look at using an engine. You get to define what complex means to you, but that's where I would draw the line. Yeah. Sure. Great question. The question was: there are a lot of configuration management tools out there, so what led to the creation of Salt, and what are the main differences? So Tom Hatch was the founder of Salt. He worked first for a number of large federal government agencies, managing large, large infrastructures, and also for a music startup that had some large infrastructures. There were a couple of things he just craved as the systems architect. One was to know what's going on with the servers. They would keep track of servers in a CMDB or just in a spreadsheet, and it was always out of date, right? Everything should have this version of the software, but you go check, and in reality things have happened: people have updated things, some things auto-updated. You never knew what the current status was. So he really wanted to have that remote execution; he wanted fast control over his servers. Those were some of his main desires. And he had used basically all the tools that were out there at the time, too.
And so that's where Salt started: really fast remote execution. That's one of the things Salt stands out for; the remote execution is really fast, and it's not just bolted on after the fact, it's the core of what Salt is. Salt is also very pragmatic. In the first few years of SaltStack as a company, I did a lot of the professional services, working with big infrastructures: JPMorgan Chase, LinkedIn, various places. Well, I didn't work on LinkedIn's infrastructure, but I worked with them on training and things. And it's rare that you can come in and just swap everything out. You always have people who are like, well, this little team over here is using Chef to configure our servers, and we're not changing anything, we're not touching it; it's not going to happen. Then there's another team using another tool, and some team has their own homegrown Bash, Python, Perl behemoth that they've got going. It's rare that you can just swap it all out. Salt makes it very easy to use it where it really makes sense for your infrastructure. It's very pluggable, both in terms of what Salt can do, so if Salt doesn't do something you want, you can create a new module and state for it very, very easily. Especially for an execution module, it's very basic Python; people who have never programmed before, or people experienced in other languages, you know, the Ruby developers, find it very straightforward to get involved with the Python. But it's also pluggable in terms of connecting to Salt. We have a whole bunch of external authentication modules. Some people in this room have written their own external authentication modules, or pull configuration data from their existing CMDBs. They have their own CMDB that they wrote, and it's not going anywhere.
That's the source of truth for a lot of what they do. They just wrote a module, and now Salt pulls data out of their CMDB and it fits right in. It's very easy to tie into everything you already have in your infrastructure. We have Puppet modules and Chef modules that let you manage chef-solo runs and Puppet runs. Wikipedia does that: they use Salt to orchestrate all their code releases and all their Puppet runs, but the last time I talked to them they had something like 30,000 lines of Puppet manifests, and they just didn't have the time or the desire to swap all that out. So they use Salt to manage and scale their Puppet stuff. With Ansible, what do they call it, their list of servers? Playlists? No, not the playbooks, the list of servers they run commands against; I'm blanking on the name, but Salt can use those to run commands on servers. We're very pragmatic about integrating with everything that you want. There are something like 15 or 16 different pluggable interfaces that Salt lets you touch and add onto if you need. One of our big customers in San Diego was spending like $5 million a year on the previous configuration management software they were using, and they swapped it out for Salt. And the fact that they could get to the source code meant they basically entirely revamped one of the modules. Everything else pretty much worked great for them; there was one module that just wasn't quite doing what they wanted, and the fact that they could do a pull request, make it work right, and fix that bug was just fantastic for them. They loved it. They have the support of SaltStack as a company, but the expertise they have on staff, sysadmins and developers, can go back into Salt itself.
In fact, a lot of the time when you're running Salt, you're running code informed by experts at places like LinkedIn who are managing 100,000 servers. A lot of the scalability Salt has came from some of the larger users running into problems of scale, because we don't have 100,000 bare-metal servers sitting in a closet at SaltStack headquarters to test against, right? So the best practices and lessons learned from some of the top people in the world flow back into Salt.

Take the reactor system. There are people here in the room I was working with on re-imaging: they had a tool that could re-image a server, put a whole new OS on it, and change it from, say, an application server to a database server or vice versa. We were working on automating that. Then it occurred to us: what happens if something goes crazy and the system decides it wants to re-image 400 database servers? That could cause a major problem. So we added a queuing system. I'm not sure if we got it fully implemented there, but the idea was to add requests to a queue, so if there were, say, five servers that needed re-imaging, okay, that's normal, but if suddenly there were 400, you could say stop and handle it in an orderly way.

I could go on and on, but I've done professional services with some incredibly talented people, and I've never used a tool that was this flexible. I've hardly ever run into a situation where Salt just could not do something.
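The throttling idea can be sketched in a few lines of plain Python. This is just the logic of the safeguard, not Salt's actual queue runner; the class name and the sanity threshold are invented for illustration.

```python
from collections import deque


# Hypothetical re-image gate: queue requests, and only release them
# automatically while the backlog stays under a sanity threshold.
class ReimageQueue:
    def __init__(self, sane_limit=5):
        self.sane_limit = sane_limit
        self.pending = deque()

    def request(self, server):
        """Record that a server wants to be re-imaged."""
        self.pending.append(server)

    def release_batch(self):
        """Return the queued servers, or None (stop and wait for a
        human) if the backlog is suspiciously large, e.g. 400 at once."""
        if len(self.pending) > self.sane_limit:
            return None  # stop: require operator approval
        batch = list(self.pending)
        self.pending.clear()
        return batch
```

The point is simply that an event-driven system gains a choke point: normal trickles flow through automatically, while an abnormal surge parks in the queue until someone approves it.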
That's correct. Even in our enterprise offering, Salt itself is Apache 2.0, completely open source. For a couple of years we provided packages specifically for the enterprise customers we supported directly, and we're moving instead to a framework where they use the same packages, plus a tool we're building called RaaS, with an associated web GUI, that sits on top of your masters. It keeps all that data at really large scale, so you can scale out to tens of thousands of servers and view them at a pretty massive scale. That lets us provide value for enterprise customers in those difficult, large-scale situations while everything else stays open source. So if you contribute code back to Salt, it is all Apache 2.0; you're not going to contribute code back and suddenly find it only in the enterprise version. It's all open source and available for use. That said, we do now have support offerings for plain open source, so if you have, say, 500 servers and you'd like support but you're just running straight open source, we have tracks for that as well. For a long time we did require you to buy enterprise to get support, but we don't at this point.

Yeah, so the question was the difference between writing a formula and having everything in your top file, or I guess in your file roots. We have a GitHub organization that we keep all of the Salt formulas in, and I think we're into the hundreds now. A Salt formula is a generalized, best-practice way of doing something. At this point they're pretty much all community supported. We do provide some support, but our community has really rallied around them and done the work.
So basically you can clone that Git repo, or you can point gitfs at it. Most of the formulas, straight out of the box, will give you an installation of, let's say, MySQL with some sane defaults, and they're set up to let you pass in variables to configure that, typically through the pillar system. So any time you have a generalized thing, Salt formulas are a good way to look at it. Say you have a MySQL expert on staff and she has everything tuned just right. Other people in your organization can use her Salt formula to do the right thing without having to be MySQL experts themselves. Whereas maybe somebody else has a little httpd state that only they use, so they just stick it in the file roots on their master and nobody else really cares about it; they wouldn't necessarily make that public, or even public within the organization. The good thing about using the ones online is that, again, you're using something that people have already gone through the pain of experience to get going. Obviously you need to test them to make sure they work according to your needs, but it's often a good way to get started on something complicated without doing it all yourself. That's essentially what a Salt formula is.

Yeah. There are three books out right now. Actually, no, there are four, and they're all pretty good. Two are by Joseph Hall, who is one of the original four co-founders, or original engineers, of Salt. One is kind of a getting-started Salt basics book, and he just released another one on more advanced use cases. I haven't read the second one yet, and I'm not sure what its name is. There's another great one written by Colton Myers.
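The "sane defaults, overridden through pillar" pattern that formulas rely on can be illustrated with a small Python sketch. Real formulas express this in SLS and Jinja (typically a `map.jinja` merging defaults with pillar data); the key names and default values below are invented for illustration.

```python
# Hypothetical defaults for a MySQL-style formula; any values the user
# supplies under the "mysql" pillar key override them.
DEFAULTS = {
    "port": 3306,
    "bind_address": "127.0.0.1",
    "max_connections": 151,
}


def effective_config(pillar):
    """Merge user-supplied pillar data over the formula's defaults."""
    config = dict(DEFAULTS)  # copy so DEFAULTS itself is never mutated
    config.update(pillar.get("mysql", {}))
    return config
```

This is exactly why the expert's formula is reusable: consumers only override the handful of keys they care about and inherit well-chosen values for everything else.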
And a former engineer from LinkedIn wrote one as well. I'm drawing a blank on the name of that one, but there are three or four books out.

Yeah, we have some Salt material there, but we haven't done a lot of video tutorials yet. We do provide Salt training, both in our office and in yours, and we also do it remotely; we have a remote video conferencing system that works really well for training. It's actually really great training. Our founder was actually at Red Hat for a long time, and the training class follows that mold. How do you find it? If you go to www.saltstack.com, we have information on training, and there's a guy in the back here who would love to talk to you about it as well; he's got the green jacket on. I know I'm very biased, but it's one of the better trainings I've ever been a part of. We cover everything Salt can do, and you get your hands dirty doing all the configuration and even extending Salt to an extent, so you really come out with a great boost. In fact, when I was doing professional services, I would basically make customers do the training in the first week, because having that background helped them understand everything we were doing.

All right, I think we're out of time. I can stick around for a few minutes afterwards.