All right, thank you all for being here in the SysAdmin track. The track is sponsored by StackIQ, developer of Stacki, an open source bare metal Linux installer. So thank you all for being here. We have here today John Merritt, director of managed service with Stack8 Technologies, doing a talk on network device configuration standardization with Trigger, as you can see. So thank you very much for being here.
Thank you. So I'm talking about a subject that's very interesting and important to me. And to lead things off, how would you like to be able to reconfigure 1,000 sites or devices in an hour instead of over 20 days of manual work? That's the type of challenge that I was facing that led to me developing the tooling that I'm presenting here today. So I was working on a migration project for a customer. They were moving between two different DMVPN infrastructures and they engaged us to help them move through that process. And they provided us with the scripts that they'd prepared to do it, which needed to be customized for individual devices, and you needed to validate a bunch of operational state. So say that it took you 10 minutes to make that change on a device, to check and make sure everything was good, and to paste in the substantial volume of new configuration required at a rate that didn't cause IOS to start ignoring half of the commands that you typed. What I did then was I built out a script that validated some show commands, made sure that everything worked well, and made the change. And we were able to automate that process and bring it down to a couple of seconds per device after connection. And I've developed those same techniques through our different practices and projects with customers and in managed service where we're managing environments with large numbers of remote CPE devices. 
And the stuff I'm going to show you today is really kind of the culmination of the work that we've done on that in terms of going beyond even individual changes to devices and into a structured configuration management approach for network devices. So let's bring this over here. The agenda for the day, first I'll tell you a bit about myself. I'll talk a bit about my feelings on the state of the networking industry. We'll look at the approach that I propose for network device configuration management. And that'll run through the structure of the code that I've created that's now integrated into the trigger project. It is something that may be a little less accessible for someone with more network background and less Python background. But you'll see in the subsequent section on adapting it to your environment that once that framework is built, extending from it is a lot easier. I'll talk about a few planned improvements that need to be made to the tool. And finally, we can take a look at the questions that the audience will hopefully have about the subject. So I myself studied programming when I was in college. And when I entered the workforce, I started doing system administration. It's now been distressingly about 20 years that I've been working. And in that time frame, I worked more and more on system stuff, automation, and then I got more into networking. And when I changed employers a few years back, I then got into much more managed service and a lot more large scale network device automation. My responsibilities at Stack 8 where I work, I'm the director of managed service. So I'm responsible for working with our environment to help our customers manage their devices. In some cases, it's professional services engagements. In others, it's where we're providing a completely managed service like a telco would for remote site connectivity. I work in both network and security stuff. We do large managed service environments. 
And presently, we're managing about 2,600 remote CPEs in a variety of different environments. The tooling that I'm presenting today is fairly Cisco focused, at least in the code that I've written. However, the principles behind it are things that would apply to really any platform. And if you're fortunate enough to be using another platform that gives you XML interfaces or more advanced methods of retrieving configuration and operational state, the same stuff can be carried into those environments as well.
On the state of our industry, it sucks. The networking industry is very much, if you think about where system administration was 10 or 15 years ago, where people were very happy to manually craft each server and hone each individual element of its configuration. That's really changed a lot. And even in more recent times, DevOps, tight integration between different groups, all this stuff has really changed a lot. And what I've found is that in the network industry, those changes haven't really happened so much and haven't happened in a structured way. There's a lot of manual processes involved. I know that there's a very major telco up in Canada where I work, where they actually build the configuration for devices with their engineering teams using Notepad, because the engineering teams don't have access to the network devices. And then field technicians paste those configurations into the devices in the field, but they don't actually know how to work on network devices. And hopefully they report back if errors occur during the device configuration process. Now, hopefully no one here is in a situation that bad, but there's certainly a lot of room for improvement. And when I look at the things that people are doing to improve the situation, you have a lot of vendor proprietary tooling. Each of the different vendors has their kind of suite for managing or attempting to manage their devices. And you also have commercial tools that are also quite expensive. 
And often, suffice to say, I've seen some very expensive implementations of SSH in a for loop. So I feel that there's a lot of room for evolution in this space. And I think that the open switch stuff that we're starting to see from the Open Compute Project, Cumulus Networks, things like that, is, I'd like to think, the beginning of a sea change like the transition from traditional Unix environments to Linux was. And hopefully we'll be able to really improve things in the way that other parts of the IT industry have changed.
In the configuration management space, and I imagine this probably applies to most any part of configuration management, not just for network devices, I first am looking to identify specific issues. So we might see VPN flapping, where specific devices are having a problem and seem to be disconnecting or reconnecting to the network. Doesn't really seem to have an operational impact, but at some point it will. And you want to understand how big of an issue is this, how common of an issue is this. So I'm identifying either issues in the environment or new projects that need to be delivered for a customer. And then we're generating reports that allow us to sort of examine the information about the systems and see how serious a given problem is, to allow us to kind of prioritize specific changes that need to be made, what's more and less important or pressing. Finally, the tooling that I've developed allows us to then make those changes to the devices. And in my environment with large numbers of remote devices in different field sites, I don't have all of my devices operational at any given time. Something's being serviced, there's an outage in a given area. So it's important that the process that we use to make these changes is able to handle a device not being at the current reference, examining the state of the device and determining what changes need to be made to bring it to the desired configuration. 
Some changes as well are gonna be dependent on operational state. For instance, if you want to make a change to the cellular internet connection on a device, but that's the active internet connection and the primary internet connection isn't present, that's not a change you wanna make at that time, especially if the consequence of it is losing the device.
To achieve these goals, I've used a number of different approaches. Today I'm speaking about the Trigger-based approach that I'm using now, but I've used a bunch of other tools and probably maybe some of you have as well. So RANCID, which I suspect is familiar to a lot of people in terms of a tool to collect network device configuration and report on changes in it, also has the ability to do custom scripts using clogin, its tool for connecting to devices. So I started on that first migration project I spoke about before, writing scripts in Tcl and pushing them out using clogin. Tcl is not the greatest language ever. Two little pet peeves I discovered during that project. If you put a comment that includes a parenthesis, that will still be interpreted by the parser, but it does at least give you a warning. If you do like an if, else if, else construct, but you typo the else-if keyword and don't spell it in the specific way Tcl likes (elseif), which is different from Perl's or Bash's, it'll tell you that else is an invalid command because, I think, it might have interpreted the typo of else if as an anonymous function or something. Anyway, it's really bad. I transitioned from that to using Perl and I'd move all the logic out of the Tcl script. So I didn't need to do logic in Tcl. I'd get state from a device, I'd parse that state, I'd decide on a method of changing its configuration. I'd generate a Tcl script that I'd then run with clogin. So it got rid of a lot of the pain of Tcl, but it wasn't that much. It was a big improvement, but there was still much more ground to be gained. 
I've cut back on abusing Perl and switched over to Python, which has a much more structured approach. I know it's possible to write clean Perl, but I can't write clean Python, so it's not gonna happen with Perl. And then most recently, I discovered Trigger, which is a framework for interacting with network devices. It provides a fairly elegant interface to connect to devices, but it is very much a framework. It's something that I've built tooling on and that other tooling can be built on, either directly on Trigger or on the stuff I'll show you here today. And that has really improved the methods that I'm using and able to, yeah.
So in terms of the steps that we have in the process of making a change to a network configuration, first, I'm gonna talk a bit about the approach that I use within the code that I've got, which is generalized whether you're doing reporting or making changes to the configuration. Then I'll talk a bit about the specific implementation of normalization. It goes a bit into detail on the Python code, but I think it's important to have an understanding of how that framework has been built. Then reporting, which leans heavily on the normalization approach, and then we'll talk about customization. And I've built out in the presentation a concrete example of a new type of reporting that I did just for this talk to show how easy it is to take the work I've done in understanding how these things work and carry it into customization for your own environment, which is a little more straightforward than the groundwork that I've done. And finally, we'll talk a bit about some improvements that I'd like to make to the tool. I left out something important about Trigger before: one of the big advantages compared to clogin is that it is able to connect to multiple devices simultaneously. 
And because of that, if you wanna process stuff for 1,000 devices, it's a lot faster when you're doing them 10 at a time or 20 at a time as opposed to connecting to them in serial, especially when some of them are over cellular connections or are not online and you're waiting for them to time out because they're not responding.
Specifically, the approach that I've taken is to first collect the configuration and device state. So we're looking at what the running config is on the device, what its operational parameters are, then to parse and structure that information so that we have a clean representation, using objects and dictionaries, of the device state, which we can then analyze and determine if there's changes that need to be made to the config. If changes are required for a device, then you make those changes. I also store all of the information I collect from the device into a repository of device information. It's useful when you're doing reporting because you can do reporting without connecting to the devices, so you can reference that stored state information if you wanna see how many devices have this specific config in place. You have the information local and you don't need to go out to the devices again.
Inside the example code, we have four critical files. We have router.py, which has the kind of core logic to this system. There's normalize.py, which is a script which actually will perform a normalization on a device, report, which is a reporting script, and I have a device list with three routers with highly traditional names. So if we look at how that, is that, okay, good. A previous presentation I saw in this room, a few hours ago, the text was illegible. I think this should be pretty okay, but if there is a problem, please let me know. So I ran normalize. It prompted to say, do you wanna run all sites, because you can select a specific subset of sites? 
I went forward with it, so I'm processing routers one, two, and three. One of them is down, couldn't ping it, wasn't processed, and one of them didn't have the trigger ACL present on the device. And that ACL that we're validating is example code, permit 1.1.1.1. A lot of the stuff that I've developed is very specific to the device configurations in my environment. So what I have here is a very generic example. I'll come back to that a bit when we talk about the improvements though, because I think there are more common things that different people in this room might share and benefit from that could be done in a community way, even if a lot of things are gonna be specific to your own environment.
So at the heart of the system with Trigger, which is based on the Twisted framework for device interaction, I have callback processing. And the callbacks, the way that they work is we take the list of devices and we get the details about them. The returned information from that is then processed by a script that validates the device's state. And then finally, we initiate the normalization of the devices. So a device that needs a change is going to then build out the change to the configuration. If we, oops, just a second. Okay, so I called out the critical elements of the function. So in normalize.py we have the get router details, which is an instance of a Trigger Commando object. It's got a list of commands that'll be executed, and each device is checked to see if the device is available: it tries to ping it, and if it's available, then it's returned for processing. And within the common router.py code we have the actual show commands that'll be run on the device. So we're doing a show run, include ip access-list, and we're getting a show ver showing us the version of the device. 
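As a rough illustration of the collection step described above, the following sketch skips unreachable devices and then gathers command output from several devices at once. It deliberately does not use Trigger or Twisted: a standard-library thread pool stands in for Trigger's asynchronous callbacks. The names `is_reachable`, `run_commands`, and `collect`, the router names, the canned outputs, and the exact command strings are all assumptions made for illustration, not the code from the talk.

```python
from concurrent.futures import ThreadPoolExecutor

# Commands as spoken in the talk; the exact CLI syntax is an assumption.
SHOW_COMMANDS = ["show run | include ip access-list", "show version"]

# Canned data standing in for real devices (hypothetical names and output).
FAKE_DEVICES = {
    "router1": {"up": True, "output": {
        "show run | include ip access-list": "ip access-list standard trigger-test-1",
        "show version": "Cisco IOS Software, Version 15.1(4)M4",
    }},
    "router2": {"up": True, "output": {
        "show run | include ip access-list": "",
        "show version": "Cisco IOS Software, Version 15.1(4)M4",
    }},
    "router3": {"up": False, "output": {}},
}


def is_reachable(name):
    """Placeholder for a ping check; here it just reads the canned data."""
    return FAKE_DEVICES[name]["up"]


def run_commands(name, commands):
    """Placeholder for an SSH session; returns {command: output}."""
    outputs = FAKE_DEVICES[name]["output"]
    return {cmd: outputs[cmd] for cmd in commands}


def collect(devices, commands=SHOW_COMMANDS, workers=10):
    """Return ({device: {command: output}}, [skipped devices])."""
    reachable = [d for d in devices if is_reachable(d)]
    skipped = [d for d in devices if d not in reachable]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = pool.map(lambda d: run_commands(d, commands), reachable)
        return dict(zip(reachable, outputs)), skipped
```

Calling `collect(["router1", "router2", "router3"])` with the canned data returns results for the two reachable routers and reports `router3` as skipped, which mirrors the demo where one router could not be pinged.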
When we then take that information back to parse it, each device that was processed and the result set that was returned is run through the validate function. And the validate function is first updating information on when the device was last contacted, which is used in the reporting. And we call the validate ACL function, which goes through the results of the show run ip access-list, splits that out and uses a simple regex to determine if it's a standard or extended access list and to get the name of the access list, which is added into a list of ACL objects known on the device. We then have the functionality that normalizes the configuration. So for each device we call this normalize function, which is down at the bottom in router.py. We look to see if trigger test one is present in the list of ACLs that were defined on the device and if it isn't present, we set a variable that tells us that the device does need to have its configuration changed, and append to the list of commands the changes that need to be made to the configuration. So you have a bunch of different checks, and those different checks will see if a change is required to the device or not, and all those things are built together into a list of commands to be executed on the device. When we actually go to make the changes to the device, we just go through the list of devices returned from that function and if they do need to have their configuration changed, then the commands that were built for that device are pulled out and used as the commands to be executed on the device. And then finally we validate the return from that to make sure that the write mem that's executed at the end succeeded. On that thousand device migration project, we ran into a few machines that, when you save their config, reported that the flash was bad, which is always interesting, and I wonder how long it had been the case. Most of them were fixed with reseats, but not all of them. 
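The validate and normalize steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from router.py: the function names, the ACL name `trigger-test-1`, the `permit 1.1.1.1` entry, and the exact command wrapper are hypothetical, and the `[OK]` check reflects how IOS typically reports a successful save rather than anything confirmed in the talk.

```python
import re

# Matches lines such as "ip access-list extended NAME" (assumed output format).
ACL_RE = re.compile(r"^ip access-list (standard|extended) (\S+)$")

REQUIRED_ACL = "trigger-test-1"  # hypothetical name for the example ACL


def parse_acls(output):
    """Return {acl_name: acl_type} from 'show run | include ip access-list'."""
    acls = {}
    for line in output.splitlines():
        match = ACL_RE.match(line.strip())
        if match:
            acls[match.group(2)] = match.group(1)
    return acls


def normalize(acls):
    """Return (needs_change, commands) needed to reach the desired state."""
    commands = []
    if REQUIRED_ACL not in acls:
        commands += ["ip access-list standard " + REQUIRED_ACL, "permit 1.1.1.1"]
    # Further checks would append their own commands here.
    if commands:
        commands = ["configure terminal"] + commands + ["end", "write memory"]
    return bool(commands), commands


def save_succeeded(output):
    """Assumed success marker: IOS usually prints [OK] after a good save."""
    return "[OK]" in output
```

Because `normalize` works from the observed state rather than from a history of what was pushed, a device that was offline during an earlier run simply gets its commands generated the next time it is reachable, and a device already in the desired state gets an empty list.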
Finally, we store the state of the device into, in my case, I'm just doing it with a JSON object. I take the list of objects and I represent it as a JSON object. If you had a larger number of devices or you're doing something more advanced, a better storage mechanism might be required, but it works quite well for the scale of devices I work on and would extend up to tens of thousands of devices fairly easily, admittedly perhaps not with the smallest file. For reporting, the output is fairly simple. You know, you run report, it automatically will use all devices and this report just shows the device, when the device was contacted and the firmware version running on the device. And then I've run it a second time where I've selected a specific device to be accessed and you see in the report, the access time for that device has changed. So when you've got a large field of devices, you're not gonna reach all of them every time and knowing when information is from is extremely important and understanding its value or deciding how you wanna act on it. So this slide should have come before the previous one and it talks about that reporting process. Okay, so this is actually the, this is the overview of how that actual code works. So we load the state information that's present right now, we identify the devices that require updating. We connect to the devices, we get updated information about them and we generate the report for them and you'll see that there's a lot of commonality between the normalization process that I reviewed before and the reporting process now. So we're using most of the same functions that were present in the core router.py code. We just don't do the normalization. So we're doing the same validations, it's using the same validation code and the same commands to collect information, but we don't need to make any changes to the device. 
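A minimal sketch of the stored-state and reporting idea described above, assuming a flat JSON file keyed by device name. The field names (`last_access`, `version`), the function names, and the file layout are hypothetical and chosen only to illustrate the point that a report can be produced from stored state without reconnecting to any device.

```python
import csv
import json
from datetime import datetime, timezone


def update_state(state, device, version, now=None):
    """Record what was just learned about a device, with a timestamp."""
    now = now or datetime.now(timezone.utc)
    state[device] = {
        "last_access": now.isoformat(timespec="seconds"),
        "version": version,
    }
    return state


def save_state(state, path):
    """Persist the whole state dictionary as one JSON document."""
    with open(path, "w") as handle:
        json.dump(state, handle, indent=2, sort_keys=True)


def load_state(path):
    """Load previously stored state; a missing file means no state yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


def write_report(state, stream):
    """Write device, last access time, and version as CSV rows."""
    writer = csv.writer(stream)
    writer.writerow(["device", "last_access", "version"])
    for name in sorted(state):
        writer.writerow([name, state[name]["last_access"], state[name]["version"]])
```

Only devices contacted in a given run have their `last_access` updated, so the report shows how old each row's information is, which is the property the talk stresses for large fields of intermittently reachable devices.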
So we have code to output a CSV which is pretty straightforward, goes through the different routers that are present and writes out when they were accessed, the device name and the version.
When we speak about adapting this to your own environment, there's a lot that you can do with this to take the framework that I've made and to implement specific things that you're interested in, things that need to be validated in your own environment. I suggest that you work first on creating small tests that are exploring individual and smaller elements of what you're interested in doing and then build from that over time. You can also bring in information from other data sources. So I do a lot of stuff where I'm pulling information from monitoring systems about historical device availability. You, in larger environments, probably already have decent repositories of information about your network environment and that all can be pulled together and in. So what I built as an example of how it's relatively easy to extend is something to parse which ACLs are applied to which interfaces in a Cisco or Cisco-like device. And this single slide actually shows the entire change that I made to do that. So it is a lot easier to add things on top of what I built than it was for me at least to build out that initial example. So I've added in a new show command that looks at the running config and gets the interface sections, and then inside the validate function that we're calling to validate device state, I've added in a call to a new function called validate interfaces, and inside there we create a dictionary of interfaces. I clear it at the start because I had some bad experiences with interfaces being removed where, because I didn't rewrite the entire object each time, it would cheerfully report really old information about interfaces that were no longer on the device. So that is important. And then in the config, I go through the show run interface sections. 
I pull out the interface names. When you find an interface name, it gets added to the interfaces and then we build for each interface, if it's got an inbound and an outbound ACL, those are sort of recorded into a dictionary underneath that for the specific interface. I've only done ACLs here, but obviously this could be extended to all sorts of different stuff, IP address configurations, QoS configurations. And this part of the validation is something that a lot of different people have the same needs for, as opposed to how you take this information and validate its relevancy to your own environment. So we have the output from it and I ran it against one device and we can see that the gigabit ethernet zero interface has an inbound ACL of test applied to it.
There is a lot of room for improvement in what I've done. One thing is I've done a manual job of building this interface parsing code, but first, a lot of it is stuff that could be applied to any device in any environment and it's something where there's more common code that could be created. And there's also a Python library called ciscoconfparse that will build a representation of a Cisco or Cisco-like device configuration kind of automatically. It doesn't have as much understanding of how it works as something else might, but it understands hierarchical command configurations and commands nested under each other, and it's supposed to do a pretty decent job. I haven't really played with it that much, but it is something that would be pretty cool. A much larger thing would be the development of a domain specific language. So right now, all of the stuff that I've done, you need to actually understand Python. 
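The hand-built interface parsing described earlier in this part of the talk might look roughly like the sketch below. It assumes IOS-style running-config text in which each `interface` line starts at column zero and its sub-commands are indented; the function name, regular expressions, and sample configuration are hypothetical and are not the code shown on the slide.

```python
import re

IFACE_RE = re.compile(r"^interface (\S+)")
ACL_RE = re.compile(r"^\s+ip access-group (\S+) (in|out)\s*$")


def parse_interface_acls(config):
    """Return {interface: {"in": acl, "out": acl}} for the ACLs applied."""
    # Built from scratch on every run so that interfaces removed from the
    # device do not linger in the result as stale data.
    interfaces = {}
    current = None
    for line in config.splitlines():
        match = IFACE_RE.match(line)
        if match:
            current = match.group(1)
            interfaces[current] = {}
            continue
        if not line.startswith(" "):
            current = None  # a non-indented line ends the interface section
            continue
        if current is not None:
            match = ACL_RE.match(line)
            if match:
                interfaces[current][match.group(2)] = match.group(1)
    return interfaces


SAMPLE = """interface GigabitEthernet0
 ip address 10.0.0.1 255.255.255.0
 ip access-group test in
!
interface Vlan1
 no ip address
"""
```

With the sample above, `parse_interface_acls(SAMPLE)` records an inbound ACL named `test` on `GigabitEthernet0` and an empty entry for `Vlan1`. Extending it to IP addresses or QoS policies would mean adding further patterns inside the same loop, which is the kind of shared parsing work a library such as ciscoconfparse aims to take over.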
And unfortunately, there's lots of people that know Python, there's lots of people that know networking, but the overlap between them isn't necessarily as strong as it could be, and there's not as many people that seem to be fluent in both, we could say. Bringing it to a domain specific language could make it easier for networking people with less programming ability to work on this type of system and to benefit from it. You also need to avoid creating a domain specific language more complex than Python, which, well, I think we've all seen examples where that could have gone better.
Finally, another big area for improvement would be more community involvement. So I was fortunate enough to be able to kind of contribute back, open source, portions of the work that I've done, and I'd like to think that there's other people that are interested in this subject that might also be able to work on this both for their own internal use and outside. There really doesn't seem to be much public work done in the network automation space, especially with open source tooling. I know that there's a lot of people doing stuff behind closed doors and that there's things that can't be shared with the outside, but I think there's also a lot of room for more improvement and more change, and I hope that this talk will maybe get a few people interested in it and that we can start to build more of a community around this and other related projects.
I have many people to thank for being here today. First and foremost, my wife, my kids, my family who support me and put up with me and other stuff. Jathan, who created the Trigger framework and was able to also make that an open source project and available to other people. Without people doing open source work, there's nothing for other people to build upon. Stack8, who's given me a great place to work and to develop my own skills and ability. 
My good friend, Henrik, who helped me with Python and actually didn't want credit here because of the quality of my Python, but I try, I try. Charlene, who really turned this from a much more bare-bones presentation into what it is now, the Trigger community on FreeNode and my friends in the Linux channel. Went through that a little. And if you'll wait for the microphone.
Why use Trigger and not use something like Ansible? And not which, I'm sorry? Ansible? Ansible. At the time that I did it, I don't think that Ansible had the support for the Cisco stuff. Also, I found it very, very hard to find anything on the subject. So when I was doing the clogin stuff and hating it, I was looking for people doing this kind of work and I eventually found Jathan's presentation from three SCaLEs ago that spoke about Trigger. I recently did learn about the Ansible Cisco integration stuff, like the bare SSH support stuff. I'm interested in checking it out a bit more and really, while I'm happy with the work that I've done with this, I'm also open to other approaches. I'm looking for the best possible ways to do this. I'm looking for the best ways to scale it. I'm looking for devices, things that are more cross-platform, and the more stuff that I can do with it, the better it is. I heard about some interesting stuff involving Salt and network device automation as well and I'm interested in checking out all that stuff. So while I'm presenting on Trigger today and I'm very happy with what I've done with it, it doesn't mean that we couldn't be talking about something else in a few years.
This is not a question, but I did some stuff with Trigger where it would log into devices and parse gigantic config files. And a problem in general with some of these devices is that there's no grammar for the, and you don't necessarily want to write a grammar, there's no codified grammar for things. 
So a lot of these things will actually dump XML and if that's available, I would definitely recommend using that versus trying to parse. So I know that Juniper devices do that. They have fairly strong XML support, as I understand it. Unfortunately, I haven't had much opportunity to work with them. Yeah, no, I'm just saying if it's available. So obviously something that gives you a structured form of the configuration of the device would be preferred. I know that Cisco worked on some stuff with onePK, which was supposed to be this kind of platform for automating their devices, and I went to a presentation in their offices in Montreal, and many people were there for the lunch. I was there because I was interested in what they actually had to say, somewhat to the astonishment of the presenters. What I found, however, was device support was very limited and on a lot of non-Nexus devices, it was also necessary for you to run T-Train releases. Did Cisco subsequently show? Oh, yeah, no, I'm not saying it, you know, like everybody does it. I'm just saying if it is available, it can save you a lot. And sometimes when you're parsing, you can go line by line, but in some cases, the configs are involved enough that you actually have to parse it, you know, into a parse tree and try to suss out, like this ACL is attached to this device, sort of stuff that, you know, is difficult to do without doing a full parse. And Jay, I think, could speak to this too, but I mean, you know, like they've done parsers that ended up being sort of Frankenparsers for multiple devices, but it's difficult to say, you know, is it correct for any particular device, right? In Trigger, you do have a method where, when you define the device, and I didn't really show the details of the CSV file, you also state what platform it is or what family of device it is, and that can be used to drive how you connect to the device, for devices that have like non-SSH support. 
And I'd imagine you could kind of feed that information back in in other ways as well. Yeah, yeah, no, I mean, there is parser support. It's just getting to the point where it was like, oh, this is a little different, oh, that's a little different. We'd actually talked about splitting it into, you know, very manufacturer-specific parsers. Anyway.
You mentioned that you, and I apologize, I missed part of the first part of the talk, so maybe you covered this, but you mentioned that you were looking at other open source options or you didn't find any, but I was wondering if you looked at any of the prior works such as NETCONF and YANG, which are standards-based protocols for manipulating configs on devices, or even, some of my colleagues are working on OpenConfig, which is an evolving standard right now, which is actually documented and available for driving device configuration in a vendor-agnostic manner. So unfortunately, I did not hear that question very well at all. And I think that the way the speakers are facing is much better for the audience than the presenter, but I did hear you mention NETCONF. Yeah, so I mentioned NETCONF and YANG, which are sort of prior art from like 10 years ago on sort of programmatically driving device configuration, but I was also mentioning OpenConfig, which is a new standards-based protocol that's being developed for device configuration in a vendor-agnostic way. I haven't heard about those before, but I'd be very interested to check them out. And maybe if you want to approach after, you could tell me a bit more about it. I'd welcome it. Any other questions?
I have a bunch of references, starting with this presentation, which should be online on the website shortly, and I've got a relatively short URL if you wanna grab it here. I've got a direct link that leads to the example code, which is in the GitHub Trigger repo, the documentation for Trigger, as well as the Trigger GitHub project page. 
For people that are doing, for people that don't know Python, but are interested in this subject, I pulled out one well-known Python guide, the Learn Python the Hard Way, which apparently is actually the easy way to learn Python. And it's supposed to be a very good approach to Python, especially for someone who doesn't necessarily come from a programming background. I think that if you do have programming ability, Python should be fairly easy to pick up if you don't already know it. I've got my website and my email address, and my employer, Stack 8. If this stuff sounds interesting to you, but implementing it in your own environment sounds hard, we could also talk about that. And finally, I got all the graphics for this presentation from a website called flaticon.com that has a lot of really nice, clean little glyphs, great for presentations, other stuff. Here's the licensing on the different files. Thank you very much for your time.