Hello everybody. My name is Martin Moczka. I'm working as a senior network engineer at Red Hat, specifically in Red Hat IT. So this is not a presentation about code we are developing, but about a tool we built to help us manage our network devices, and about how we use our own products, for example Ansible, to manage our infrastructure from an operations point of view. What did we do before? At the beginning we did manual configuration, as everybody does. You know that you can't standardize with manual configuration, because everybody has a different approach to configuring devices. It's time consuming, because you need to type every single command, or you prepare a text file and push it. And you have a higher error rate, because it's easy to forget to configure something, I don't know, BPDU guard or spanning tree on an interface, and it can kick you back in the future when you have a loop or an outage. Then we moved to configuration generated by Python code. That was kind of the beginning of our automation, where we had a standard, at least at the beginning, because we had a vendor-agnostic, abstract language. We used it to define our infrastructure, and that was used to generate the configuration itself. It was faster, and there was a lower error rate, but there were still copy-paste issues, because you had to copy-paste into the console or something. And on the console you don't have such a big buffer, so you have to paste just pieces of the configuration at a time. You can forget something, and again it can lead to errors in the future. And on top of that we did incremental changes manually, so as time went on we were losing the standard again; everybody has a different approach to the configuration. So what are the reasons to automate infrastructure? The most important reason is that infrastructure is growing faster than the operations team. It happens a lot: you don't have headcount, but you need to grow, and there is a point where you can't even handle the infrastructure you already have.
You can't implement new things; you just need to keep the current state and hope for the best. Another thing is configuration standardization. It helps you with everything. If you have the configuration standardized, during troubleshooting you can just skip the configuration part and move on to the device itself or to the particular application. Faster deployment and patching: when you are deploying a new office or a new data center, you can just automate that work and focus more on the design side. The same goes for patching: if you figure something out during troubleshooting, you can easily apply it to other devices. Building a bigger ecosystem: during automation you need to build some documentation and some definition of the state of your infrastructure, and that can be used by other tools, self-services, monitoring, documentation, for example. And of course, because you can. We are lazy, and with automation you just get rid of the boring work. What structure did we choose for our tool? It's based on YAML files. The language we used before was something similar to XML, but it wasn't really XML, so it was not machine parsable. We had to write our own parser, and it was not scalable enough to use in different tools. So we moved to YAML, which is a standard and fits well with Ansible and Python. In the YAML files we define all the standards about our infrastructure. We define the relationships between devices within our campus and between campuses, so we can easily say this device is connected to that one, or with this device I'm connecting to that campus, for example. Then we use Python code which, I tend to say, acts as a fact generator, we call it the factor. It just generates facts about the campus, expanding them using our standards and calculating things like BGP peerings between nodes, point-to-point IP addresses, VLANs of those links, and so on. Then we use Jinja2 templates, which again fit pretty well with Ansible and Python.
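To give a flavor of what such a template can look like, here is a minimal, hypothetical Jinja2 sketch; the variable names and the Junos-style braces are my illustration, not the real template from the talk:

```jinja
{# Hypothetical sketch -- variable names are assumptions. #}
interfaces {
{% for iface in interfaces %}
    {{ iface.name }} {
        description "{{ iface.description }}";
{% if iface.mode == "access" %}
        unit 0 {
            family ethernet-switching {
                vlan members [ {{ iface.vlan }} ];
            }
        }
{% endif %}
    }
{% endfor %}
}
```

What is plain text stays plain text, and simple for loops, if conditions, and variables fill in the generated facts.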
Jinja2 is standardized, and I will explain later why we decided to use templates. Those templates are called by Ansible roles, and they basically create plaintext configuration files. And on top of that we are running Ansible, which calls the factor to generate those facts, loads them as variables, passes them to the templates, and then the templates just generate the configuration. And when that's done, Ansible figures out how to connect to the device using the facts, so you don't have to manually define the management IP address, because that's a generated fact and it knows how to connect. It tries multiple ways: it tries the in-band IP address, then the management IP address, and it can fall back to console access. A quick example of how the YAML file looks. To define the device, we define what the hardware is, what the vendor is, what the sequence number is (that's just for calculating things), and what type it is. VPC is taken from the Cisco notation, it's a virtual port channel; on Juniper it's MC-LAG, for example. Then BGP relationships, like "this is a route reflector server with this group name", gateway redundancy, so VRRP or HSRP, the serial number of the device, the peering MAC address if that needs to be defined, and neighbor relationships, VLANs, for example. So with this you define the end state of the device. You can use include files, and there you can define other things, like port security on interfaces, almost everything that we need at this moment in our infrastructure. Then, as I mentioned, we are using Jinja2 templates, so a quick example of the Jinja syntax. We are using just simple things like for loops, if conditions, and variables. You just put simple code there and generate the plain text file: what's plain text stays plain text, and you can add variables. An Ansible tasks example: this is what I was talking about, it's checking the in-band management IP address. If it's pingable, it will use that one and push the configuration over the in-band management IP address.
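As a hedged sketch of that reachability check (not the real role), the tasks could look something like this; `inband_ip` and `mgmt_ip` are assumed fact names, and the console fallback would follow the same pattern:

```yaml
# Hypothetical sketch of the fallback logic; variable names are assumed.
- name: Check whether the in-band management IP answers
  ansible.builtin.command: "ping -c 2 {{ inband_ip }}"
  register: inband_ping
  changed_when: false
  failed_when: false
  delegate_to: localhost

- name: Prefer the in-band IP, otherwise fall back to the management IP
  ansible.builtin.set_fact:
    ansible_host: "{{ inband_ip if inband_ping.rc == 0 else mgmt_ip }}"
```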
If the in-band address is not pingable, it will fall back to the management address, and then it will fall back to console access. So you don't have to care about whether the device is accessible or not, because it can be a brand new device and Ansible will figure it out for you. We have the same set of tasks for multiple vendors, but we are now mostly working on Juniper devices, so that's the most implemented one. Now, the challenges we faced while designing and using the tool. The biggest issue is that not all of us are developers. We are a team of network engineers; not all of us have a programming background. You could do most of this in Python, the templating and everything, and we had that before. It wasn't scaling, because nobody wanted to touch it while it was working. So then we decided that Python code would be used only to generate facts. You don't need everybody to participate in that; you need two or three designated people who know Python, because you don't have to touch that code that often. And we moved to Jinja2 templates, which are simpler, and when you look at a template it actually looks almost the same as the configuration we know from the devices, so it's easier to add something there. And what we did to make it even easier is that you have facts in the YAML file, and if you don't touch them in the factor, they are just passed through and appear as variables in Jinja2 too. So you can define something in the YAML file and use it at the end in the template. Another challenge is that we need to enforce standards and deal with manual changes, because, to be honest, network engineers tend to use the same approaches they have been using for years, and it's not easy to change their behavior. So first we were thinking about incremental changes.
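To make the factor idea concrete, here is a hedged sketch of the kind of fact expansion described earlier: given a supernet reserved for point-to-point links and a link's sequence number, derive the /31 and both endpoint addresses. The names and the one-/31-per-link scheme are my assumptions, not the real tool.

```python
import ipaddress

# Supernet reserved for point-to-point links (assumed, illustrative).
P2P_SUPERNET = ipaddress.ip_network("10.255.0.0/24")

def p2p_link_facts(link_seq: int) -> dict:
    """Return generated facts for the link_seq-th point-to-point link."""
    # Carve the supernet into /31s and take the link's slot.
    subnet = list(P2P_SUPERNET.subnets(new_prefix=31))[link_seq]
    a_side, b_side = subnet.hosts()  # a /31 has exactly two usable hosts
    return {
        "subnet": str(subnet),
        "a_ip": f"{a_side}/31",
        "b_ip": f"{b_side}/31",
    }

print(p2p_link_facts(0))  # facts for the first link in the campus
```

Generating facts like this was the easy part; the harder question was how to apply the resulting configuration, and we first considered incremental changes.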
But incremental changes are difficult to implement, because you need tools that know what's currently configured on the device and how to change it to reach the final state, and then you have conflicts over what you can delete or not and what you can add. Maybe an approach like configuring just one interface and not caring about the rest could work, but network devices are more complex than that. You want to change a VLAN on one interface, but what if you have two VLANs there? Or when you are adding a VLAN and one is already there, do you want to convert the port to a trunk, or remove and replace the existing VLAN? What about BPDU, what about STP, and so on? So we moved to overriding on every run: we generate the full configuration and replace what's on the device. That's easier to implement; we just define the end state, we don't care about manual changes, and we enforce that everybody uses the automation for changes. Of course, during outages and troubleshooting you can go to the device and do a manual change, but you need to go back and fill it into the YAML files so our automation tool knows about the changes you did manually. And it's more or less enforcement plus auditing, because you can keep the tool running in dry run mode, and if somebody did a manual change there will be a diff in the configuration, so it can alert on that, and you get pretty good auditing with alerting. I would like to show you a live demo. First let me introduce what the demo will be about. Let's say we have a lab team that needs to change an access port setup for one day, just one day. This is the current configuration; it's from Juniper. Let me describe a little bit what's there.
So we are configuring an interface range with two interfaces, and we are setting LACP on it, which is AE10, an aggregated interface, and we have native VLAN 200 on that and VLAN 100 tagged. Because we have MC-LAG, we have an MC-AE ID and an LACP admin key, and we have BPDU block, storm control, everything we need. And this is the desired configuration, without the port channel, because we are splitting it, that's what they need, splitting it into two standalone interfaces. This is the end state we want to achieve. With the manual approach it would look like this: first, the deletes. You need to delete the interface range, delete the interfaces, delete the BPDU block configuration. This is what I'm talking about: you need to know what you are deleting in order to set something new. That's why we decided to go for replacing everything in the automation. And then you need to set the new stuff. In our tool the change looks like this: you take the channel configuration and the VPC, which is MC-LAG in this case, away, and just keep the rest. With that you easily convert those two interfaces from a port channel into two standalone interfaces. As you can see, our tool is using ranges as well, so you can simplify the YAML structure. And now I would like to move to the live demo. This is the YAML structure we have, and if I go here and run it, this is a wrapper we are using around Ansible to simplify things even more for our network engineers, so they don't have to deal with Ansible syntax and with how you should run Ansible playbooks and so on. You just define what file to take and what device. Then with push, or dash P, you are saying push the configuration, which means push it but don't commit it on Juniper. Basically a dry run. You can see that we are now in the state where we have the port channel and nothing has changed, so everything is okay and it reports success. Then I remove that part and push it, so you can do a dry run on the change. You see, that's easy.
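For reference, the wrapper invocations from the demo would look roughly like this; the script name and file names are placeholders, only the -p / -c / -n flags come from the talk:

```shell
# Hypothetical wrapper invocations; only the flags match the talk.
./nettool campus.yaml lab-switch-01 -p       # push without commit: dry run, shows the diff
./nettool campus.yaml lab-switch-01 -c       # push and commit the change
./nettool campus.yaml new-switch-01 -n -c    # -n: brand-new device, provision via console
```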
I usually configure labs like this: just edit the files for 20 devices, go for a coffee, and when I get back it's all configured. Much better than going device by device and adding configuration manually, and more time for interesting stuff. And now we have the diff, everything we would otherwise have to do manually: removing the interface range, adding the new standalone interface, then another one, removing the aggregated interface, adding those standalone interfaces to the BPDU block, and adding them to VSTP. This was a dry run, so nothing has changed yet. I can prepare the configuration and do the dry run as many times as I need to be sure that my configuration is the desired one, and I can do it weeks before my maintenance. And during the maintenance I can actually focus on the work itself. I can push the configuration with dash C, which stands for commit, and hopefully it will commit it for me. That's a pretty simple task. I have one more demo. I'm not sure if it will work as I expect, but that's live demos for you. I want to show you how to configure a device that is completely in Amnesiac mode. So imagine a device which was just unpacked from the box and is connected to the console only. We defined the whole structure, you can see it was committed, and now if I do just a dry run it should say all okay, we changed the configuration as we needed. Now to the next demo. We have an unpacked, completely empty device. We just define in YAML how it should look when it's configured, and after that, hopefully, when I run our tool, it will deploy the new device from scratch. Let me just show you that it's really in Amnesiac mode, so you can trust me that it's completely empty. Maybe the console will react, yep. So, Amnesiac: when I log in without a password it lets me in, so it's completely empty. Now I just log out from the console, and with dash N I'm saying it's a new device, push and commit. Unfortunately I can't show you the full output, because in the diff there would be a hash of our password, so I will just show you the end state.
Yep, but still. So I know that Q&A is supposed to be at the end, but we can use the time in the meantime, and you can ask questions now if you want, or if you want me to show you something about the tooling. Yeah, sorry, the question is whether I plan to open source it and open the tool to the community. I was thinking about that, actually. It's not easy, so I think at some point we will open source maybe just part of it, because it's written just for us, and what I'm trying to share is the idea behind it, not the tool itself, because the tool is written just for our standards, and some of those are hardcoded. The Jinja2 templates, for example: when we write them, we are not using all the functions that are available on the platform, we are just focusing on what we actually use, to keep it simple, and you could open source such a thing, because it can be extended at some point. But the factor, for example, can't really be opened, because it has our own standards hardcoded. I was thinking about completely moving those standards out of the Python code and using the Python code just to read them from some YAML files. We are doing that for some things already: ACLs on our devices, for example, and all the configuration about the VPNs we are using, and stuff like that. Let me just have a quick look. "Failed when serial number is not in campus." Okay, awesome, meaning that the device has a different serial number. That's a safety feature, because when you are using console access only, the thing by which you can recognize the device is actually the serial number. So let's take that one, go to the definition file, and see if we can change it. Or if it's the same, it's the same. I am sorry, wrong console server. That's it. So let's fix that, because for console access we are defining the console server. Okay. One more time. I can let this run to see how it works.
So, whether we are planning to write some articles: maybe in the future. That's part of the reason why I'm standing here showing what we are doing, mostly focusing on the demo and on... okay, sorry. Focusing on the demo, on the difficulties we met, on the structure, and on how far we are going in enforcing the standards by replacing the whole configuration. Because when you look at the network modules in Ansible, for Juniper they are, I would say, a step ahead, because they allow you to replace the whole configuration, but for Cisco they allow you to push the configuration line by line, and that's not really how you should configure a network device. There are so many dependencies in the configuration that you should think about using the replace mode and pushing the whole configuration. With that you can get rid of the defaults that vendors put there. With a new firmware version they put in different defaults, which can affect you in the future again, and if you replace everything you don't care about those defaults, you just care about the end state.
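The replace-and-audit approach described earlier can be sketched in a few lines of Python: render the full desired end-state configuration, diff it against what the device is running, and treat any difference as drift to alert on. The configs below are toy stand-ins, not real Juniper syntax.

```python
import difflib

def config_drift(running: str, generated: str) -> list:
    """Unified diff between device state and desired end state.

    An empty list means the device matches the generated config;
    anything else is either a pending change or a manual change
    somebody forgot to fill back into the YAML files.
    """
    return list(difflib.unified_diff(
        running.splitlines(), generated.splitlines(),
        fromfile="device", tofile="generated", lineterm=""))

# Toy example: someone changed the VLAN by hand on the device.
running = "interface ge-0/0/1\n vlan 300\n"
generated = "interface ge-0/0/1\n vlan 200\n"

drift = config_drift(running, generated)
if drift:
    print("drift detected:")
    print("\n".join(drift))
```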
Yeah, I think we should definitely write something describing what we are doing, and we started by me giving this talk, so I hope we will continue that effort. I keep checking the state; it takes a lot of time. So yeah, the question is, if I understand correctly, about supporting multiple vendors. We are mostly focusing on Juniper, I will have a slide about that, but we are focusing on Juniper. We have multiple different platforms: non-ELS and ELS, which is for example EX4200 and EX4300, QFX, SRX, the MX platform. So we have almost all, I think actually all, of the Juniper platforms we are using already supported. We are now working on Cisco ASA, because managing ACLs is a pain point for us and we have a lot of requests like that, so we are focusing on Cisco ASA. We are on a good way, but something that is slowing us down is, again, the approach of replacing the whole configuration, which is not easy on ASA, so we are working on our own module. And we have sample configurations, sample roles, for Cumulus, for example, where we are testing whether our approach is vendor agnostic enough to take the same facts and just pass them to Cumulus, and we saw that yes, it can be done. I don't know if you know FRR, but it uses almost the same syntax as IOS on Cisco, so if we can do FRR on Cumulus, we can do Cisco as well. We saw that we can do multiple vendors, but we started with this last spring, so it's not even a year since we started developing everything around it. Okay, this is taking too much time, let's see what's happening. Pushing through the console is not easy, especially not in a live demo, because the console is kind of sensitive to all the timeouts and such, so let me just rerun that once again, sorry for that. If it fails again I will just keep moving. So yeah, I hope that answers your question on multiple platform support: we are trying our best, but we also need to manage the infrastructure, so we are taking it slow, and we were actually waiting to see if the Ansible network team would come up with some support for whole
configuration replace. It's not that easy, but, as I said, we are working on our own module. One interesting thing while the demo is running: you can see it's opening an SSH tunnel. That's because the Juniper module is capable of using a console server, but only via telnet, and that's not secure enough for us, because we are pushing plaintext configuration through that channel, and telnet is not good enough for that, as it's not encrypted. So what I am doing here is opening an SSH tunnel with ncat to the console server and exposing a telnet port locally, and the module connects to that port on my local machine to push the configuration via the console. It got the serial number, you can see that, so that was fine. That's the verification that you are accessing the correct device. And now let's see if it's doing something. I think it's giving up. I think I just picked the wrong device, because this was the one I was doing most tests on, and it's kind of sluggish on the console. Yep, it's stuck. I know how to work around it... actually, no, I don't. One last try, I'm sorry; that's live demos and console access for you. We are using console access instead of zero touch provisioning because we are not using the DHCP options that much, and most of our DHCP options are already taken by different devices, like PDUs and other stuff. That's why we fall back to managing via the console directly. But since it's not working as it should right now, let's take a look: we are doing something at least. Yep, it's doing something, but it's taking too long because it's the console, so let me just move on in my presentation to explain some of the things I was already talking about, and we will get back to the results if there are any. So, the conclusion about the tool, the structure, and everything we discussed today. Platform support: as I said, Juniper switches, routers, and firewalls are fully supported, at least the platforms that we are using, not all of them, but those we are using. Cisco ASA is in progress,
we have some of the templates almost ready, and the module has to be tested, so we are mostly waiting on the testing. Cisco IOS and the others: we are not using those platforms that much, so they are planned, but at a lower priority. And as I was saying about Cumulus, we have sample support for, for example, interface configuration, BGP peering, OSPF, and other common protocols. On the operational side, it can deploy a new device, which is something we will hopefully see in a few minutes, and do configuration changes via console and SSH. It can do any operational task: it can disable a port, enable a port, configure a VLAN, configure anything you can imagine that a modern network is using, including EVPN, which is quite new. How can it serve in the future? What we are hoping for is living documentation, because you are putting all the facts into the standardized YAML files, and those can serve as documentation, because everybody loves creating documentation; this is how you enforce people to create it. It can be used as a source of truth for monitoring, because it knows about all the interfaces and how they are configured, all the IP addresses it should focus on, and all the protocols and services running on the devices, so you can use that in monitoring. It can also serve, and this is already in progress, as a backend for self-service for other teams. For example, we have IT support teams that communicate directly with our customers, which are Red Hat employees. We are exposing part of the tool to them, so that first line support can configure simple things, like changing the VLAN on an access port, or the description, for example. And we are working with our labs team to expose another part for configuring labs, so they don't have to wait for us to react to a ticket, ask them for details, and configure the device. We are basically exposing everything in the labs to them, so they can change access port configuration on their own, and
everything is in git. We have versioning of the configuration, and we have auditing, because, for example, the labs team uses merge requests, so after they push we know what was changed and we can react to it. On Juniper we are heavily using commit confirmed, so we can tell the device to roll back the configuration in, I don't know, 10 minutes. If something breaks and you cut yourself off, it will roll back; if not, our tool goes back and confirms the configuration, and everything is fine. So let's look at the live demo side. Nope, I'm sorry, I can't show you configuring the new device, I don't know why it's stuck. I'm pretty sure that this device is the broken one, so I need to kill that. Unfortunately it's not working as it should, which is too bad, because I think that's the most interesting part. Now it's time for proper Q&A, if you have some additional questions. Yep, so, whether we used OpenConfig, or investigated the use of OpenConfig or YANG: we did. We decided to go our own way because of our standards, and to simplify it even more. We are adding a lot of stuff which is oriented just at our infrastructure, to simplify it as much as possible, and that's the reason why we chose to use plain YAML. Then, how we debug issues, with the console for example: it depends on the issue. On the console server we are using, for example, the connection runs in a screen session, so you can attach another session to the screen and see what the tool is actually typing to the console; it helps you see what actually broke. For other stuff we have verbose output as well, so we can use that. We are using our own callback plugin in Ansible, so it shows us, well, not more, maybe a little bit more, but a different structure, showing exactly what we want from the output. And the biggest advantage is that all the generated configurations and facts are stored as files, so even if the Ansible run fails, you can go back, look at those files, and figure out what's missing or what's incorrect.
That's mostly useful during the development part, where you need to check that a fact was processed correctly and appears correctly in the configuration, so we are using that a lot. Yeah, sorry, the question is what's the size of the infrastructure we are handling. Unfortunately I don't have pictures, but in numbers it's something around 62 campuses, some of those are DCs, where we have something around 2,000 devices. Sorry, logical devices; physically it's more, because multiple hardware devices can appear as one logical device. Are they all centrally addressable? Yes, all of them have to be addressable; all of them have at least one IP address you can access them on. All of our devices are routable, and we are using SSH to manage them all. Our tool uses a jump host, because we only allow access to our devices from a limited set of nodes; that's why we are using the jump host. I think we are done, so thank you.