Good morning guys, thank you for coming, I hope you're in the right session. A little bit of introduction: my name is Igor Donov, manager of infrastructure operations within networking; we cover network and systems operations. But I worked in the networking team within Red Hat as a network engineer for 10 years or so, so I've come from there, and I'm more or less managing the same team, with more or less the same experience. So I want to show our journey from both views: as a person using the tools and transforming the stuff, and also as a manager helping promote the tool and have it adopted. And with me I have Martin Murchka as well.

Hello, my name is Martin Murchka. I've worked as a network engineer in Red Hat for more than 5 years now, lately focusing on network automation and monitoring. And I'm here to, more or less, do the technical part of the talk.

So as Martin said, I will cover the journey and show where we are currently, then Martin will follow up with the demo and the capabilities of the tool and the platform we've developed. And there should be some time for Q&A as well. I'm just wondering how many of the folks here are actually network engineers, working a lot on the networking part, configuring and managing network devices. Okay. Any managers in the room? Okay, cool. More or less that's what I expected. So anyway, let's start. We have a nice slide about a road and a journey. These are some of the milestones, the steps or stages, that I think we've been through as a team in order to get to the place where we should be. And again, these are our milestones the way that we see them; it might not be the same for anyone else. But the whole point of this presentation for us was to show what we've been through, which might be helpful for other folks, other teams and other companies if they're going through the same transformation of networking within their team.
So as with any other journey, it usually starts with the manual stage, where you're doing everything by hand, which is a place where hopefully you don't want to be. It's probably, I'm assuming, the early stage of a personal career, or the early stage of a company. As a network engineer, you're usually tasked with deployments of, for example, a new data center, a new lab, new office infrastructure and whatnot. And that is a time drain, especially when you do things manually. Just generating the configuration usually means using some golden configuration, or copy-pasting from a previous device and trying to change it. And that's very error prone, because reusing a previous configuration can cause problems: different software versions, or a different role, say you need an edge switch versus an access switch or a core switch, stuff like that. Generating the configuration for multiple devices usually takes a lot of time, and once you try to deploy it, you run into a lot of errors. The deployment part is also usually manual: you copy the configuration you've generated and paste it directly via console access to the device. And again, that creates a lot of errors, copy-paste errors, configuration errors; you're hitting limitations of the out-of-band management if you're using an out-of-band console, like buffer errors and stuff like that. So the whole thing is cumbersome. It takes a lot of time that you really would rather not spend, and it's frustrating, because you're wasting a lot of time. And with more of those projects coming in, you become the blocker for the larger delivery of the project. And on top of that, you have no configuration management; you're constantly doing something manually.
Then the next step, to fix that and make sure you're delivering faster, is focusing on generating the configuration. Let's do that part faster, because that's the first milestone, the first part you want to focus on. To do that, and that's what we did, you have to have standards in place. Decide: how are we going to configure our devices? What does an edge switch look like? What does a core switch look like? A router? How do we configure Spanning Tree? How do we configure BGP? What are our security access lists? What are our authentication methods, and so on. Once you have those, you can easily develop a tool that generates a configuration for you: you put in a hostname, you put in what the switch is used for, and it generates the whole configuration. Then you've narrowed your work down to only the deployment part, which is, again, still usually manual. So you've reduced your toil a bit, and you're doing things faster, because people can easily generate a configuration. With that, you're covering a good portion of getting your tool used within your team, because if you're making something easier for someone, your team will start using it. You are also standardizing, which helps make your infrastructure scalable: if you're using a tool to standardize everything, then it's easily supportable, and you don't have special quirks here and there that only the local guy can fix; anyone can fix problems in any part of the infrastructure. And again, there is early adoption because people save time. With that, the next logical step is automated deployment. You've fixed configuration generation, but for deployment you still need to do it manually, copy and paste, with some of those same issues. That's the part where you're still slow.
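To give a rough feel for the kind of input such a generator can take (this is a hypothetical sketch, with invented field names, not the actual schema of the tool from the talk): you supply the hostname and the device's role, and the team standards fill in everything else.

```yaml
# Hypothetical generator input -- field names are illustrative only.
hostname: edge-sw-01
role: edge-switch        # picks the edge-switch standard template
site: brq-lab
uplinks:
  - to: core-rtr-01
    port: et-0/0/48
# Everything else (Spanning Tree, BGP, ACLs, AAA) comes from the
# standards behind the role, not from this file.
```

The point is that the per-device input stays tiny; the standards carry the rest.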
And that's where we made our out-of-band management platform supported in this tool. With that, without manual deployment, we can use the tool to actually deploy the configuration. We now do deployments in a matter of minutes rather than days or hours; we've shortened the time to deploy an office or a lab or a data center from days or weeks to hours or minutes. That gives another boost to people actually using the tool, because now it's a time saver for the rest of the team. You can have the perfect tool, but if no one uses it, it's a shame. That's where, as the manager, you're introducing why we're doing this and telling the folks it's a time saver for them. And the large benefit is that, in the background, we're getting things standardized, so it's easier for us to support, and all of that. With that, you have configuration management in the first sense: people are using it, but only for greenfield development, only for the new stuff, the new offices, the new data centers, the new labs. Whenever they need to change something there, they still do it manually. That's the disconnect we used to have: people using the tool for the initial deployments, but doing changes by hand. And that was, I think, for us the trickiest part: changing the mindset of the people. Hey guys, now that we have this great tool, let's use it for everything, let's use it for the continuous changes as well. And that was tricky because it really is changing people's mindset. At the manual step, you have a pure network engineer, a person who really likes configuring network devices.
You wake them up and they know how to configure BGP, routing, spanning tree. At the end of this step, you still have that person, but with an extra skill set: some Python, some scripting, enough knowledge to work on the platform that automates the stuff. So that person is contributing to the tool, and when we're adding new features they can actually work on them and help as well. It's a mindset shift for people, that they need to upgrade some of their skills, and from a manager's point of view that's the hardest part. It's also frustrating for people who don't use the tool, because they have a skills gap. We had to present a lot, introduce them to how to use the tool, help them pick up some of the skills like Python or Git, and make sure they don't feel threatened and can actually use the tool for everything. Once you've done this, you can do a lot more.

So back to the first slide, I just want to show where we currently are. We are somewhere between four and five, I would say, closer to five. We've made a lot of progress; in the last 12 months I think we've come from here to that place. We weren't that far along 12 months ago, so that's good. But again, it's a mindset change that we had to fight for as well: not just building the tool, but making sure that people are really using it. And this is the current status, for example. This is actually pretty accurate; we updated it yesterday. These are some of our vendors. Opengear is our out-of-band management, and as you can see, that's 100% covered. That's the part that helps us a lot to increase the speed of delivery. We focused largely on the Juniper devices, the switching and routing devices that we have, because they were the low-hanging fruit, the stuff we could get the biggest benefit from.
And we are more or less at 70% there, and we are currently closing that gap. On the Cisco side we have not focused a lot, but that's changing. One of the big things for any operations team is the firewall infrastructure, changing access lists and whatnot, and that's something we are currently developing a tool to help us manage. That will be a focus for us in the next year.

So what does self-healing mean? Self-healing means different things to different people. For us, it means using automation and monitoring to address alerts and fix operational issues. The way I see it, you have different protocols to address different problems, but some of them are limited, or the solution to a problem can be really expensive and proprietary. A good example is the first one, the switching loop. You have a protocol for switching loops, the Spanning Tree Protocol; it's pretty well known how to protect your switching environment with BPDU guard, root guard and whatnot. But in a given lab environment, where people are doing weird stuff because they need to, they're testing a lot of things, there are scenarios where loops just happen. And troubleshooting loops is not fun, I can tell you that. That's where this tool can help: the alerting notices there is high CPU on a switch, and it can then take steps to identify the root cause, usually finding the port that is causing it and, more or less, shutting it down. Another example is a dual-homed site. We have two links for an office, and there's packet loss. BGP doesn't see that; the loss isn't enough for the session to bounce or shut down, so you still have traffic flowing on the degraded link.
More or less, the alerting sees that there is packet loss and can make a logical decision: okay, the other link is operational, we can shut this one down, and then send a ticket or inform the NOC so they can follow up with the vendor. A similar example is security related: the automation and monitoring can see a DoS-like attack, check who is doing it, and make a logical decision about what to block and where. And this is not only about self-healing: having the full network infrastructure managed helps a lot with automating tedious stuff like software maintenance upgrades, especially in recent years where exploits come out more or less on a weekly or daily basis. Also, once you have everything managed, that gives you a source of truth for how your infrastructure is configured, and you can use that for asset management: you have the inventory, you have all the gear there, you know which device is running which software version and so on, so you can do a lot more with that data.

And what are the steps we have taken to get there? We haven't finished everything, and that's something we want to continue moving forward with. We have our network automation platform connected with Ansible Tower. So far, when someone wants to deploy something, it usually runs from their laptop or desktop, and we want to centralize that with Ansible Tower, allowing us to do a lot more than on a given laptop: checking whether there are changes, then doing deployments and other things.
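The dual-homed remediation described above could be sketched as an Ansible playbook along these lines. This is not the playbook from the talk; the variable names are invented, the interface-state check is deliberately naive, and it assumes the public junipernetworks.junos collection and that the alerting system passes in the affected device and interfaces.

```yaml
# Hypothetical remediation sketch -- not the actual implementation.
- name: Isolate a lossy link on a dual-homed site
  hosts: "{{ alert_device }}"          # injected by the alerting system (assumed)
  gather_facts: false
  tasks:
    - name: Check that the sibling uplink is up before touching anything
      junipernetworks.junos.junos_command:
        commands: "show interfaces {{ sibling_interface }} terse"
      register: sibling

    - name: Administratively disable the lossy interface
      junipernetworks.junos.junos_config:
        lines:
          - "set interfaces {{ lossy_interface }} disable"
      # naive check: only act if the sibling link reports up
      when: "'up' in sibling.stdout[0]"

    - name: Hand over to the NOC
      # placeholder -- in practice this would call a ticketing API
      ansible.builtin.debug:
        msg: "Disabled {{ lossy_interface }}; sibling healthy, notify NOC"
```

The key safety property is the guard condition: the automation only shuts a link down when it has verified the redundant path is operational.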
Also the network monitoring part. The old way of doing monitoring was: you put in new infrastructure, a switch somewhere, and then you need to manually add the switch to the two or three monitoring products that you have, which is tedious work; nobody likes that. When people are busy they usually forget it, and then you end up in a situation where you have something in production but it's not monitored, and it bites you in the end. That's what we fixed, and we'll show it in the demo: the tool connects with the network monitoring, so the moment you add a new device, the moment you deploy new infrastructure, it shows up in the network monitoring, and likewise it disappears if you remove it. The next step we need to do is the Event-Driven Automation stuff, which will connect all three together: our platform, our Ansible Tower and our network monitoring, to make sure that when there's an event, we can make that event actionable, using the tool and the standardized infrastructure to self-heal or do something smart. And then continuing to make sure the full infrastructure is standardized, which is probably going to take a lot of time. Currently, as we showed on the statistics page, we have gaps; more or less only the newly built infrastructure is fully managed, and that takes time. So when we proceed with this, we can do the self-healing only on a limited fraction of the infrastructure, and that will probably take quite some time; some more of the old stuff needs to die. But with the adoption, and constantly using the tool to develop and deploy new stuff, I think that number will just go up. And I've talked fast, so, Martin, go ahead.

So I will show you the demo, and especially the demo topology. You will see it's basically simulating a DC-like environment with all the protocols we
use there. It's running in a lab, just to be able to deploy it from scratch, but it's more or less the same. The two devices we are not touching are the core devices at the top of the picture, because that's an already existing environment and they serve more than just our lab, so we are not able to zeroise and re-deploy those; we would cause an outage. But all the stuff below them is, in this demo, completely zeroised: it's like a box you would unpack, rack, connect, and then deploy. So what we have here is a mixed MPLS and EVPN infrastructure, where we are using EVPN for layer 2 and layer 3 routing for the VLANs, extending the VLANs. We have no L2 spanning tree domain, more or less; it's all routed, all the links you can see are routed, and for that we use BGP route reflectors on the core, with all the clients connected there. In this case we are also using stateless firewalls, more or less for blacklisting. They are connected via MPLS; we extend all the VRFs from the distribution layer up to the firewalls and apply the logic there. On the core we don't have any of the VRFs that we have on the distribution layer, because the core is connected just to the underlay infrastructure.

So I will show you the demo recording. It's quite a lot of deployment, and I wanted to be sure that it works as it should, so we have it recorded. Let me just stop it right here and show you: you can see that the devices are in Amnesiac mode. I'm showing that on the console, going through all the devices, and all of them are Amnesiac; that's the Junos prompt when you unbox it. It's basically an empty switch, yes, without any configuration. I don't know if this is visible at the back; I will show it after the recording, because this is basically a static file, so I will go through the file with a bigger font. I didn't realize when recording that it might not
be visible for the guys in the back row. So what is this? It's a YAML abstraction layer that we use. It helps us define relationships between devices within the campus, abstracting multiple things. As I mentioned, MPLS: we just set that flag to true, and the standards in the background basically do the connection for us. For the relationships between devices, we basically say how a device is connected to the core layer, we define all the VLANs and stuff like that. And this is just scrolling down through the devices. You can see it's a lot of work up front before deploying a device, but still it's not as difficult as creating the whole configuration before you deploy; it's a lot of abstraction. One thing I would like to highlight, let me just move that to the other side, or I will show it here so it's better visible. So, what has been shown: MPLS, EVPN, we set that flag to true; we set telemetry to true to get the monitoring up and running. Here it's nicely visible how abstract it is: we basically define a BGP domain, but we don't define any peers, any neighbors, because the "factor" and the logic behind it just finds the other BGP peers within the campus and works out how they are connected together. It figures out all the IP addresses, because it has all the facts. So we just define "this is a client for this route reflector domain"; on the core we have a route reflector server, and that's it. We are heavily using includes within the campus files, because you can imagine a YAML file where you are defining ports for a device: it can be a stack of 10 members with 48 ports each, which would be 480 port definitions, and it can grow pretty fast. So we are heavily using includes. In an include you define, more or less, the VLANs with their specific VRFs. You can see that, even though we are using EVPN, we are not defining any VRRP IP address or anything like that; again, it happens in the background because we
have a standard for that. And I wanted to show you how including can actually help you simplify the work. Let's say we have the two firewalls in the picture, and the firewalls should be defined the same: the policies should be the same, the filters should be the same, the VRFs and tunnels, all of that should be the same on both firewalls. I can put it into one include file and just include it on the two devices in the network, or more of them, and when I am adding or removing an ACL or whatever, I do it on multiple devices at the same time. That's how including helps. And now, let me just, I'm skipping too fast, so yeah, let's just continue in the demo, scrolling through the same file. Now we will move to Ansible Tower itself; I will stop it right there. In Ansible Tower, for this use case, for the demo, I created a workflow template, similar to what we are using elsewhere as well, but specifically just for the devices that you have seen on the topology. The workflow template consists of two job templates: basically "deploy IT lab" and then "update monitoring". When we launch it, it takes a little bit of time to run; it's deploying over the console, which is kind of slow. I am going to show you one console, because, as Igor mentioned, we are using Opengear, and when you connect there on the console interface, you can see what the other party is pushing; it's basically using a screen session for that console. So I am going to show you part of the communication between Ansible Tower and the device itself. Now we can see it's verifying: first it verifies the serial number, so it knows we are pushing to the correct device, because the cabling can be messed up and you don't want to reconfigure a different device on the network. Then it moves to the deployment, and it basically, without us touching anything, pushes all the configuration there. Of course I am not showing you the full configuration, because there are issues
of passwords and stuff like that in there. I will move to a different screen in a moment, so you can see that it is actually pushing something. Just to refresh the Ansible Tower job and show you how it looks there: it's basically just a set of tasks being executed, some of them checking the serial number, opening SSH tunnels to the console server. Here you can see that we are already getting a diff, and in a moment it should turn green, meaning the job was executed properly and successfully, and with that we will have all six devices deployed. Then we move to updating the monitoring based on that, and as you've seen in the campus file, we just set one attribute to true and that's it for the monitoring: it figures out all the stuff in the background again. That's what you get if you go all-in on automation, and especially if you have a single source of truth; as Igor mentioned, you can then use it for multiple things. As the automation knows all the details about the device, it can pass them easily to the monitoring, and the monitoring then knows exactly what should be monitored according to our standard: all the point-to-point links between devices; BGP, if it's running there, it knows what IP addresses to check; link utilization on external links like WAN connectivity; whatever else it knows about. This is a running campus, and I'm showing here that there is no other campus in the monitoring yet, because the BRQ lab is going to be deployed in a moment. What you have seen is basically the testing environment, another development environment we have. And you can see that there was a commit success, so the configuration was committed, which is how it's done on Junipers. Pushing the monitoring takes a little bit more time, because it's going through the files and creating all the configuration files on the server. In the meantime, I'm connected to one of the devices, and as you can see, BGP has been established for more or less
two minutes already, exchanging BGP EVPN information, L3VPN for MPLS; all the VRFs that we have are appearing there, some of the point-to-point IP addresses and stuff like that. And now you can see that we already have some MAC addresses on the VTEP, shared by EVPN. So that's a more or less complex infrastructure, running all these protocols, BGP, EVPN and MPLS together in one campus. It takes a lot of time to deploy that manually, and we have done it in less than six minutes in this case. And here you can see that on the monitoring side it basically created six files, six configuration files, for the Juniper SNMP monitoring. Well, it's SNMP for now; we're slowly moving to gRPC, which is faster, so you can go below a one-minute interval when getting metrics from network devices, and get even more than SNMP offers, to be even faster at recognizing an issue. And now it's pushing some other configuration, or more or less validating that it's there, and in a moment we will just restart the monitoring service. I think I paused the recording for a little bit, because it takes one or two minutes to get some information into the monitoring itself. Then on the right side, as you can see, the system info: we collect that every 10 minutes, so you can see it appears there. Let's wait for the job to finish. Usually when you do automation like this, you just execute the job and you can go drink coffee or smoke, based on preference; it takes a little bit of time, especially restarting the monitoring service. I'm showing here that the other devices were deployed as well. This one had a stuck console; it happens from time to time that you need to log out when using Ansible Tower. When we were using the CLI it didn't happen, so that's something we need to figure out as well. I just unstuck it and logged out from the console, and it was basically immediately in a working state; well, it was in a working state before, just the console access was stuck.
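To give a concrete feel for the campus-file abstraction scrolled through in the demo, here is a hypothetical fragment; the key names are invented for illustration and are not the tool's actual schema:

```yaml
# Illustrative campus file -- keys are made up for the example.
campus: brq-lab
mpls: true            # the standards behind the flag wire up MPLS for us
telemetry: true       # flips the campus into monitoring automatically
bgp:
  route_reflector_domain: core-rr   # client side only: no peers listed,
                                    # the "factor" discovers them from facts
devices:
  dist-sw-01:
    role: distribution
    uplinks: [core-rtr-01, core-rtr-02]
    includes:
      - vlans-common.yml      # shared VLAN/VRF definitions
  fw-01:
    role: firewall
    includes:
      - firewall-common.yml   # same policies/filters on both firewalls
  fw-02:
    role: firewall
    includes:
      - firewall-common.yml
```

Adding or removing an ACL in the shared firewall-common.yml would then change both firewalls at once, which is exactly the point of the includes.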
In a moment the campus should appear in the monitoring; let's wait for the monitoring to notice it. In this demo we are using SNMP with one-minute intervals, so it takes around one minute for the monitoring to notice. And you can see that we already have some devices appearing in the monitoring, and when we pick one, you can see that it has already picked up some information. We're still waiting for things like interface utilization and interface errors, but we have at least the link information, so we know about the uplink, we know about the state, we know about the CPU and memory, we know about the BGP neighbors, that both neighbors are in established state, and we have a diagram of how it looked for the last hour. And now I will move on, so I can show you that it's collecting, and you can see that it collected the system information as well, including model, latency, software version and uptime; in a moment it will also collect the hostname and all that information about the device. So that's more or less how we deployed a DC-like environment consisting of six devices in, how long was that, more or less 10 or 11 minutes. Now it's your turn. Can you change? Yeah, so Q&A, guys, any questions?

[Audience] Has this actually rectified a real network problem so far?

Well, meaning in a self-healing way, in an automated way? Not yet. When we were writing the abstract for this presentation, we were hoping to be there by now, but we are missing the last piece; as I said, we need to connect the three parts, which is still something that we want to do. We have not done that, but we are very close to connecting the dots: as you've seen, we have the network monitoring, we have the Ansible Tower, we just need to connect them, see what the alert types are where we need to act, and test it. So far none, but it has helped us a lot when we actually have a problem, where we manually
use it in an automated way to fix problems. Like, when we need to change something across the large infrastructure, we can easily do that now. We are actually using it that way, but not in a self-healing way; we are missing the last step to connect the dots. We have the infrastructure, we have the monitoring, we have the Ansible Tower; we need to connect that stuff together.

[Audience] So you'll just run it on, you know, connect it to a network somehow to test it?

We won't try it in production. As Martin said, we have a lab, and we have environments where we can test that. We have the Ansible Tower up and running, we have the network monitoring; we just need to start alerting and then use that to build some logic. We have a lot of labs within Red Hat, so it won't be a problem to start testing that; we even have our own lab as IT, labs where we can play and test scenarios like this. And that's probably where automation and self-healing is scary, because you really can break stuff, so that's something we really would like to test before putting it in production. Any other questions?

[Inaudible audience question about team size] Well, we are a team of, I don't know, 15?
So the active developers, the folks that know a lot more about it, like Martin, we have probably 2-3, and then a lot of folks actually using it and trying to help and implement stuff. We started with one person and we are slowly growing that: people who are not just pure network engineers, because this tool will fall apart if people are not using it or not updating it. For example, if we implement a new vendor or a new platform or whatever, we need to figure out how that will plug in so the tool can continue to work. So the ratio is not great, but I think the adoption is good, and people actually have enough skills to find problems and fix them by themselves.

If I may add: when we were showing that slide, the adoption is pretty good, because on the journey chart we did the last 20 or 30% that you can see there in the last 2 months or so. So again, the adoption is great, and that's the critical part I mentioned: for people trying to do this, you really need managers to try to convince people of the why. And again, you are dealing with people; people are scared when they need to do something new, and it's kind of like, oh my god, I'm going to be out of a job. No, you're not going to be out of a job; this is a tool that can help make your job easier, help us as a team deliver more and faster. As in any IT organization, we get more and more infrastructure, but we don't get more and more people; it's not scalable, so we need to figure out ways to do it faster and smarter. And you are upgrading your skills, more or less towards SDN and codifying things. So that's the part of the journey where we were actually changing the mindset.

[Audience question about whether this runs on a controller] We use standard BGP on Juniper; it's not running on a controller, it's just running on the switches. This is a physical infrastructure, more or less. These are physical
devices by Juniper, Cisco or whatever, and we're just abstracting the configuration, templating and standardizing it. We're using the normal way a network engineer would configure a device, typing commands; we're just doing it in an automated and abstracted way.

[Audience question about the description files] Yeah, that's a question for me, I think. More or less, what we are using in the background, as you could see, are the YAML files that provide us with this abstraction layer and carry the facts about the infrastructure from the network engineers to the tool. Then we have the so-called "factor", which takes those files and extends those facts with additional logic: as I mentioned, finding the other peers within the BGP route reflector group, creating or adding point-to-point IP addresses, because we have all that standardized. Everything we used to keep in our heads, how to apply those standards, how to do that, has now moved into the factor, which takes those facts and extends them. That is then taken by the dynamic inventory of Ansible, and Ansible takes those facts and creates plaintext configuration files using Jinja2 templates that we created. Those configurations are then pushed to the device. And on every run, we basically replace the whole configuration, to make sure the configuration is standardized, that nothing manually added is there, and that it matches the source of truth.

[Audience] You said that you use Grafana for...? For what, sorry? For alerting. For alerting, okay. Do you want to answer that?
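The render-and-replace pipeline just described can be sketched as an Ansible playbook. This is an illustration, not the team's actual playbook: the template and file names are invented, and it assumes the public junipernetworks.junos collection for the push.

```yaml
# Sketch of the pipeline: extended facts in, full config replacement out.
- name: Render and fully replace device configurations
  hosts: all
  gather_facts: false
  tasks:
    - name: Render a plaintext config from the facts via a Jinja2 template
      ansible.builtin.template:
        src: junos-device.conf.j2          # hypothetical template name
        dest: "build/{{ inventory_hostname }}.conf"
      delegate_to: localhost

    - name: Push with a full override so nothing hand-added survives
      junipernetworks.junos.junos_config:
        src: "build/{{ inventory_hostname }}.conf"
        update: override                   # whole-config replacement
```

The `update: override` full replacement is what enforces the "matches the source of truth" property: any manual drift on the box disappears on the next run.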
So for alerting, I mean for creating the events, we use Kapacitor, which consumes stream data from Telegraf, which collects the metrics. On one side it creates events based on thresholds, but it also pushes back to the database all the SLI/SLO statistics that we have defined, so we can then visualize those things for management and business partners, to basically show them where we stand. Those alerts, or those events, are then pushed to the event management, which is more or less PagerDuty for us at this moment, and that then alerts us.

[Audience question about what is needed on site] Well, what we showed in the demo was completely empty devices, like you're unpacking them. All we need is local hands to connect the console, and then we push everything via console. The tool actually allows us to use multiple ways: for example, if we don't have console access for whatever reason, someone can just type in the management IP address, and via management we have an option to do the same. But the purpose is exactly that empty box: we get console access and we push everything. So for a new data center with 20 switches: they rack them, they connect the console, we push, and more or less we have everything up and running.

[Audience] But you need to know which switch should be connected to which port on the out-of-band management? I'll give an example: imagine you are setting up a whole new data center, and you have a network engineer who knows what platforms and what devices were chosen for that project. He defines in the campus file how it should be connected together, without taking care of serial numbers, that missing piece, but he defines all the stuff there. The tool then of course has the capability of exporting a wiring map, so you can give that wiring map to your local hands in the data center. They go and rack all the stuff, connect it how you defined it before, and they can fill into the tool all the serial numbers, because they know those, and they
can execute the job on their own, because it's just a matter of someone executing it; it was all defined before. The network engineer can then consume that infrastructure once it's fully deployed.

[Audience question about how the inventory works] It does the dynamic inventory: it discovers all the campuses, it calls the factor on all those campuses, and it consumes those facts, more or less. So the dynamic inventory just calls the factor, which sits somewhere else, and consumes all the facts in the end.

If I may, I will just add to the last question: with having all those facts in the inventory as hostvars, you have one huge advantage, the opportunity to use smart inventories in Ansible Tower and basically target jobs at devices. Let's say we have a security incident and you need to quickly implement some ACL to block it. You can tell Ansible Tower: just execute this job on all the edge firewalls in our network, and based on the facts it knows what those firewalls are and just updates them. Thank you, guys.
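That kind of fact-driven targeted run could look roughly like this; the group name, role fact and variable are all hypothetical, and it assumes a smart inventory (or group) built from the facts plus the junipernetworks.junos collection:

```yaml
# Sketch only: block a prefix on every edge firewall at once.
# Assumes a smart-inventory group populated from a "role" host fact,
# e.g. role == "edge-firewall".
- name: Emergency ACL push to all edge firewalls
  hosts: edge_firewalls          # hypothetical smart-inventory group
  gather_facts: false
  tasks:
    - name: Add the offending prefix to the blacklist filter
      junipernetworks.junos.junos_config:
        lines:
          - "set policy-options prefix-list blacklist {{ bad_prefix }}"
```

Because device membership comes from the same facts that drive configuration, the job automatically reaches new firewalls as they are deployed, with no inventory maintenance.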