Hello and welcome, everybody. I appreciate that you're coming out for this talk, given that there are other great talks happening at the same time; really happy to see all of you. So, welcome to Visualizing Network Traffic of BOSH Releases. The subtitle is where I'm promising intuition engineering for everyone.

A few words about myself. Hi, I'm Marco, for all of you who don't know me. I'm a developer at SAP, and I'm the PM of the BOSH OpenStack CPI there. I've been working with BOSH and CF since sometime in 2014, so I've gotten some good exposure to the whole topic. My nick is voelzmo pretty much everywhere: on GitHub, on Twitter, on the Cloud Foundry Slack. If you're trying to reach out after the talk, please do.

Let's start with the boring stuff; here's a disclaimer for what you're about to see. This is in the SAP track, but this is not a product. This happened in my 10% time at SAP, where we get to play around and tinker with pretty much everything that we think is a good idea or that we want to try out. All of this is open source, and I have the links in the presentation; afterwards you can try it out, but please use at your own risk.

Now that the mood is at its peak, let's talk about intuition engineering. Who of you has heard the term before? Not an awful lot of people. Great. I first heard the term when I encountered a Netflix blog post in 2015, and it caught my attention, because I felt like intuition and engineering are almost antonyms. Is this supposed to go together? We pride ourselves on being engineers, right? We measure stuff, we base decisions on facts and figures, we don't rely on our gut feeling or intuition. This is what we are proud of. Typically that means our workplace looks similar to this: we've got graphs, we've got measurements, we've got stuff going on, because this is what we use to analyze things. All of this is great, right?

We need all of this. But I think we often make the mistake of thinking that this is the whole truth, that this is all we need to understand things. Intuition engineering starts from there, with a few simple ideas. First: if anything is best represented numerically, then we don't need to visualize it, because visualization is for humans. I don't need a fancy graph if a machine can do the job for me, right? There is no point in ruining a perfectly good system by adding a human in the loop who looks at graphs and does stuff. To paraphrase a quote from that blog post: anything that we can wrap in alerts or threshold boundaries should kick off some automated process.

Given that, we can think about what we do with the rest. Maybe there are things that are too complicated, where it's even impossible to create a heuristic or an alert threshold. Maybe we don't know enough about it yet; maybe we don't even know what it actually is. For all these kinds of things we do need a human in the loop, and we do want some visualization to be happening. That's one part of intuition engineering.

The second thing is this. We build, measure, learn; we automate all the things, because we are smart people and we are doing the right thing, right? Aren't we? My point here is that automation reduces the impact of human error for that specific task, because I don't have to type manual commands anymore; I can rely on a machine doing certain things. But automation does not mean that we don't need to understand why stuff is happening automatically, or what automation is going to do in a certain situation. I think this is the second point where intuition engineering can help us.

So here are the three points I'm trying to make in this presentation. First: measuring and alerting are great, and I think all of us spend lots and lots of time on them, and we do need them, but they are not enough. We need intuition engineering for two kinds of things: first, to understand our automation better, in which cases it behaves in which way; and second, to learn about cases that we didn't even think about measuring or alerting on. And I'm trying to point out a way to get there, with some tools and some ways to operate. This is what I'm trying to achieve in this talk.

Before we jump right into the actual tooling and how to get there, I want to start with this man. His name is Bernd Sieker. He is a systems engineer at the University of Bielefeld in Germany, and his field of research is formal verification of safety-critical systems. That's quite a mouthful; in reality it means he does research and consulting for aviation systems. Who of you took a plane to get here? Most of you, I guess. And did you feel safe? Did you feel like the pilot knew what she was doing in every situation? I certainly did. I don't want to ruin your feelings for your flight back. So, what Bernd
did last year is give an interesting talk, at the end of last year, at a conference based in Germany. The talk is in English and linked here, so you can actually watch it; I highly recommend it. It's called The Role of Automation Dependency in Aviation Accidents.

The interesting part is this: whenever something bad happens with an airplane, people like him get called in to figure out what led to that accident and what could have been done to prevent it from happening. One of the causes he finds regularly is that pilots, heavily trained pilots who have to undergo many hours in a simulator and many, many hours in the specific airplane they are piloting, make the wrong assumptions about automation systems and their behavior in certain situations. And this is when bad stuff happens. That talk is quite fascinating on its own.

I very naively thought of an autopilot like this: it's the press of a button, then the airplane does its thing, then you land, and that's it. It turns out it's a big set of different features and automation systems. One of them is automated thrust control, which decides, in an automated way, whether the airplane should be accelerating or braking. And one of the examples he gives is that automated thrust control is implemented slightly differently by the two big aircraft manufacturers, which is quite interesting if you happen to switch, for example, from an Airbus to a Boeing, or vice versa. Rumor has it, and he mentions this in his talk, that there is one phrase that every pilot has uttered at least once in their career, and that is: "What's it doing now?"

Right. So, when unexpected things happen, my point is that you need an intuitive understanding to react fast and appropriately. Those people are dealing with human lives; most of us are very lucky that our software systems are quite important, but not that important.

So I've got two questions for you, two questions I also asked myself. Do I spend as many hours with my distributed system as pilots spend in their airplanes, even as preparation? And is my distributed system as thoroughly tested as software in an airplane, meaning formally verified? Most likely no and no, right? But even if the answers were yes and yes, my point here is that his research shows this is not enough: bad things happen because you make the wrong assumptions about how automation behaves in certain cases.

All right, so how do we get a better understanding of how automation behaves in certain cases? This is the point I'm trying to make here: we need a tool to help us with that. We need to understand our complex system and get an intuition about it, and therefore we need a setup allowing us to practice, practice, practice, and to reproduce certain situations. In 2015 Netflix showed a glimpse of a tool they had internally, called Flux back then. These are two pictures taken from their blog post; I'm going to explain in a second what they are about. So, in the upper left corner,
you see a broad overview: the middle circle represents the internet, and the other three circles represent the different availability zones Netflix is installed in. The dots between them represent traffic. This is from a situation where one of the availability zones started to fail, the lower one. You see more dots connected to the other two AZs and fewer dots connected to the lower one, and you see a few dots flowing from the lower AZ to the other AZs: that means they migrated customers away from the failing AZ during the failure and moved them over to the other ones.

Imagine you're in that situation. You realize one of Amazon's AZs is failing, and your expectation is: all right, we are fine, because our automation will act in a certain way; it will move all customers from the failing AZ to the other two. And you can immediately take a qualitative look at your network traffic. You're not actually interested in what percentage of your customers are still in that AZ and these kinds of things; you want confirmation that automation behaves the way you expect it to behave. In the other picture you can see a drill-down into one of the AZs and see how the individual microservices are communicating with each other, so you can do a similar assessment at the drill-down level for that AZ.

Then last year they finally open sourced it, and this is when I thought: I want to have something for BOSH releases that is somehow comparable to what Netflix has here. So I started to build up this stack. I knew that I needed a network monitoring agent on all of the VMs I'm installing; that is Packetbeat here. I'm using a BOSH add-on, so we don't need any modifications to the BOSH releases we are trying to monitor. We need a database to actually store all the monitoring results; that is Elasticsearch in this case. From Elastic you also get a nice tool to query that database if you're interested in more details; that is Kibana. And of course we have Vizceral, the tool you just saw on the slide from Netflix, to visualize it all in real time.

Now I'm showing you two examples, a very small one and a bigger one, to see the benefit and how it behaves in the real world. The small example is almost the smallest one I could come up with. I built a very basic BOSH release that only provides a single HTTP endpoint called ping, and guess what it does: it returns a pong. I installed three nodes of that release, and they are pinging each other. I'm using BOSH Links to connect these three nodes, so I don't have to deal with individual IP addresses here. And I thought, well, I can most likely predict what the network graph is going to look like, and this is what I came up with. I've got three ping apps here in a triangle, talking to each other; that's what I designed them to do. I've got the BOSH director in the middle, and all of them are also talking to the director, because they have a BOSH agent on them sending heartbeats to the director regularly. And all of this happens on OpenStack, and we are building the OpenStack CPI, so I knew that all OpenStack VMs talk to the OpenStack metadata service in our case. All right, sounds fair.

However, when I looked at the actual graph, I had my first moment of "What's it doing now?", because even in that very small example it looked like this. Can you actually see the small lines between them? I hope so. We've got our triangle of ping apps over there; so far, so good. We've got the director in the middle, and all of them are talking to the director; that's great. We've got the metadata service in the lower left corner; I also predicted that. And then we have three nodes I'm not so sure about: one in the lower right corner and two in the upper right.

That's when I started to scratch my head and really tried to look into it with Kibana. I thought: okay, let's query for those IP addresses and see what kind of traffic we have. Every 15 minutes we have requests which look like DNS requests. Packetbeat can not only record raw TCP flows; it also has some knowledge about individual protocols, and DNS is one of them. As you can see in the rightmost column, it queries for some time host, something-something-dot-sap; that's our internal NTP server. I remember configuring that, so that makes sense, and I know that the BOSH stemcell actually syncs its date every 15 minutes using ntpdate. All right, that's fair. So let's get back to this: that means the node in the lower right corner is a DNS server, and the two in the upper right are NTP servers.

So even in the most ridiculous example, three nodes and just a single HTTP endpoint, I found a few surprises, at least to me. That shows that not everything is intuitively clear even if you don't have failures; even in the regular operation mode I found a few things that were interesting.

All right, that's a small thing, and we don't care about small things. I installed cf-deployment, which is the soon-to-be standard way to install CF. This is an entire Cloud Foundry installation, and it comes with two, and for some instances even three, AZs. I installed a single application and just curled it, and wanted to see what's happening. But to make it more interesting, let's add some turbulence.

If you don't know the turbulence release, done by Dmitriy, the PM of BOSH, you should totally check it out. It's basically BOSH's Chaos Monkey: you can kill individual nodes, you can shut down entire AZs, you can fill up disks, you can block network traffic, you can do anything you like. My intention was to shut down an entire availability zone and see what happens. I had a few assumptions: when an entire AZ fails, that shouldn't be an issue. I can still access my application, I can still work with the Cloud Foundry API, I can still push, I can still ask for the status, and eventually BOSH will repair everything; it will bring it back, and all is well afterwards.

So I did kill an entire AZ, and I had my second moment of "What's it doing now?", because I learned that in cf-deployment they are using MySQL, and MySQL is installed as a singleton in AZ 1. So when I shut down AZ 1 entirely, the answers were: Can I still access my app? Nope. Can I work with the API? Nope. Did BOSH bring it back? Yeah, and then it worked. Okay, so let's be fair and not actually shut down MySQL; let's pretend that in this case we're using some kind of clustered DB that handles this, or whatever.

All right, I'm scared of live demos, so here's a video for you; I recorded it up front. Before we start: this is what Cloud Foundry looks like when you record the network traffic. It's hard to make sense of all of it, but let's try it together. Here in the left corner is my jumpbox, from which I'm doing the requests. I've got huge requests going to the Diego cells and to the routers, and then on the right-hand side, that cell with the IP address, that's my application. So that makes sense, right?
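As an aside: the AZ-kill experiment described above is driven by submitting an incident description to the turbulence API. Here is a hedged sketch in Python of what building such an incident could look like. The field names ("Tasks", "Selector", and so on), the endpoint path, and the credentials are illustrative assumptions from memory, not a verified copy of the turbulence schema; check the turbulence release docs for your version.

```python
import json

def make_kill_az_incident(deployment: str, az: str) -> dict:
    """Build a turbulence-style incident that kills all instances in one AZ.

    NOTE: the field names below are illustrative and may not match the
    schema of your turbulence release version.
    """
    return {
        "Tasks": [{"Type": "Kill"}],
        "Selector": {
            "Deployment": {"Name": deployment},
            "AZ": {"Name": az},
        },
    }

incident = make_kill_az_incident("cf", "z1")
body = json.dumps(incident, indent=2)
print(body)

# Submitting it would be an authenticated POST to the turbulence API,
# roughly like this (host, port, path, and password are placeholders):
#
#   curl -k -u turbulence:PASSWORD \
#        -d "$BODY" https://TURBULENCE_HOST:8080/api/v1/incidents
```

Once such an incident is running, the output of "bosh instances" starts showing the z1 instances as failing, which is exactly what you see in the video.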
I've got a huge flow of traffic from the jumpbox, where I'm using curl, to the application. There are also lots of other things going on with etcd, which is a primary source of communication in cf-release as well, and a few other things happening. So I'm moving the etcd node up there, and then trying to kill a few things.

All right. Here, that's turbulence, and that's the JSON description of the stuff I'm going to shut down. You can see that everything is in z1, and it's executing a kill task right there. So "bosh instances" is starting to show a few instances as failing, and as we get back to the actual visualization, you can see a few nodes disappearing. And we can probably see a few other things, which we can see even better when we turn on network filtering; we'll do that in a second. The API nodes are still getting lots of traffic, so the curl still works; you can still see network traffic going to my application.

Because this is an awful lot of traffic, Vizceral has the ability to filter by traffic volume, so to say: you can filter so that only the connections with lots of network traffic are shown. Let's do that for a second. You can see here two primary sources of traffic. One is still the routing to my app. The other one is interesting: this is two etcd nodes talking to one IP address there in the lower right corner. So I was very interested in that and tried to figure out what happened, because in the regular mode of operation you don't really see that happening; etcd behaves very nicely and doesn't talk to any other nodes. When you turn the filtering back to include more and more TCP flows, you can see that the other primary sources of network traffic are, as expected, Doppler, so everybody is still shipping logs, and so on.

So what was it that etcd was actually doing there? When I fired up Kibana and did a few queries for the network traffic for that IP address, I realized that I had seen this IP address before, in my previous very small example: that's my DNS server. Both etcd nodes were firing about 60 DNS requests per second in order to find their third etcd node, the one I shut down. But why is my infrastructure DNS server being asked for CF-internal domains? Consul is supposed to do all of that. I don't know if you attended Neema's and Adrienne's talk about why getting rid of Consul is a good idea. I didn't have the time to look into that high number of DNS queries, but I imagine that in this case, for some reason, Consul doesn't actually give the right answer in time, or even at all, and that's why the next DNS server in my resolv.conf, which is my infrastructure DNS server, gets queried quite a lot here. Fortunately the application was still working and alive, so I could still work with it. But this is still an interesting thing to happen, and depending on what kind of infrastructure you're using, how big your installation is, and what kind of traffic you're expecting, something like this can already be quite a problem in your infrastructure.

Okay, so let's get back for a second to the assumptions I had in the beginning. It turns out I can still access my app as long as my database is still running, and the same applies to the API. BOSH brings it back, although it takes quite a long time for the Resurrector to bring up the entire AZ. And all is well afterwards; that's cool.

All right, as a summary: I tried to show that measuring and alerting alone are not enough to really understand your complex distributed system, and that we need some kind of intuition engineering to figure out what's going on, both in failure cases and in regular cases. Tuning and practice are the most important things here.
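The kind of analysis I did in Kibana, grouping captured DNS traffic by client and looking at the query rate, can be sketched offline too. This is a toy illustration, not part of the actual setup: the record layout is simplified from what Packetbeat actually ships (real documents use different field names), and the IPs and query names are made up.

```python
from collections import defaultdict

def dns_query_rates(flows, window_seconds):
    """Queries per second per client IP over the capture window."""
    counts = defaultdict(int)
    for flow in flows:
        counts[flow["client_ip"]] += 1
    return {ip: n / window_seconds for ip, n in counts.items()}

def noisy_clients(rates, threshold_qps):
    """Client IPs querying at or above the given rate, sorted."""
    return sorted(ip for ip, qps in rates.items() if qps >= threshold_qps)

# Toy capture over a 10-second window: two etcd nodes hammering the
# infrastructure DNS server for their missing peer, plus one quiet node.
flows = (
    [{"client_ip": "10.0.1.5", "query": "etcd-2.cf.internal"}] * 600
    + [{"client_ip": "10.0.1.6", "query": "etcd-2.cf.internal"}] * 600
    + [{"client_ip": "10.0.1.7", "query": "ntp.example."}] * 2
)

rates = dns_query_rates(flows, window_seconds=10)
print(noisy_clients(rates, threshold_qps=10))  # prints the two etcd nodes
```

With real Packetbeat data you would run the equivalent aggregation as an Elasticsearch query from Kibana, but the logic is the same: 600 queries in a 10-second window is the 60-requests-per-second pattern from the etcd nodes.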
So you need some kind of Chaos Monkey, and you need to be able to run it in production in order to really learn and get the most out of it. And please: rinse, repeat; rinse, repeat; do it.

All right, if you'd like to follow up, all of this is on GitHub. Feel free to hit me up on Twitter, on Slack, or right here at the conference; I'm still around. That is pretty much it. PS: we are hiring. Thanks! Any questions? We still have time.

[Audience question]

Okay, to understand the question: I think what you're asking is whether I tried to annotate the network links with some more metadata, to get some more information out of them. I didn't check it out in detail. Vizceral, I mean what we saw as the visualization, is really just nodes and edges, but the whole application also allows you to display a details panel about, I don't know, how many requests were successful, how many of them failed, or to annotate with any metadata that you see fit. So you can of course do this; I just didn't do it in detail. Other questions?

[Audience question: is there a performance impact that prevents this from running in production?]

Sure, there is a performance impact, depending on which method you're using for capturing packets, and there is a number of them. Most of them don't need a kernel module to record network traffic. If you install a kernel module, you can switch to PF_RING, which uses, for example, zero copy to get a copy of the packets, so the overhead is practically zero. However, if you want to use PF_RING in zero-copy mode, that will cost you quite an amount of money. So in this case, what I'm using here imposes quite some performance overhead that you need to consider. For us, we are going to run this on our performance test landscapes to figure out what the actual overhead is in terms of throughput and so on, and then make a few decisions.

[Audience question]

The question is whether the reason I'm seeing the DNS requests is that the application domain needs to be resolved. I don't think so. When I looked into the actual DNS requests that happened, it was the etcd cluster that tried to find its third node, and apparently Consul was not responding with the right answer, or not responding at all. That's why the infrastructure DNS was queried for the third etcd node, the one that was actually down.

I think we've got room for one more question. No questions? I'm still around, feel free to reach out. Thanks again.