Hi everyone, and welcome to this talk, "Really Crazy Container Troubleshooting Stories". My name is Gianluca, and I'm a core developer of the open-source system troubleshooting tool sysdig. Today I want to share with you a couple of container troubleshooting stories that all happened to me over the past year or so, and they all revolve around the performance and stability of containers. As you'll see, I'm going to use many observability tools for troubleshooting, so hopefully this will add some value to the tool belt of tools you use daily. These stories might be a little bit controversial, but I cherry-picked them especially for this occasion because they remind us that user space and low-level system components are very, very important, even if in the container ecosystem we keep building abstractions on top of what we already have.

The first story is called "container isolation gone wrong", and it's meant to show you the real cost and disadvantages of lightweight virtualization, which is obviously what containers use. I'll go rather quickly here because we already lost five minutes, but I want to start with the good parts of isolation. As you know, if you run two containers, by default each container is wrapped into its own PID, mount, and network namespaces and so on, so there's a pretty good level of isolation out of the box. You can also leverage the various cgroup parameters: with the CPU cgroup you can specify the maximum amount of CPU that one container can consume, so that if you run two containers that try to stress the CPU, one will not be able to use more than, say, 10 percent of the CPU. Pretty simple. You can do the same from the memory point of view: with the memory cgroup you can make a container consume no more than X amount of memory. You can see in this chart that if I do a linear memory allocation inside the container, once it reaches 512 megabytes, which is the limit I set on the cgroup, the process inside the container gets killed. And this is all well: this stuff is widely supported by the kernel, is incredibly stable, and is used as a basic building block by every container runtime.

But a while ago something happened. A customer came to us at Sysdig and said: we are using Kubernetes to orchestrate a heavily containerized application, we have different sorts of containers all running inside the cluster, and when Kubernetes pseudo-randomly decides to allocate two particular kinds of containers on the same machine, the performance of those two applications suffers. So the first thing we asked was: are you actually limiting the resources these containers can consume? And the customer said: yes, we are limiting them using CPU limits, memory limits, and I believe even block-I/O limits. So from the beginning this problem was quite interesting.
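As an aside, the kind of limits we're talking about can be reproduced with plain Docker flags, which set these same cgroup knobs under the hood. A minimal sketch, with illustrative values and a generic image:

    # CPU: quota/period = 10000/100000, i.e. at most 10% of one core
    docker run --rm --cpu-period=100000 --cpu-quota=10000 busybox \
        sh -c 'while :; do :; done'

    # Memory: hard-cap at 512 MB (no extra swap); filling a tmpfs is
    # charged to the memory cgroup, so this gets OOM-killed at the limit
    docker run --rm --memory=512m --memory-swap=512m --shm-size=1g busybox \
        sh -c 'dd if=/dev/zero of=/dev/shm/fill bs=1M'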
So we started to ask: what do these two containerized applications actually do? We started studying their behavior, and the two applications are these: one is called worker and one is called trasher. The worker is essentially a simple watchdog application that does nothing more than scan a directory for files every few seconds and react if the metadata of a file changes. Very simple. It also reports the duration of each watchdog check as a statsd metric, so that we can inspect it, centralize it, and alert on it. Then there's the other container, the trasher application, and this one does a few different things, but for the most part it does some asynchronous processing of files. Again, very, very simple. And the problem, once more, is that if these two containers get scheduled on the same host by Kubernetes, even though they are completely resource-constrained, the performance suffers.

So the first thing to ask is: what do we mean when we say the performance suffers? To get a high-level understanding of the problem, we can just chart the metric the worker exports, which measures the duration of the watchdog loop. You can see it charted here, and the pattern is very predictable: the watchdog always takes 250 milliseconds to execute, with extremely low variance. This is when the worker runs by itself on a host. When Kubernetes randomly decides to schedule the other container on the same host, I get this behavior: the performance of the worker application linearly degrades until it reaches 600 milliseconds per loop, which is more than a 2x overhead. And notice how the performance is exactly correlated with the lifecycle of the trasher container: these lines here mark when the trasher container is started, and these lines mark when it is stopped. The second I stop the other container, everything goes back to normal, back to the very predictable 250 milliseconds. So this is quite an interesting story.
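Before going deeper, to make the workloads concrete: here is roughly what a watchdog like worker could look like. A minimal shell sketch with illustrative paths and a hypothetical statsd endpoint — not the customer's actual code:

    #!/bin/sh
    # every few seconds, walk a directory tree (find itself issues
    # getdents + lstat on every entry) and time the whole loop
    while :; do
        start=$(date +%s%N)
        find /watched > /dev/null
        dur_ms=$(( ($(date +%s%N) - start) / 1000000 ))
        # ship the duration as a statsd timing metric over UDP
        echo "watchdog.loop.duration:${dur_ms}|ms" | nc -u -w1 127.0.0.1 8125
        sleep 5
    done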
At this point I wanted to do a little more troubleshooting, and what I like to do for more advanced troubleshooting is get a bit of an X-ray of my applications as they run, so I like to use the system call interface. I'll skip over this part, but it shows that these containers were actually consuming very few resources from the CPU, memory, and I/O points of view; unfortunately I don't have time to focus on it. At the system call level, I can get a pretty good idea of what a process in Linux is doing, even without knowing how to read its source code, or even without having its source code at all. There's a quote I really like from strace's man page that says that in some cases the strace output has proven to be more readable than the source, and this resonates really well with my experience.

So I'm going to start with a system call analysis of my two applications. I could use strace for this troubleshooting, but I'm actually going to use sysdig, because I'm a core developer of it and it's a little more container-friendly; for the most part, though, any system call sniffer will do. I start my system call sniffer specifying container.name=worker as a filter, and this is a pretty good snippet of the container's activity. As I said, worker is a watchdog process that opens a bunch of directories. You can see the worker starts by opening a directory, in this case /linux/.git, so a git repository, and it gets back a file descriptor — the file descriptor of the directory. Then it uses this directory file descriptor for another system call, getdents, get directory entries, which returns the list of children of this directory: all the sub-files and sub-directories at the immediate next level. For each element returned by getdents, the worker process issues an lstat system call on the child to get metadata information from the inode — file type, file size, a bunch of timestamps, and so on — which it uses for its processing. In fact, you can see it does an lstat on HEAD, and then an lstat on branches, and when the worker recognizes that a directory entry is itself a directory, it just recurses: you can see here that it immediately issues an openat on linux/.git/branches. This is just a typical full-depth scan of a directory, the kind you would write in any programming or scripting language. When it finishes, it prints on screen the number of files processed, 61,000, and their total size. Again, it's not doing anything fancy; it's just reading statistics via the lstat system call. And by comparing latencies, which is something you can do quite well at the system call level, you can see there are 250 milliseconds of difference between these two timestamps, which exactly matches the watchdog loop duration.

What about the trasher container? The trasher does somewhat similar activity, except that it opens a bunch of random files. Look here: all the files it opens are /tmp/<some random UUID>, and it does this a few times per second. Although, if you look closely, none of these openat calls actually succeeds: they all return an error, ENOENT, which is the kernel saying "this file doesn't exist". It does this a few times per second, then you can see a process context switch, the process sleeps for about a second and a half — as you can tell by comparing the two timestamps — and then it tries again with a new set of random file names. This logic is a little silly; the original application was doing something more interesting, but for our purposes it's fine: this behavior is enough to reproduce the whole performance issue.
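In fact, you can approximate the trasher's pattern with a few lines of shell. A sketch under the same assumptions — every lookup fails with ENOENT, a few times per second, forever:

    #!/bin/sh
    # repeatedly try to open random, non-existent /tmp paths, then nap
    while :; do
        for i in 1 2 3 4 5; do
            cat "/tmp/$(cat /proc/sys/kernel/random/uuid)" 2>/dev/null
        done
        sleep 1.5
    done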
So now that I have the system call analysis, I can dig further into the question: which part of my process is slowing down when these two containers are scheduled together? It could be user space, in which case I wouldn't see any system call latency increase, or it could come from the system, the kernel level. So if I take a view of my system call analysis before the performance issue and during the performance issue, and compare all the numbers, I can see whether this latency is centered on a specific set of system calls. I could do this by hand, but one of the reasons I use sysdig is that I can script this analysis with simple Lua scripts, and that's exactly what I did. Here you have the result of two scripts that print on screen where the delay is spent when the trasher is not there and when the trasher is there. If you compare the two, all the numbers look quite similar except — oops — the lstat system call: when the performance issue is there, lstat takes three seconds more in total to execute. And if I compare those extra three seconds with the duration of the capture, I get back roughly the 300 milliseconds of per-loop overhead that we saw in the chart. So clearly the worker process is slowing down because the lstat system call itself has become much slower. This was interesting.

There's one more thing I can troubleshoot here: is this additional three-second delay caused by one specific directory access — focused on one specific lstat invocation — or is it evenly spread across all 61,000 lstat system calls that the worker process does? To answer this I'm going to use another tool from the troubleshooting world, the spectrogram. If I take all these latencies and chart them by frequency, I get something like this, where I can see the frequency of my system calls bucketed by their delay. Each row in this chart is one second of activity of my watchdog worker, so you can see it's pretty impulsive: it does something, then goes to sleep for a few seconds. The red bands show where the latency is concentrated, and you can see that for the whole duration the vast majority of the lstat system calls were under one microsecond. If I compare this with the second case, when the performance issue is there, I get a very similar view, except notice how this time all the lstat system calls are skewed towards ten microseconds. Look at the difference: one microsecond versus ten microseconds. So we have almost a 10x overhead, evenly spread across all the lstat system calls. Very good.

At this point I had kind of exhausted what information I could gather from system call analysis, because the next question is: why are the lstat system calls slow, and which part of lstat is causing this issue?
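For reference, the two comparisons above don't strictly require custom Lua; stock sysdig chisels get you close. A sketch, with an illustrative filter:

    # total time spent per system call type, to diff before/during
    sudo sysdig -c topscalls_time proc.name=worker

    # live latency spectrogram, one row per second, bucketed by delay
    sudo sysdig -c spectrogram "proc.name=worker and evt.type=lstat"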
To answer that, I'm going to switch troubleshooting tools, and this time I'm going to use perf. I'm sure most of you know perf: it's essentially the official kernel observability tool, shipped with the Linux kernel itself, and it's by far the tool with the most hooks into every kernel subsystem. You can do a lot of very cool things with it: performance counters, dynamic tracing, dynamic probing, almost live debugging, and CPU profiling. In this instance I'm going to use perf for actual CPU profiling: I'll profile the activity of the worker process while the overhead is happening, and this will show me all the functions executed in kernel space on behalf of my process. This is how I invoke it: perf top, passing it the PID of the process in the global PID namespace. It shows me all the functions, and you can see that the overwhelming majority of the kernel time on behalf of the worker process is spent in the d_lookup function. So the lstat system call is calling this d_lookup function. Now, what the hell is d_lookup?

I can use perf once again: if you give perf enough debugging symbols, it will give you a lot of information. If you use perf probe on the d_lookup symbol, it will actually print you the kernel function, and this kernel function is quite simple, but I want to go over with you what it does. Every time the lstat system call needs to resolve a path, here's what happens in Linux: if you have a generic path like /tmp/foo, there's a bunch of work the kernel needs to do to ensure this path is valid. It has to start from the inode of /, navigate the on-disk inode to reach the directory entries it points to, and match each one until the name "tmp" is found; then repeat with the tmp inode and walk until it finds the entry named "foo". That's a lot of disk accesses and so on, so it makes a lot of sense to cache this information. And in fact d_lookup is exactly that function: given the directory entry of a parent and the string name of a child, it looks up in the directory entry cache whether this path has been seen before. The directory entry cache itself is implemented as one very big hash table — a classic hash table with an array bucketing items and, for each bucket, a list of all the directory entries that hash to the same value.

So why is the worker process slow during what is actually a cache lookup? Could it be that, since this is a hash table, the hash table is growing a lot and causing some sort of issue — for example increasing the number of hash collisions — and this shows up as a performance penalty in user space? I can verify this with another troubleshooting tool, slabtop. slabtop is a very simple tool that prints the size of the kernel slab caches, where all the kernel object pools are allocated. And you can see that when the trasher is running, the dentry cache — the directory entry cache — is 20 gigabytes in size, with 102 million directory entries, which is far bigger than all the files I actually have on disk in this file system.
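These are roughly the commands from this step, with an illustrative PID; perf probe needs kernel debug symbols installed:

    # CPU-profile the kernel functions running on behalf of the worker
    sudo perf top -p 4242

    # print the annotated source of the hot function
    sudo perf probe -L d_lookup

    # dump kernel slab caches sorted by total cache size; watch "dentry"
    sudo slabtop -s c -o | head -20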
I have never operated on a system with 100 million files. And if I look without the trasher — when the performance issue is not there, when the other container is not running — the dentry cache is just 14 megabytes. So these 20 gigabytes are what is causing the average hash table lookup to become slower because of collisions.

So how the hell is it possible that my trasher container is affecting this hash table so much? The reason is back here, in the first analysis: you can see the trasher process tries to open a bunch of files that don't exist. But since the kernel is very smart, it does some sort of negative caching: even if a file doesn't exist, it saves into its dentry cache the information that it didn't exist, so that the next time you ask for it, it can say "no, I don't have it" without doing any disk access. Unfortunately, in this case, the dentry cache is a singleton — it is not virtualized per container. Because it's a singleton, the trasher process is able to pollute it, and as a byproduct the worker container suffers, because the worker container needs to access the dentry cache too, and so it runs into all these issues.

Two very important things to note. First, using all this memory for the dentry cache is, in itself, fine: this is idle memory, memory that has not been claimed by any user space process, so the kernel is putting it to good use. Second, this is not an I/O contention issue: these processes are doing just a few kilobytes of I/O per second, and reading that little doesn't really cost anything. This is a data structure contention issue, caused by the fact that the dentry cache is a singleton shared by all the containers on my system.

Now, to get to the conclusion, which is a little bit controversial: it turns out this customer was running containers on a distribution based on EL6, a very old kernel. The people in this audience will laugh about it, but this really happened — that's why I brought this story — and it was a very big customer. And it turns out that, of course, the EL6 kernel is not the best target platform for containers: even if a process in a container is limited by a memory cgroup, on an EL6 kernel, a 2.6.32, it can still cause kernel allocations in the kernel caches that far exceed the limit set on the memory cgroup. This behavior was fixed a while ago, so on a modern kernel this doesn't happen: the trasher container, if memory-limited, wouldn't have been able to allocate 20 gigabytes of dentry cache objects. But this still proves a point: even on a recent kernel, this particular issue won't happen, but since not every resource is virtualized, there will always be, say, a global per-subsystem kernel lock, or a singleton data structure, that can cause actual contention between containers. And it has to be this way: if every single resource were truly isolated, we wouldn't have lightweight virtualization — we would have another KVM for containers. So the gist of it is that you should properly observe your applications, so that you know what performance to expect.
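If you want to poke at this on your own hosts: on a modern kernel with cgroup-v1 kernel-memory accounting, slab objects such as dentries are charged to the container's memory cgroup, and you can also reclaim dentries by hand. A sketch; the container-id path is a placeholder:

    # kernel memory charged to a container's memory cgroup
    cat /sys/fs/cgroup/memory/docker/<container-id>/memory.kmem.usage_in_bytes

    # free reclaimable dentries and inodes system-wide
    sudo sysctl vm.drop_caches=2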
Kubernetes, after all, is never going to be able to say, in a scheduling decision: oh, this container has too much memory allocated in a kernel cache, so I'm going to be smarter about scheduling it. This is something that you, as an operator, have to handle. That was the first story.

The second one is a little shorter, and I'm going to cut it quite heavily because time is running out. It's called "in search of 0xffffffffff600400". This was a typical Heisenbug that happened to me, one that I could never, ever replicate. It started with us shipping an application to a customer in containers. The application was composed of several containers: we had a Java application, and then a couple of databases, MySQL and Redis. And this customer said: I deployed this application on Kubernetes, but your containers are not starting. He sent me this chart of, essentially, the Kubernetes pod restart count metric, showing that the Redis container was constantly restarting; you can see Kubernetes restarting it with an exponential backoff delay, which means the container wasn't starting at all. There were two interesting things about this problem. The first was that there were no logs at all: nothing from Docker, nothing from Kubernetes, nothing from Redis itself. The second was that, among all the containers that could have failed, the one that failed was Redis, which we used very little in our application, so we hadn't customized it: we took an off-the-shelf Redis image straight from the Docker Hub. And, as you can imagine, it ran very, very well on my machine.

So, again, system-call-based analysis. This is the system call trace of the container that is not starting. You can see it starts with an execve system call — the system call used to execute a new program in Linux — with bash as the executable and an entrypoint script as its argument. This is just the standard kind of entrypoint that runs in pretty much every container and sets up a bunch of things like environment variables. But about five seconds after the execve, here, you can see the process exits with status 11, which means the process crashed. In fact, right before that, you can see a signal delivered from the kernel to the bash process, of type SIGSEGV, segmentation fault: the bash process made some weird memory access that caused it to crash. And right before the signal delivery you can see a page fault event — so there was indeed an invalid memory access, a page fault on the address 0xffffffffff600400. Bash tried to access this weird memory address, and what was even weirder is, look here: the instruction pointer itself was 0xffffffffff600400. So not only was the bad memory access done by bash, it was done as part of an instruction fetch: the bash process was trying to execute a function at this weird memory address, at the very far end of the 64-bit address space. I'm not super advanced in that area, so at first I was very surprised, because bash is supposed to be a very stable program — I had never seen it crash.

OK, so system call analysis done. At this point I needed to fall back on standard developer troubleshooting, let's call it. We asked the customer for a core dump, and I opened it with gdb, and with gdb I just used the standard bt to show me the stack trace of the bash process.
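For reference, that step is just this — the file names are illustrative:

    gdb /bin/bash core.bash     # load the crashing binary plus its core dump
    (gdb) bt                    # print the stack trace at the time of the crash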
And note: it's not Redis. This is the bash process at the time of the crash, and you can see that indeed the lowest stack frame is 0xffffffffff600400 — bash really was trying to call a function at that memory address. But if you walk up the stack, you can see that time() is in the stack trace. And time is a very simple system call that shouldn't do anything other than return the current timestamp. So why exactly did bash decide to call this other weird address instead of just issuing a native system call invocation? gdb is kind enough to point me to the actual source where this time invocation is made, right inside the libc. So I decided, obviously, to open the can of worms and go look at the time.S file inside the libc source code at that version. It's a bunch of assembly that shows you how the vsyscall time is called, and what was interesting is that the address 0xffffffffff600400 is actually defined at the very top of the file, under a name: the vsyscall address of time.

This gave me enough material to go and research what the hell a vsyscall is, and it turns out that the vsyscall is an optimization relic from the past. As you know, whenever you have to invoke a system call there's a protocol: you have to write all the arguments into the CPU registers, and then issue a special instruction to the CPU so that it switches to kernel mode and the system call is executed on the process's behalf. This is much, much slower than a native function invocation, where you just have to advance the instruction pointer and fix up the stack. So the kernel developers said: OK, there are a few system calls that are very, very simple — like time, gettimeofday, getcpu — and it's quite a waste to have them switch to kernel mode just to execute very little and then return. So a long time ago the kernel developers implemented this vsyscall mechanism: they map into every process in Linux the native code for a few system call invocations, so that if user space wants, it can execute them straight from this address, natively, without switching to the kernel. And the time system call was indeed hard-coded at this address.
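You can see this fixed mapping for yourself on any x86-64 Linux box; the permission flags vary with the kernel's vsyscall mode:

    $ grep vsyscall /proc/self/maps
    ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0  [vsyscall]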
And then of course the libc adapted — that's why you can see this in the libc code. But over time it turned out that this vsyscall mechanism is not very secure: mapping native instructions at a predictable location in every single address space is not the safest thing. So the kernel developers said: let's scrap that, and let's implement this as the vDSO instead, which is the current state of the art. For the purposes of this discussion, the vDSO is exactly like the vsyscall, except that the native code is not mapped at a fixed location in the address space but is randomly mapped, as you would normally do with a memory-mapped file. And then of course the libc adapted again and started using the vDSO.

It turns out that, over time, the vsyscall was deprecated more and more, to the point where on a modern kernel the vsyscall doesn't really work anymore, and you end up in one of two cases. On most Linux distributions nowadays, if you try to execute from the vsyscall page, you still can, but you get trapped into the kernel and the vsyscall gets emulated for you, so there's really no native machine code anymore. On the most security-minded distributions, the vsyscall is disabled entirely, and if you try to execute from the vsyscall page, your process gets terminated. And this is normally not a problem, because if you use a modern distribution, whoever packaged it obviously took care of shipping a modern libc. But you can already see where this is going: I'm using containers, and the libc is not a dependency of the Linux distribution but of the base container image. It turns out this customer was using a very modern Debian kernel that enabled vsyscall=none — around exactly one year ago — and the maintainers were kind enough to note: this breaks glibc 2.13 and earlier. Which is fine, because if you use a modern Debian kernel you're presumably using the modern libc that comes with Debian. Can you guess what the libc version was in my Redis container? 2.13. And this was an image straight from the Docker Hub: just a slightly older version of Redis, but quite stable, battle-ready. So just because I was using a Redis container based on an old base image, I got screwed. And of course on my system it was working, because my system, running a slightly older Debian, was still using the emulated vsyscall mode.

Now, even though I didn't experience the crash, remember that the emulated vsyscall mode, as I was saying, is an emulation. So I then did a little benchmark, and look at this: if I call the time system call from a libc version that uses the vDSO, I can do 260 million invocations per second — super fast, this is a native function invocation, incredibly fast. Whereas if I use the emulated vsyscall mode, which causes a trap into the kernel, I can do just one million per second. So there's a massive difference in throughput.
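And the check that would have caught all of this is a one-liner; the image tag here is illustrative:

    # prints the glibc version baked into the image, e.g. 2.13 on
    # old Debian-wheezy-based images
    docker run --rm redis:2.8 ldd --version | head -1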
Of course, with Redis I'm not going to suffer that much overhead, because it's not like my Redis process is just spinning in a loop calling time(), but you get the idea. And the takeaway — which might be completely obvious to this audience, because it's a very advanced technical audience — is that user space really needs to be taken care of, and needs to be loved, on your base distribution as well as in your base images. So even if you download a Docker image straight from the Docker Hub, take care to inspect the Dockerfile, and if you see that it's based on Debian wheezy, say: whoa, OK, maybe I need to change it. And that's what happened; again, this is an issue that really, really happened to me.

And this is all I had. If you take a look at the slides later, there's a little bit of additional bonus content, where in the first exercise I actually use perf to measure how many hash collisions there were per individual d_lookup invocation, and where I run a few experiments looking at how the emulated vsyscall mode works. That was quite useful for understanding the system, especially because I'm not an expert in those areas. I think we're a couple of minutes early, so if there are a couple of questions, I'll be happy to take them. Otherwise, thank you very much.

[Audience question, inaudible: how did you fix it?] Which one — this last one? Well, we essentially just recreated the Docker image of the Redis container with a much newer base image, I think with Alpine, and everything was working fine. It was a mistake on our side to begin with, because we saw the image on the Docker Hub and said, you know, it must be good — it's the official one. You live and learn, so that's definitely something we learned.

[Audience question: you had a slide about the performance of the vDSO against the native vsyscall — do you have an idea why there is such a big difference between the two?] Oh yeah, you mean this one, right, between the first and the second call — why the "native" one is not also doing 260 million. That's because the native vsyscall one is not actually implemented with native machine instructions anymore. If you look here — if you go with gdb and print the content of the vsyscall page — you can see that it's actually just a standard system call invocation: there's a mov of the system call number into the rax register, and then a syscall instruction. Whereas the vDSO is really native code right there. So the vsyscall page still causes a trap into the kernel; it's faster than the fully emulated one, because that one is emulated via a page fault, and the page fault handler has a big switch in the kernel that emulates the call. But that's the reason. Cool — thank you very much again.