I'm Kiran Bhattaram, and I'm here to tell you tales of woe: the tales of the cursed operating systems textbook. I have a cursed operating systems textbook: each chapter I read, there's a new bug in the systems I work on. Lest you think this is another case of frequency illusion, where once you've heard about something it keeps popping up again and again, let me assure you that it's not. The problems this book of spells conjures up are massive and inescapable. They're system-ending problems.

For context: I've been using Unix systems for most of my life (my dad used to hand out Linux CDs at dinner parties; he was a huge hit), and I know theoretically how operating systems work, but I haven't spent that much time looking into the depths of more serious, more involved kernels. My understanding of operating systems is kind of like the story of the blind men examining an elephant: I know what the APIs were and how things worked by observation, but there's probably a trunk here and a snake there. I've never really seen the whole system or dug into its internals. Basically, I wanted to be Lex Murphy: "This is a Unix system! I know this!"

I firmly believe that having an intimate familiarity with the systems I work on will make me a better programmer. I want to abstract things away so I can say "I'm aware of its guts, but I'm not going to focus on them right now," instead of throwing black boxes around things and not really knowing how they work on the inside.

So I guess, if you're thinking about it on a more positive note, it's less a cursed textbook and more a grimoire. It's not a Necronomicon. It's a book.
It's a big book of spells and incantations about these systems of magic (sorry, engineering) that I work with, and some of these spells might cause harm, but that's okay, because you learn things along the way.

So, the first chapter: the memory leak that's coming from inside the kernel. Credit goes to my coworker Nelson for debugging this issue. A while back, Stripe started seeing intermittent sadness with our internal DNS servers: DNS queries would fail, and our servers would periodically out-of-memory-kill processes, despite no application really using all that much memory. (As a hint, this is a little bit of a teaser for Kamal's talk later this afternoon.)

A side note about the OOM killer. I wrote this slide 30 seconds ago, so my phrasing is probably wrong, but I guess I'd say: it's the job of the Linux out-of-memory killer to sacrifice one or more processes to free up memory in the system when everything else fails. If your system is out of memory, you have nowhere else to go.

Looking over the OOM killer's logs, Nelson noted that a huge amount of memory was being used by the slab in the Linux kernel. "Slab" refers to the kernel slab allocator, which is used for internal allocations by the kernel itself. So, essentially, all the box's memory was being used by the kernel, not user applications, which explained why the box was swapping itself to death and OOM-killing everything around it even though no application was using all that much memory.

So, our current state: something is taking up all of our kernel memory, and we're not really sure what. It's taking up enough memory that the OOM killer pops in every now and then and shuts off our NSD process. How do we gather more information? Let's take a detour and talk about /proc.
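As a small taste of that detour: even the OOM killer's bookkeeping is exposed as files. A minimal sketch (Linux-only; the helper name is mine) that reads the "badness" score the kernel currently assigns a process, where higher scores get killed first:

```python
# Minimal sketch: the OOM killer's "badness" score for a process is
# exposed as a file under /proc. Higher scores get killed first.
def read_oom_score(pid="self"):
    """Return the kernel's current OOM badness score for a process."""
    with open(f"/proc/{pid}/oom_score") as f:
        return int(f.read())

print(read_oom_score())  # this process's score right now
```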
/proc is a pseudo-filesystem about process information. It doesn't contain real files, but a bunch of runtime system information about things like what devices you have mounted, your hardware config, and, importantly here, system memory. A lot of the system utilities you might be using are actually calls that read from files in this directory. Relevant to our case is /proc/slabinfo. (We later found out that slabtop is a utility that reads from this and presents it more prettily, but the raw file is what we had at the time.)

Looking at slabtop, we found that there was something called anon_vma taking up a huge amount of memory. Some Googling discovered that there was a bug in the Linux kernel's implementation of garbage-collecting these. Basically, the way we reloaded our NSD process was by forking it and then killing off the parent, but each fork retained an anon_vma object. Having each child become the new parent, and the previous parent exit, resulted in an ever-growing stack of these objects that was never garbage collected.

Armed with that knowledge, we were able to confirm that doing a complete restart of these processes, instead of the graceful reload we had been doing, caused all that memory to be released, and that doing a thousand graceful reloads caused the memory to grow rapidly.

So, as an overview: we talked a little bit about kernel memory versus user memory, how to debug where your memory is going, and the /proc filesystem. And along the way, we got our service discovery back.

So, the second chapter. This is something that Julia worked on.
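(Before we leave the memory chapter: that slabinfo digging is easy to script. This is a rough sketch, not the tooling we actually used; it assumes the "slabinfo - version: 2.1" column layout and parses a made-up sample so it runs anywhere. On a real box you would read /proc/slabinfo itself, which usually needs root.)

```python
# Rough sketch (not our actual tooling): rank slab caches by memory use.
# Assumes the slabinfo 2.1 column layout; the SAMPLE data is made up.
SAMPLE = """\
slabinfo - version: 2.1
# name      <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
anon_vma         500000      600000        80           51              1
dentry             1000        1200       192           21              1
"""

def top_slab_caches(text, n=5):
    """Return (cache_name, approx_bytes) pairs, largest first."""
    caches = []
    for line in text.splitlines():
        if line.startswith(("slabinfo", "#")) or not line.strip():
            continue  # skip the version line and the column-header comment
        fields = line.split()
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        caches.append((name, num_objs * objsize))  # rough upper bound
    return sorted(caches, key=lambda c: c[1], reverse=True)[:n]

for name, nbytes in top_slab_caches(SAMPLE):
    print(f"{name:12s} ~{nbytes // 1024} KiB")
```

In our case, it was an anon_vma line like that one that dwarfed everything else.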
I'm sorry for the pun. I'm dedicating this one to my coworker Andreas, here in the audience, whom I've just started blaming for all my bad puns. The issue behind this section: after I read the networking chapter of my book, Julia started debugging some slow networking issues with the system. The gist was that we were publishing messages to this daemon on localhost, and it took 40 milliseconds each time. This daemon lives on localhost; it's on the same machine; you're just talking between processes. There's no reason publishing to localhost should take 40 milliseconds. That is silly. Computers are fast.

The HTTP library we used sends POST requests in two small packets: one for the headers, and then it expects an ACK; then one for the body, and then it expects an ACK. Now, for efficiency, you want to send full-sized TCP packets, and there's an algorithm called Nagle's algorithm that says: if you have a few bytes to send, but not a full packet's worth, and you have some unacknowledged data in flight, then you wait, until you have a full packet, or until you time out, or until you get ACKs for all outstanding data. Usually this is a good idea. It's there to protect the network from stupid apps, or, like, naive apps, where... I'm missing a slide. Anyway, it's there to prevent situations where you might be sending 10,000 one-byte packets: you delay sending, which combines multiple small packets into a single larger one. Linux by default waits 40 milliseconds. Oh, slides. So, yeah.

The server we were using, on the other hand, had delayed ACKs on. (Julia has a great post on all of this, which is what I've been taking from.) The assumption behind delayed ACKs is that the server usually generates a response to a packet it receives: if you send a "hi," the server responds with a "hello," so you don't have to send a "received" followed by a "hello." So the client sends the request, and then the server sits in silence, waiting for another packet. It's like, "okay,"
"I'll ACK eventually, but maybe there's more to say." And then the client is waiting in silence, like, "well, I'm waiting for an ACK, but maybe there's network congestion." That passive-aggressive period, where both sides are waiting on each other, was where our 40 milliseconds went, until eventually they were done. When Julia set the server to ACK immediately, instead of delaying with "well, maybe there's more to say," we found that everything sped up incredibly.

So we've talked a little bit about networking stacks, why knowing the abstractions there is important, and how my book keeps setting me up for failure.

Speaking of writing operating systems in your own blood: chapter 3 is something that we call "ointz." A note about our backups: we store a small portion of our data in Mongo, we snapshot the disk to take backups of it every now and then, and then we clean those backups and restore them to test out that pipeline. When we restored one of these backups, we saw a pretty cryptic log message. It said something like "caused by: MongoDB exception: the BSON object size is negative," with an invalid size that looks like a bitmask, and the first element was "ointz." We weren't really sure where to go from this. There was some data corruption on the indexes in this thing.
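As context for the explanation that follows: a successful write() does not mean the bytes are on disk, and fsync() is the syscall that forces them out. A minimal sketch (the file contents here are made up):

```python
import os
import tempfile

# write() returns as soon as the kernel has buffered the data; the bytes
# may not be on disk yet. fsync() blocks until they are.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"index points at doc 42\n")
    os.fsync(fd)  # flush this file's dirty buffers to stable storage
finally:
    os.close(fd)

# Only after the fsync is it safe to snapshot the disk underneath, and
# even then you must also block new writes (e.g. Mongo's fsyncLock), or
# the snapshot can capture a half-written state.
with open(path, "rb") as f:
    data = f.read()
os.unlink(path)
print(data)
```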
So the error we got was, sort of, "the index points at a deleted doc," which is confusing. As a note about the way you write to disk: the kernel maintains a bunch of write buffers, so it returns from write calls immediately and then flushes that data to disk asynchronously, later. This means that at any given point in time, you might have half-written data sitting in various buffers while your database is in the middle of a write. Before a database confirms a write, it issues a sync system call that clears out the buffers and writes everything to disk.

So, before we took our disk snapshot, we wanted to make sure that everything the database or the kernel was holding on to had been completely flushed out and written to disk, so you don't end up with "ointz." When we saw this bug arise a while ago, we realized that we did do the fsync, but we didn't actually lock the replica against more writes. So the snapshot had a state where all of the data was flushed to disk, but we weren't preventing more writes from coming in in the meanwhile; we just kept scribbling on the data files while the snapshot was ongoing. This meant that when we cleaned and attempted to remount the database, it had half-written data and barfed. So this is a lesson in non-atomic writes and buffers, friends.

Coming up next is the concurrency chapter of my textbook. I've been told not to read it, especially while on a plane. I'm sure it's fine. So instead I was reading about hard drives and flash memory on planes; also exciting. I'm looking forward to the next bit flip that causes something to fall over. There's also scheduling, and how your operating system ensures that processes get their fair share of the processor, and so many more exciting things. Operating systems are cool: there's a lot going on in there, and they present a fairly simple interface. That's all I have.