Test, test — hey, it works. Okay, it's 11:30; I'm sure people will show up an hour late because they didn't set their clocks right. So let me get started. My name is Steven Rostedt, Linux kernel developer. I work for VMware in their open source department, and they still let me work on Ftrace and other kernel development, including the real-time patch. So I'm still very much involved, and we also have user utilities, which I'm going to talk about, that interact with Ftrace. This talk is about debugging your Linux kernel using Ftrace. First off, this is not a tutorial, so don't expect to leave here knowing everything. Oh, before I do anything, everyone has to smile — this is a real selfie, I'll post it online. Make sure it worked. Awesome. Shows how sparse this room is. So anyway, this is not a tutorial; my goal is for you to come away with an idea of what's out there. I have given tutorials — I did one in New Zealand for LCA — but this is not one, because I have a lot to cover and you wouldn't be able to keep up. How many people know what Ftrace is? Okay. Actually, how many people don't know what Ftrace is? Don't be afraid. Almost 50-50, good. Ftrace is the official tracer of the Linux kernel, as it says over here. If you're running Linux, you most likely have it on your machine. If you have an Android phone and you root it, you have it there too — Android actually keeps it enabled because they have a lot of tools that use Ftrace. So it's everywhere. And it's really easy, because you don't need any tools on top of it to use it. If you know echo and cat, you know how to use Ftrace — those are the only two tools you need.
But there's other tooling I'm going to talk about that makes things easier, because echo and cat can be difficult and tedious to use, and we'll go through that as well. So again, this is not a tutorial; I'm going to jump right in and we'll learn from examples, so we won't get bogged down. Just sit back, relax, enjoy the show. First, the basics. Ftrace itself is a full tracing infrastructure, but its two most commonly used features are called tracers and events, and they're not mutually exclusive — you can use them together. Tracers provide general functionality. There's the function tracer, which lets you trace almost any function in the Linux kernel: any function that isn't inlined by GCC, and any function you don't specifically mark do-not-trace. I'm going to talk about function and function_graph. There are other tracers that check for latencies — for example, how long interrupts have been disabled. Those have more overhead and are not usually enabled in production; they're for debug kernels. You don't want to trace every single time interrupts get disabled and re-enabled, because that code is highly optimized, and putting any tracing infrastructure in there can cause problems, even if it's just a single no-op. Events are basically single no-ops scattered throughout the kernel that Ftrace can dynamically patch into a jump to code that actually does the tracing. So when tracing is disabled, they have essentially zero overhead. I've compiled with trace events enabled but not running — so they're all no-ops — then compiled without them, and run a bunch of benchmarks. Everything was within the noise. So enabling events in your build is not a problem.
Sometimes you have to worry about cache effects, because the events do inject code into the file. Even though it's not executed, it may spread out the code, so your instruction cache — if you know what that is — may miss a little more. Interestingly enough, when I first wrote events, I ran a bunch of benchmarks without them, then added the events and benchmarked again, and the results were actually better with the events compiled in. The little spaces inside the kernel happened to line up the cache to be more efficient. I couldn't figure out how that happened — caches are just a black box, black magic. And there are all sorts of events: scheduling events, interrupts, exceptions, and so on. So where do you find Ftrace? I said Ftrace is on your box — where is it? Usually most distributions already have it mounted. Ftrace basically lets you see the internals of the kernel, how it's working, which means only the superuser can access it. You have to be root — sudo, or whatever you want to do. It would be devastating to security if we ever let ordinary users have full access to Ftrace; Ftrace is basically a way to get around security. Rootkits love Ftrace. No, I didn't say that. That wasn't recorded, right? Anyway, some people don't like to mount the debugfs directory, because debugfs gives you the world of everything debugging-related. Even though debugfs is root-only — and that only happened fairly recently, within the last couple of years — it adds a lot of access to the kernel that some people don't want available even to root. But they still want tracing, because perf relies on it, if you know the perf tool, and PowerTOP relies on it.
So utilities rely on tracing — where can you find it? What we have is: I created a tracefs file system to separate it from debugfs. If you mount the debug file system, the trace file system is still mounted in the exact same location it always was, for backward compatibility, but you can also mount it directly. If tracefs is enabled in your kernel, you'll know by looking for the directory /sys/kernel/tracing — a pseudo directory that gets created when tracing is enabled. So if tracefs exists, that directory exists, and that's where you want to mount it, because tools will look for it there. Once you have it mounted, you go to that directory and do an ls, and you see this lovely set of files. The highlighted ones are some of the more important ones, but don't worry about memorizing them — there are a lot of files in there. This is the API, the interface into Ftrace. For example, say I want to trace functions — if you have your laptop open, you can try this out right now, as root, with the tracing or debug file system mounted. In my slides, whenever I'm root, the prompt is that little hash sign, because I don't do sudo. I never found sudo beneficial; I just su to root, do my work, and exit. I don't want my user account to have access to root commands. If you crack my password and get my user account — which is where I do all my real work — you have no access to root facilities. If it were sudo, cracking my account would get you a bunch of root commands too. I'm against sudo; I've never understood the idea — maybe someone can explain it to me later. But the hash prompt always means a root command.
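If you want to check this on your own machine, here's a minimal sketch, assuming the standard /sys/kernel/tracing path. Mounting needs root, so the script just reports what it finds otherwise:

```shell
# Check for tracefs and mount it if needed (mounting requires root).
TRACEFS=/sys/kernel/tracing
if [ -d "$TRACEFS" ]; then
    # The directory exists when tracing is compiled in; mount tracefs
    # on it unless something is already mounted there.
    grep -qs "$TRACEFS" /proc/mounts || mount -t tracefs nodev "$TRACEFS" 2>/dev/null || true
    echo "tracefs directory present: $TRACEFS"
else
    echo "tracefs not available: $TRACEFS missing"
fi
```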
So I cd into the tracing directory and I just echo function into current_tracer. If I want to stop it, I echo nop into current_tracer and that disables it. And this is the output: after I do the echo, I cat the trace file, and boom — you can see the functions that are running in your kernel, live. There are two files: trace and trace_pipe. trace is not consuming; in fact, reading it pauses tracing so the buffer can be iterated. So if you stop tracing, you can read the trace file multiple times and get the same output every time. trace_pipe is a producer-consumer: it's made to be read while tracing is active, so reading it won't affect tracing, and cat trace_pipe will just run forever. But if you read it a second time, the output you read the first time won't be there. Now say you're tracing and something happens, and you want to stop the buffer writing so you can look at the trace file. There's a special file called tracing_on, and all you have to do is echo 0 into tracing_on. That's how simple it is — and echo 1 turns it back on. But that only stops the recording. Function tracing is still happening: every time it goes to allocate space on the buffer, it gets back a NULL pointer saying nothing was allocated, and it just jumps out — "I don't have a buffer, let's move on." So writing 0 to tracing_on doesn't remove the overhead; any tracing you have enabled still puts overhead on your machine. If you really want your system back to normal, you have to actually disable the tracing, even though the trace isn't recording anything.
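Putting that whole sequence together — this is a sketch, not a definitive recipe; it needs root and a mounted tracefs, so it guards and just prints a note otherwise:

```shell
# Sketch of the function tracer lifecycle (needs root + mounted tracefs).
T=/sys/kernel/tracing
if [ -w "$T/current_tracer" ]; then
    echo function > "$T/current_tracer"   # start function tracing
    head -n 20 "$T/trace"                 # reading 'trace' pauses tracing
    echo 0 > "$T/tracing_on"              # stop recording (overhead remains!)
    echo 1 > "$T/tracing_on"              # resume recording
    echo nop > "$T/current_tracer"        # actually disable the tracer
    result=traced
else
    result=skipped                        # not root, or tracefs not mounted
fi
echo "$result"
```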
A word of caution: be careful about that space. I've had major kernel developers tell me that tracing is broken — "it's not working, I keep doing this and I keep getting a write error, why is this not working for me?" How many people see the bug? Got a few. Okay, for those who don't understand why one works and the other does not: echo 0> and echo 1> with no space are bash redirects of file descriptors 0 and 1 — standard input and standard output, respectively — into the kernel file, not writes of the characters. So I've seen major kernel developers hit this; I go over and say "you need a space." That's the word of caution. Function tracing, as mentioned earlier, does not give you any parameters. It gives you no data except that a function call was hit, and who called it. Very informative: you can see the functions, you can see the flow of the code, but you get no extra data out of it. If you want more information — data, variables — that's not supplied. But the static trace events throughout the kernel are put in by the subsystem developers. They add those little hooks, the no-ops that turn into actual tracing, so they can see information relevant to their subsystem. An interrupt goes off — what interrupt was it? Oh, it's my Wi-Fi device going off like crazy. The way you do this: there's a directory called events in that tracing directory. By the way, any time you see commands from here on, assume we're in that tracing directory until we're not. If I ls events, these are all the trace event systems that are in the Linux kernel. And by the way, I'm not the one who creates these.
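You can see the difference with a plain file — no tracefs needed; `demo.txt` is just a scratch file for illustration:

```shell
# With the space: "0" is an argument to echo, and ">" redirects echo's output.
echo 0 > demo.txt
cat demo.txt        # prints: 0

# Without the space: bash parses "0>" as "redirect file descriptor 0 (stdin)",
# so the command is a bare `echo` and the file just gets truncated.
echo 0> demo.txt
cat demo.txt        # prints nothing: the file is now empty
```

The same thing happens with `echo 1>tracing_on` — fd 1 gets redirected, and an empty line lands in the file instead of the "1" you meant to write.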
I'm the maintainer of the tracing system in the Linux kernel, and when a new event or new tracing code goes in, I usually get CC'd and I make sure they're doing it correctly. But all these systems and events were created by the maintainers of those subsystems. They're the ones in charge; they're the ones who know what data is useful to look at for debugging. It's not me putting these in — this is all done by the actual maintainers. I've added several myself, but that's because I don't just maintain tracing; I also help with the scheduler, interrupts, and such. So I have my own trace points there, but in a different role, not as the tracing maintainer. So how do you enable trace events? Say I want to enable the waking event — when a task wakes up another task to go run. If something blocks on an event, something's got to wake it up. It could be blocked on an I/O device: you do a read from a hard drive, it puts a request in, goes to sleep waiting for the read to come back, then the interrupt goes off and wakes up your process to read from the hard drive. That triggers the waking event, and you can actually see it happen. To see that event, I simply echo 1 into events/sched/sched_waking/enable. By the way, I will post these slides online right after this talk — I literally finished them two hours ago. If I want to enable all scheduling events — I don't care, I want all of them — I just echo 1 into events/sched/enable. And if I want to enable every event defined in the system, I echo 1 into events/enable.
Conversely, if I have all events enabled but want to disable some, I can disable individual ones, disable them by system, or disable all events in one shot. Even if they're not all enabled, echoing 0 into events/enable disables any event that happens to be running. So, what kinds of things do you want to do when you debug a kernel? There are performance issues, warning messages, hung tasks, corrupt data, kernel crashes, even problems at boot up. Tracing has been used to debug all of these, and I'll give you some examples — I can't do everything, because it would be overwhelming for an hour talk. So I'm going to start with a simple example. This is one of the things I like to do because I run it on my own machines: I always like to see how long interrupts last. Are you familiar with how an interrupt works? I'm not sure what the skill level is here — how many people really don't understand interrupts? Okay, I'll do a real quick overview. When you hit a key on the keyboard, the device sends a signal to the CPU. If interrupts aren't disabled, the CPU stops whatever it's doing and jumps to the handler associated with the vector raised by the keyboard. It runs the code to handle the key: if I hit A on my keyboard, it puts A into a little buffer, and then it most likely wakes up whatever process is waiting for key input. Your console wakes up, sees there's something in the buffer — an A — reads it, and displays the A on your screen. That's the basic overview of interrupts.
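The three levels of enabling, and the matching disable, look like this — again a sketch that needs root and tracefs, and skips otherwise:

```shell
# Event enabling at three granularities (needs root + mounted tracefs).
T=/sys/kernel/tracing
if [ -w "$T/events/enable" ]; then
    echo 1 > "$T/events/sched/sched_waking/enable"  # one single event
    echo 1 > "$T/events/sched/enable"               # every sched event
    echo 1 > "$T/events/enable"                     # every event in the system
    echo 0 > "$T/events/enable"                     # and everything back off
    state=toggled
else
    state=skipped                                   # not root / no tracefs
fi
echo "$state"
```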
So, I want to measure interrupt latency. I know that in the Linux kernel there's a function called do_IRQ. All device interrupts — timer interrupts don't use this, there's another function for those, but all device interrupts — go through do_IRQ, which is called once an interrupt triggers and figures out which device raised it. So I echo do_IRQ into set_ftrace_filter, which means I only want to trace this one function, nothing else. I disable tracing — echo 0 into tracing_on — because I want to set up two things at the same time and I don't want one recording before the other is ready. Then I enable the function_graph tracer. The function_graph tracer is similar to the function tracer, except it also traces the exit of a function: it traces where the function enters and where it exits. I enable the IRQ events, because — remember — the function and function_graph tracers don't give me any information beyond "a function was hit." I want to see what caused the hit: which IRQ was it? Then I turn on the absolute timestamps. The function_graph tracer by default, for prettiness, doesn't display that first timestamp column, so I set an option to make it show up — there are options that modify the output of tracing that you can go look at. Finally, I echo 1 into tracing_on and I cat trace.
And here you can see the IRQ, and which interrupt it was. This is actually on my laptop: the ACPI interrupt went off, and it tells you it took 160 microseconds to execute that handler. That's quite a lot — more than a tenth of a millisecond. Then you see that most of the other interrupts only take around 20 microseconds to execute. Granted, there's a little overhead within the events to record them — I think it's about 300 nanoseconds per event, maybe a bit more, 400 nanoseconds or so — and that builds up over time. Now, a quick overview of set_ftrace_filter and set_ftrace_notrace. They are ways to limit which functions are traced. If you only want to trace do_IRQ, like I said, you echo do_IRQ into the filter file. If you want to trace all functions, you leave it empty. If these files are empty, it means all functions — for the filter file, an empty file means trace everything; for the notrace file, it means exclude nothing. What I mean is: any function you put in the notrace file won't be traced. A lot of times I want to trace all functions, but some functions are just overwhelming, like the locking functions — they fill up the buffer. I mean, I do care about locking functions, but usually I'm tracing the flow. If I'm looking for a deadlock, I'll trace the locks; but if I only want to see the flow, I echo a pattern into the notrace file to disable all the lock functions and trace everything else.
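The whole do_IRQ recipe in order looks like this. Note that do_IRQ is an x86 function name, so the sketch checks that it's actually traceable on this kernel and skips otherwise; it also needs root:

```shell
# Interrupt-latency recipe: function_graph on do_IRQ plus the irq events.
T=/sys/kernel/tracing
if [ -w "$T/current_tracer" ] && grep -q '^do_IRQ' "$T/available_filter_functions" 2>/dev/null; then
    echo 0 > "$T/tracing_on"                     # don't record during setup
    echo do_IRQ > "$T/set_ftrace_filter"         # trace only this function
    echo function_graph > "$T/current_tracer"    # entry AND exit -> duration
    echo 1 > "$T/events/irq/enable"              # so we see which IRQ it was
    echo funcgraph-abstime > "$T/trace_options"  # show the timestamp column
    echo 1 > "$T/tracing_on"
    sleep 1
    head -n 30 "$T/trace"
    status=ran
else
    status=skipped                               # not root / no tracefs / no do_IRQ
fi
echo "$status"
```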
So basically, the way this works is: it first looks at the functions in set_ftrace_filter. If it's empty, that means all functions; if not, tracing is limited to the functions in that file. Then it looks at the notrace file, and any function there won't be traced. Notrace overrides the filter: if I put do_IRQ in set_ftrace_filter and also put do_IRQ in set_ftrace_notrace, it traces nothing. Yes — whitespace-separated, so you can list multiple functions, and you can put them on multiple lines. Thank you, I planted him. Now, like was said, all functions are traced except those explicitly marked notrace. There are certain critical functions that Ftrace itself calls into, and if one of those gets traced, it causes recursion and crashes the machine. Every so often, in some architecture, someone adds a function into one of those critical sections, and suddenly tracing causes the system to crash. So how do I find which function it is? If I enable all functions, it crashes; if I enable a few, it doesn't. So I know some function shouldn't be traced, and I bisect. I take available_filter_functions — the file in that directory that lists every function that can be traced — and I cat half of it into set_ftrace_filter. If it doesn't crash, I know the bad function is in the other half, so I take that half and give it half of that. I go back and forth, and after a few iterations I find the functions that are bad. Yes — question: when does it take effect? It depends. For something like tracing_on, echoing 1 usually kicks it on right away.
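That halving step can be sketched generically. Here `funcs.txt` is a stand-in for available_filter_functions (the real file is huge), and the two halves are what you'd feed into set_ftrace_filter on successive boots:

```shell
# One bisection step over a list of traceable functions.
# funcs.txt stands in for available_filter_functions.
printf 'fn_a\nfn_b\nfn_c\nfn_d\nfn_e\nfn_f\nfn_g\nfn_h\n' > funcs.txt

total=$(wc -l < funcs.txt)
half=$((total / 2))

head -n "$half" funcs.txt > first_half.txt            # try tracing these first...
tail -n +"$((half + 1))" funcs.txt > second_half.txt  # ...then these if no crash

echo "split $total functions into $half + $((total - half))"
```

Whichever half reproduces the crash gets halved again, until a single untaceable function remains.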
For the filter files, though, it's not immediate — it doesn't happen until the close of the file. If you open it and put a bunch of things in, the writes just go into a buffer; it has to be on the close, because if you echo a bunch of functions in, it sets them all up, and then the close says "okay, now execute." What's really happening is that you're not saving what you write. You're flipping a bit in the function table that lists all the functions, marking them as prepared to be updated. When you close the file, it goes through the list and actually enables those functions. If you already have a set of functions and want to add another one, you use the bash append redirect, >>. If you write a function prefixed with '!', it removes it from the list. And a plain truncating write clears the whole list. Anyway, that's about all I'm going to say about the Ftrace file system. If you want to know more, download the Linux kernel source and read Documentation/trace/ftrace.rst — it used to be .txt, but I guess RST is the new way of doing things. Basically everything I've talked about, and pretty much every option and every file in that directory, is documented in this file. I know people want to take pictures, so I'll leave the slide up — pictures last forever. And Ftrace works even on BusyBox: if you have BusyBox, you have cat and echo, and that's all you need to use Ftrace.
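Those list-editing operations look like this — a sketch needing root; `schedule` is just a conveniently short traceable function used here as the second entry:

```shell
# Editing the function filter list (needs root + mounted tracefs).
T=/sys/kernel/tracing
if [ -w "$T/set_ftrace_filter" ]; then
    echo do_IRQ > "$T/set_ftrace_filter" 2>/dev/null  # replace list with one entry
    echo schedule >> "$T/set_ftrace_filter"           # '>>' appends a second one
    echo '!schedule' >> "$T/set_ftrace_filter"        # '!' removes it again
    echo > "$T/set_ftrace_filter"                     # truncating write clears all
    op=edited
else
    op=skipped
fi
echo "$op"
```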
But it gets very tedious. It can be problematic and hard to use, because you have to know all these commands and all these files, and you end up maintaining scripts and so on. There's got to be a better way — that's why I wrote trace-cmd. trace-cmd is an executable that knows all about these files and does the work for you, without you needing to know much about them. Now, I talked about trace and trace_pipe, but when you read those, it's ASCII. trace-cmd wants to use the binary data, because it's all about speed. The Ftrace ring buffer is per-CPU, and trace-cmd reads the raw data straight out and puts it directly into a file. Anyone here know about the splice system call? Ah, two. So most people here have never heard of splice. Splice is a great system call. It's not well known, and it's Linux-only — not POSIX — but it's awesome. Say you want to write your own version of the cp program. The idea is: you open a file descriptor pointing to the source file you want to copy, open a file descriptor for the destination, and then read into a buffer in your user space. The kernel has to send the data from the disk up to your user space program, and then your program turns around and sends it right back down to the kernel to write out. That's a lot of wasted time, copying up into user space and then back down into the kernel. Is there a better way? Maybe you could memory-map it, but sometimes you're dealing with devices you can't map. What splice lets you do is say: hey, I've got a file descriptor here, I want to attach it to a pipe. And I have a file descriptor over here...
...which I attach to the pipe going the other way. In between, I call splice — it actually takes a file descriptor to a pipe, and then the pipe back to a file descriptor, so it's two splice calls. All splice tells the kernel is: take this data and put it into the pipe. That just copies a pointer, so the data doesn't actually get copied anywhere. Move it over here; now take the data here and move it into there. The page just moves directly inside the kernel — take this page, put it there — no copy. Ftrace uses this for zero-copy writing out of the ring buffer. The ring buffer is broken up into pages, and the way the Ftrace ring buffer works, it swaps a written page out with an empty page. Once a page is written, it's never copied again; it moves straight out to the file being written. So trace-cmd is really fast. To use it, clone the Git repo at the bottom of the slide, or go to git.kernel.org, search for trace-cmd, and you'll find it in the list — so you don't have to remember the URL. It's in distros, but hold off on the distro packages until we finish working on the packaging — someone's helping us get a proper packaging setup. Clone it, run make, then make doc, then sudo make install and sudo make install_doc. I put sudo there because most people use sudo. I don't — I just switch user — but people say don't do that, or at least don't show it. I do it; you just don't.
And then you have the man pages: all the commands are documented. man trace-cmd shows you every subcommand that's available, and if you want to see what the record command does, man trace-cmd-record explains how record works, and the same for report. So again, this is not a tutorial — the information is there offline. Now let's do a little anecdote. Anyone here heard of CONFIG_NO_HZ_FULL? Anyone here a real-time person? One real-time person, okay. How about high-frequency trading and all that — anyone? Okay. So, NO_HZ. The idea of NO_HZ is that when your system goes idle, you don't want it to wake up, because of power management. The longer your CPU stays asleep, the deeper the sleep state it can enter and the less power it draws — and of course, the longer it takes to come back out. The kernel has a thing called a jiffy, which is a clock tick — if you've ever seen HZ, that's the tick rate. While your tasks are running, the tick is used for scheduling and a lot of other things: timeouts, timers, networking packets, stuff like that. So the jiffy goes off — tick, tick, tick. But when the CPU is idle and there's nothing to run, why keep the tick going? So NO_HZ disables the tick: your system goes idle, it can enter a real idle state, and that CPU may never come back online, because nothing's running and there's no tick. When it wakes up again, it looks at the internal clock and fills in all the missed jiffies. And if there's a timeout pending, it knows as it goes idle — it checks, and makes sure it wakes up in time for the next timeout.
But if there's no timeout and nothing on that CPU — no timers going off, because if there were a timer on that CPU it obviously wouldn't turn off the tick — then the CPU can get into a much deeper sleep state. Your Linux laptop lasts much longer with NO_HZ than it would with the tick always running. NO_HZ_FULL is where the real-time part comes in. What about a process running in user space? The tick is mostly about scheduling. If you don't have any interrupts going on and your process is just computing — calculating pi to the millionth digit, working entirely in its own memory — or maybe you've memory-mapped a device and you access it directly, skipping the kernel, then you don't want the kernel to interrupt you. What NO_HZ_FULL does is: if there's a single process on a CPU you've set up as nohz_full, then when that process goes into user space and there's nothing else in the scheduler queue, it turns the tick off. So the process spins without any interruption from the kernel. It sort of works — there's a lot of other housekeeping going on, but it can drop the tick quite a bit. So let's see if it works. Frederic Weisbecker, the author of NO_HZ_FULL, put up a little test, and it has two parts. There's a script — there's a bit of work you need to do to make sure all the interrupts and everything are moved off a CPU. You pick a CPU, move off all the interrupts, do the preparation so the CPU is pristine for just one task. And then you run a task on it, and the task he had was a simple while loop: main, while(1).
One thing — I posted a little patch, because I found the loop was taking a lot of page faults. Not right at the beginning; page faults were scattered in later, for some reason — I don't know why. So I added an mlockall() right at the start, which basically says: pull in all your memory right now, don't page fault later. Everything goes into the page table up front, all the page faults happen at the beginning, and then it should run forever untouched. So I executed this, and this is the command I used: trace-cmd record, because I'm going to record the data into a file. If you replace record with start, it just enables tracing into the Ftrace ring buffer, like I did through echo directly to the file system; record actually copies it out to a file so I can look at it offline. -p means plugin — tracers used to be called plugins when I wrote trace-cmd, and now plugin means something else, but for backward compatibility and historical reasons -p is still there. I hate it, but I don't like to break backward compatibility. So: -p function_graph, and then the highlighted part, max graph depth. That tells the function_graph tracer to trace only the first function called into the kernel, instead of the whole call graph. The first function it calls into the kernel — that's all I want; don't trace anything else. So it only shows me entry into the kernel: I'll see page faults, I'll see system calls, I'll see any time user space enters the kernel — and only that. So I record it.
I added interrupt and scheduling events, I kicked it off, and then I used taskset, a command that says run this user loop on CPU 3, because that's the CPU I had set to nohz_full. Then I did trace-cmd report; the -l option means show the latency format, which gives you that little interrupt-state column right there. And I filtered so I only saw CPU 3. But later on I had a bunch of stuff, and I'm looking at this and thinking: huh, there's a lot of text here. Why is it doing this? I wanted a visual. I wanted to see where the time went, and I couldn't get that; it was too much text, too complex. I mean, you can see it, okay, yeah, but that doesn't tell me what I want. I want to know how much the kernel is interrupting my user-space program. So: introducing KernelShark. KernelShark is a GUI front end for trace-cmd. It was originally created in 2009 under GTK2, but then GTK3 came out, and I hear GTK4 is out now, and basically I had the choice of rewriting KernelShark for GTK3. Finally I threw up my hands and said, let's do Qt. So VMware hired Yordan Karadzhov (I'm getting closer to pronouncing his name correctly) to work with me on it. He's now a co-maintainer of KernelShark and will soon be the official maintainer. He wrote the code from scratch; I reviewed it, but he did the grunt work. It's in the same repo as trace-cmd.git, though that might change, I don't know, we'll see. And instead of doing make doc, you do make gui and it builds KernelShark. This is what it looks like. This is the run on that user-space task, and here I filtered out CPU 0; there are ways to filter CPUs. By the way, kernelshark.org will have all the documentation on how to do this. It's not quite there yet; it still has the old GTK code on it. The new version exists, but I haven't switched the site over yet.
Once 1.0 is officially released (it's still not quite released; I'm hoping maybe next week, and I've been saying that for the last month), you'll have the documentation and everything. But see, okay, that line right here: I filtered to show only CPU 3, and this line shows the process running. It's that thick orange line, though it could be different colors. When there's no line, that means no process is running and the CPU is idle. These little ticks here are events that it found. So you can see the kernel is doing something while that task is running; otherwise it would be a smooth line. I clicked marker A and selected here, then clicked marker B and selected here, and up here you can see the difference. What is it? I can't really see it. It's four seconds. So every four seconds, the kernel interrupts. That's pretty good: normally the tick goes off something like a thousand times a second, and while this process is running it only goes off once every four seconds. Not what we want, but much better than what we had. So the task can run uninterrupted by the kernel for four seconds at a time. Then I ran something in the background, on another CPU, in user space, at normal priority; it was just my own user task. And this happened. See that big blob right there? Within the four seconds I get this solid line of events. I zoomed in. They're about 12 to 16 apart... microseconds? No, milliseconds, sorry, wrong order of magnitude. A tick is going off every 12 milliseconds or so, on a CPU that's not supposed to be interrupted at all. How does this happen? I know at least one person here knows the answer. By the way, I zoomed in here, and the function being called all the time is smp_call_function_single_interrupt.
That's an inter-processor interrupt, an IPI: one CPU sending an interrupt to another CPU for some reason. So I want to know what the heck is going on. This is the code; I took the latest kernel and said, here's what the code is. And don't worry about this trace call here. It's not really a tracepoint; it's something that helps tracing within interrupts, and it has nothing to do with this talk, so ignore that line. But this guy here calls here, and I kept going down, looking at what this function called and what that one called, and finally I got to this file, where it passes in this call descriptor. So this is the function that the other CPU asked it to call. And what is that function? I don't know. So I'm stuck. I get this IPI, some function is in there, and I don't know what's calling it. So, okay, let's do something different. Instead of max_graph_depth=1, I don't want a depth limit; I want to see the whole call graph, but only for smp_call_function_single_interrupt. I don't want to graph anything else. So I ran trace-cmd with the -g option and gave it that function name, which says: graph this function only, don't graph anyone else. Then I ran it, and I could see who called what and follow everything, and this is where the func was. Now I can see what that func variable was: a function pointer that was passed in, and this is the function that gets called. So I git-grepped for that function name, found the file, and looked at it. It's called from two functions. This is what runs on the other CPU; the other CPU called this function and said, I want you to run this. So I said: let's trace those functions. This time I don't care about the function graph tracer; I only want to see those functions, and I want to see who calls them.
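As command lines, the two passes he describes would look roughly like this. The caller names in the second pass are placeholders for the two functions he found, and the exact option spellings should be checked against your trace-cmd version:

```shell
# Pass 1: full call graph, but only underneath the IPI handler
trace-cmd record -p function_graph -g smp_call_function_single_interrupt

# Pass 2: the other direction. Trace only the suspected caller functions
# (-l writes them into set_ftrace_filter) and record a stack trace on
# each hit, to see who called them.
trace-cmd record -p function --func-stack \
    -l suspected_caller_one -l suspected_caller_two
```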
The graph tracer tells you what a function calls, but the function tracer, with an option, func_stack_trace in trace-cmd, will give you the stack trace back up, so I can see who called this function. Now it's the other way around: the first time I wanted to see what this function called, so I used the function graph tracer; now I'm using the function tracer with the stack dump to ask who called this function. So I put in this command. The -l means limit: what trace-cmd actually does is write these function names into set_ftrace_filter. That's all it does; it's just like cat-ing those function names into set_ftrace_filter. And I added the func-stack option, which says give me the stack trace. By the way, trace-cmd won't let you do this, but by hand you can enable the func_stack_trace option with every function in the kernel enabled, and then expect your laptop or whatever machine to run at Commodore 64 speed, because every function in the kernel is doing a stack trace. That has a lot of overhead, something like 10,000x. I've done it by accident and had to turn it off: you type on the screen and wait ten seconds for the keys to appear, and then you say crap, because you made a typo, and you wait again to fix it. Anyway, I did this for those functions, looked at the stack trace, and there it is: this is who called the function that sent the IPI. And I see proc_reg_open and cpuinfo_open. Does anyone know what application, or what action, would get those functions called, proc_reg_open followed by cpuinfo_open? Anyone familiar? Yes. You mean this? I ran that on the other CPU as a normal user. And what I found out, and I only found this out yesterday, thanks, Brendan, is that cat /proc/cpuinfo is not just a read-only operation. It causes something to happen.
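You can see the effect he's about to explain on most x86 Linux boxes; the reported frequency is recomputed on every read:

```shell
# Read the reported clock speed twice. On x86, the "cpu MHz" field is
# recalculated each time /proc/cpuinfo is read, so the values differ
# from run to run.
grep 'cpu MHz' /proc/cpuinfo | head -2
sleep 1
grep 'cpu MHz' /proc/cpuinfo | head -2
```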
See that "cpu MHz" field? Well, it's calculated, and it can only be calculated on the CPU it's reporting for. Try it yourself: run cat /proc/cpuinfo, then run it again, and you get a different number. Every time you run it, you get a different number. So reading it sends an IPI to every CPU saying, calculate this for me. Pretty expensive. Back to debugging the kernel. So: warning messages. Everyone loves these. How many people have seen one? Not many. Okay, every so often you see one. A warning message means something happened in the kernel that wasn't supposed to happen. And I really wanted to do an anecdote on this, but you know what? I hate to say this: the Linux kernel has improved so much, because of linux-next and all these zero-day bots and fuzz testing, that I couldn't find one. I've done this before; I've found bugs, debugged them with Ftrace at a conference, and presented it, and I was going to do that here. But all the available bugs are on strange hardware that I don't have. And I stepped back and thought, oh my God, the kernel is so solid now, I don't hit these much anymore. But if you have strange hardware, you will hit them. So, to cheat... well, first I want to talk about stopping the trace on a warning. Say you enable a bunch of tracing, and when you hit a warning, you want to see that trace. You're not going to be reading the buffer the whole time; you just care about what led up to the warning. So I'm just going to let tracing fill the ring buffer.
And when I hit the warning, I want tracing to stop, because otherwise the ring buffer, which is a finite resource that wraps around, will overwrite the cause. I want it to stop so I can analyze what's in memory. So there's a thing in /proc/sys/kernel. If you ever list /proc/sys/kernel, you'll see a bunch of flags, little ways of tuning the kernel, and one of them is traceoff_on_warning. Echo a 1 in there, and a warning will turn off tracing. Now, because the kernel was so solid, I figured, let me just throw in a warning myself. So I put a WARN_ON_ONCE into rt_mutex_setprio that fires if anyone boosts a process to priority 99, which is the highest priority (in the kernel that's 0; it's inverted, and there's a reason for that). I don't want anyone boosting tasks that high; basically, it fires if a priority-inheritance boost happens. So I took my migrate code, which I usually use for testing migration of real-time tasks (it's on my home machine in migrate.c), and compiled it. I enabled tracing of all functions, echoed 1 into traceoff_on_warning, and checked that tracing was enabled: yeah, it's on. Then I ran my migrate task passing -p 99, which says use the max priority, and it does priority inheritance. And what happened? Boom, something got boosted, and the warning triggered. I do a cat and, sure enough, zero: tracing is off. dmesg shows the warning, saying hey, something happened, just letting you know. That's how it works. And I did a trace-cmd show just to check, and you can see it's the migrate test, and it kicked off the warning. So there, I got to see it. If you ever want to stop tracing on a warning, that's what you use. Now, what if I want to look at it? It's still just sitting in the ring buffer.
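Put together, the freeze-on-warning setup he walks through looks something like this (paths assume tracefs is mounted at /sys/kernel/tracing):

```shell
# Stop tracing the moment a WARN_ON() fires
echo 1 > /proc/sys/kernel/traceoff_on_warning
# or equivalently: sysctl -w kernel.traceoff_on_warning=1

# Let the ring buffer run freely until the warning hits
trace-cmd start -p function

# ... reproduce the problem ...

# After the warning, the buffer is frozen:
cat /sys/kernel/tracing/tracing_on    # now prints 0
```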
So I want to look at it offline. trace-cmd extract will take the current ring buffer and put it into a trace.dat file, and then you have all the functionality of trace-cmd report or KernelShark. Let's move on to kernel panics. You hit a BUG_ON(), it crashes. Okay, what do we do then? Your system just crashed; there's no cat-ing the trace file. Say you had enabled all functions. By the way, if you want to test this: hey everybody, open up your laptop running Linux and echo c into /proc/sysrq-trigger. It will crash your kernel. It really is a way to do it. You may first have to enable it: remember, /proc/sys/kernel has a bunch of tunables, and one of them, sysrq, enables the sysrq triggers. Echo 1 in there and you enable all of them, and one of them is "crash my kernel." So: ftrace_dump_on_oops. This is a way to print everything in the ring buffer out to the console when the kernel oopses. If you have a serial console, which very few people have today, though there are ways to do it, you can get it out to the console. So when the kernel crashes, it actually prints out the buffer. By the way, if you have a serial console, especially one running at a really slow baud rate... what is it, 115200, something like that, I can't remember. Okay, most servers have them. And virtual machines have serial consoles too, which is awesome; I always set up a serial console on my VMs, and if one crashes, you can capture it easily. Anyway, I usually echo 10 into buffer_size_kb, so I shrink the buffer, because by default it's 1.4 megs per CPU, 1.4 megs of data per CPU. Although note that at boot it isn't actually allocated; only when you first use tracing does it expand the buffer.
So when you have Ftrace built in, you're not using 1.4 megs per CPU at boot; only when you first do anything with tracing does it allocate the buffer, and the default size when you enable tracing is 1.4 megs per CPU. You can change the size for each CPU individually, or all at once with this file. So I always echo 10, making it about 10K, because if you have 1.4 megs on seven or eight CPUs and you trigger ftrace_dump_on_oops, well, maybe you'll get the dump by next week; it's a lot of data to wait for. By the way, you can set that in trace-cmd with -b: trace-cmd start -b 10 -p nop just set my buffer to 10K per CPU. So how do you enable ftrace_dump_on_oops? It's a sysctl, another /proc/sys/kernel knob; anything in /proc/sys/kernel can also be set through sysctl. Echo 1 into it and it turns on ftrace_dump_on_oops, which dumps all CPUs. But say you only care about the CPU that triggered the oops. Maybe you're not worried about a race condition, only about what led up to the crash. Then you can echo 2 instead, or put ftrace_dump_on_oops=orig_cpu on the kernel command line, and when it triggers, it will dump only the CPU that crashed, not all of them. And this is what happens: I triggered it, and boom, you get a little message that says "Dumping ftrace buffer," and then it writes the buffer out to the serial console. kexec/kdump: how many people use this? Not enough. I love kexec/kdump. RHEL, Red Hat Enterprise Linux (I used to work for Red Hat) uses it all the time. It's awesome. The first thing you do is reserve crash memory, say 128 megs; that's where the crash kernel is going to be loaded.
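The dump-on-oops knobs he just covered, collected in one place (values as in the talk):

```shell
# Shrink the per-CPU ring buffers so a serial-console dump is manageable
echo 10 > /sys/kernel/tracing/buffer_size_kb
# or: trace-cmd start -b 10 -p nop

# Dump the ring buffer to the console on an oops
echo 1 > /proc/sys/kernel/ftrace_dump_on_oops   # dump all CPUs
echo 2 > /proc/sys/kernel/ftrace_dump_on_oops   # dump only the oopsing CPU
# Boot-time equivalent: ftrace_dump_on_oops=orig_cpu on the kernel command line
```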
You run kdump's service, or have systemd or whatever your init process is start it, and it loads the crash kernel. It also creates an initramfs that will be used when the system crashes. Then, on a crash, instead of a normal reboot, the system jumps to that kernel loaded in memory, a different kernel than the one you were running. That kernel boots up, mounts the file system you told it to, does a core dump of the entire memory into /var/crash as a vmcore, and saves it there. So now you have access to a core dump, and you can analyze it with GDB. And I took a snapshot (this is QEMU/KVM) of my virt-manager window while this was happening, because it didn't go over serial. Down at the bottom, you can't really see it, but it's showing a progress bar as it writes the vmcore out to the file. Crash: has anyone here used the utility crash? I didn't put the link up; darn it, I was supposed to. It's hard to search for; "crash Linux" is not a great Google query. It's by Dave Anderson; maybe when I upload the slides I'll add the link. Crash is awesome. He also works for Red Hat. It runs like GDB; in fact, if you download it and compile it, it will download and build a full GDB for you, and it's an extension on top of GDB. It's basically GDB that knows all the structures of Linux.
So you run crash, you pass in the vmlinux kernel and your vmcore, and it boots up, and now you can do things like ps, and it shows you all the processes that were running. It's really, really powerful. I recommend learning it, but that's not part of this talk. What I did was, one day Lai Jiangshan (I think that was his name; he's not really a kernel developer anymore) asked me: hey Steve, this trace.dat file that trace-cmd uses, can you tell me the format of it? I said yes: download the git tree from trace-cmd.git, type make doc or make install doc, and do man trace-cmd.dat. That gives you the format documentation, the description of how trace-cmd lays out the file. Two months later, I got CC'd on a patch to crash that built an extension from it. You type (this is part of crash now) extend trace.so, which loads it, and then trace dump -t trace.dat. It reads the vmcore, pulls out the Ftrace ring buffer, figures out the events, creates everything, including the kallsyms file, and writes out a trace.dat file. Quit, run trace-cmd report, and boom: you have your data from when the crash happened, and you can pull it up in KernelShark and see everything that led up to it. By the way, that first line here is where the crash happened. This right here spat out a bunch of prints; this is where it kicked off the kexec; and that's the last bit of it, which was, you can't really see, 21 milliseconds. [Audience:] The Ftrace buffer is only useful if you're tracing something, yes? So is this useful for applications where you can actively reproduce kernel crashes?
No... well, the thing is, believe it or not, since you can change the size of the buffer and so on, there are people who run Ftrace on production machines, enabled and recording. As long as you have an idea what you're after, you can record certain things, certain events, and it has very, very low overhead. It's very fast; you can record an awful lot. I don't mean full-blown function tracing, but you can have a lot of critical events on, and if a crash happens, it'll spit them out and you have an idea what was going on. It's another way of seeing what's happening. By the way, about this crash utility: when I worked for Red Hat, there was a bank that was crashing on the real-time kernel, and they couldn't give me a reproducer; it was on their production machines. So I would just say, turn on this tracing. They were able to reproduce it and crash it, they sent me the core dump, I analyzed it and figured out, oh, the problem's between here and here; okay, ignore everything else, turn on some different tracing. Core dump, three or four iterations, and I had a fix written for the bug within a week. [Audience question.] No, no, I'm saying people run it on production machines with events enabled. If you enable all tracing, yes, that's different. It matters what you enable; we'll have to take this offline. [Audience:] This is going to be very helpful, because we have a special kernel module, and we can tell customers to enable tracing for just that module. Right. By the way, one of the features, it's in the documentation of set_ftrace_filter: you can echo :mod:ext4 into set_ftrace_filter, and it will enable only the functions of the ext4 module. So there's actually a command in set_ftrace_filter to say, just trace this module. Anyway, time's almost up. In fact, I think, yeah, I have five minutes.
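The module filter he mentions, spelled out (ext4 as in the talk; any loaded module name works):

```shell
cd /sys/kernel/tracing
echo ':mod:ext4' > set_ftrace_filter   # limit tracing to ext4's functions
echo function > current_tracer
cat trace | head
```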
When all else fails, I just want to do one last thing. This one means you have access to the kernel code and can modify it: trace_printk(). trace_printk() should not be left in; if you enable it, you get a really nasty dmesg notice that says you are running a debug kernel, and if you see this from your vendor, notify them immediately. I added that to keep people from ever shipping with it enabled. It won't hurt anything; it's just that there are better ways, trace events, to do this properly, and I don't want people to get sloppy. People get sloppy with trace_printk() because its output is not a good parsing target; it's a debug thing. It's like printk(), and printk() is like printf() in C, but printk() and trace_printk() have special format extensions. You can use %pS with a function pointer and it will translate it to the symbol name, and there are formats for IRQ flags, bitmaps, MAC addresses, IP addresses (that one is lowercase p, capital I; it looks like an l, but it's a capital I), et cetera. There's a bunch of things you can do. With trace_printk() I always take the shotgun approach, and this is my shotgun approach: I took prepare_task_switch() inside the scheduler, and I put in trace_printk() calls. The first one has some information about the parameters, and everywhere else you'll see __func__ and __LINE__, __func__ and __LINE__; I just cut and paste and blast it, so almost every line has this trace_printk() on it. This is for when something crashes, or something changes, and I want to know exactly when it happened. So I put all these in, and I compiled the kernel. By the way, __func__ and __LINE__ are handled by GCC: it substitutes the function name and the line number. And this is what I got. This was it running. Now, I want to see if anyone can spot something here.
By the way, trace_printk() prints the calling function's name first, and then your format string. So why is the function name here different from the one over here? Anyone have an idea why? You would think they'd match, since it's called in that function. This first one is the function name resolved from the instruction pointer, and this one is the name it read from __func__. Yes? Inlined? Correct. This function, if you notice up top (you can't really see it), is static inline. When it gets inlined, the symbol disappears from the symbol table. That's why I put __func__ in there: it catches inlining. __func__ is resolved at compile time; it puts in the string name of whatever function you're in, whether it's inlined or not. The kallsyms lookup works from a table of whatever functions exist, and if a function gets inlined, its symbol disappears. This one was called from the scheduler and got inlined into the scheduler, so the instruction-pointer lookup gave me schedule, and __func__ gave me the actual function that was called. That's why I always use __func__. Some people say, why put __func__ there when you have the instruction pointer? Because sometimes it's inlined, and GCC may inline something that isn't marked inline, so be careful. Anyway, thank you. Questions? Sorry, what? Good question. I usually don't, because CPU 0 is usually where the default things run. Some machines won't let you shut down CPU 0, and some will; it depends on the architecture. Really, there's nothing technically wrong with it, but it matters whether the hardware expects to have some services running there. Sometimes you can't move an interrupt; if an interrupt is hard-coded to a CPU, it will usually be CPU 0. So CPU 0 is usually where I move all the interrupts to, and then I move everything else off of it. That's usually why I do things that way. Any other questions?
Okay, thank you. Oh, no, wait, yes: right after this, I'm going to, well, I got done with these slides literally two hours beforehand, and then I started doing a few more tweaks, so right after I'm done here I'll post them up to the Scale site and upload them elsewhere as well. And it looks like it's exactly one hour, so thank you very much.

We'll give people just another couple minutes to come in so we can start, quote-unquote, on time. Is this thing still on? Cool, shall we get started? How's everybody feeling today? Y'all a little punchy after a long weekend and a very filling lunch, and setting up and tearing down booths, and seeing very heady or very lighthearted talks all through the weekend? Yeah, yeah, lots of nods and lots of snoozes, that's excellent, you know? Nobody's snoring that I can tell, so we're doing great so far. Awesome, cool. So, for those of you who don't know, my name's Brian Weber. I'm a site reliability engineer at Twitter. Anybody else here who's also an SRE? Throw your hands up. All right, no? Oh, production engineer for you Facebook folks, all right, close enough. Okay, what about a sysadmin-style title? Hey, cool, love that. What about anything DevOps related? Yes, no, maybe. Okay, and how about managers? I want to see the managers out there. I have a lot of love and respect for managers, because you do the very thankless job of putting up with people like me en masse, at scale. So I love and respect everybody who is a manager. You're all great, but that's not what this talk is about. This talk, I wanted to give a little peek into the world of what I do as a site reliability engineer. So I've collected a couple of lighthearted war stories, just some issues from dealing with production services.
I can't talk about the really big outages, because otherwise comms will come and fire me, and I don't want that to happen, because I actually like what I do most of the time. So I want to cover some of the fun things I've encountered, some of the lessons I've learned, and some general stuff I get to deal with on a day-to-day basis. It seems like I may be preaching to the choir here, but what the hell, let's go for it, right? So, again: SRE at Twitter for almost four years now, but I've been doing stuff generally in this field for over a decade, going back to my days of doing telephone tech support for a company that sold voice-over-IP systems, basically PBX systems over VoIP. You'd buy this box, slot it into your rack, plug in a bunch of phones, and they'd wire up, and eventually your own network would disagree with all this voice traffic going over it, and they'd call me, and I'd be the guy who'd start with: have you tried turning it off and on again? They'd say no, and then, well, we'd go further. That's where I cut my teeth learning how to troubleshoot systems in production, because your customers at that point were not half a million Twitter users or a similar number at other companies, but the dozen or so call-center employees trying to make money for whatever company was our customer at the time. Those were people who were mostly interested in just being able to do their job every day, or who were threatening to sue us if things went wrong, or who just didn't know how to clear their voicemail after they'd listened to it: the simple stuff. So while I've been working in tech, the way I really see myself is that I've been working in customer service.
My history goes back even further, because I've also worked in restaurants, I've been an instructor, I've done some sales, I've done a little of this and that for a long time before I actually started doing any legit tech work. That's what helped me be really good at tech support: I picked up the actual tech skills along the way as I needed them, and that's part and parcel of what I'm going to end up talking about in this talk. So hopefully you can remember that, and hopefully the next time you're talking to somebody who says, gee, I wish I could do tech, but it's too hard, I can be a shining light to say it doesn't have to be some horrible blood sport for college grads. So, when I moved up to the Bay Area about seven years ago, that's when I got introduced to Python (thank you, Facebook, for that), and I totally fell in love with the language. I really enjoy writing simple tools in it, which is a far cry from writing god-awful Perl scripts back in the day; and that's not a knock on Perl, that's a knock on my abilities as an engineer, so take that for what it is. So, I've worked on all kinds of things: product support, general services, advertising. Right now I'm the sole SRE with the Infosec org at Twitter. We've had SREs turn into Infosec engineers, but I'm actually working with them as an SRE, helping the services we run make Twitter more secure and more private, both within the company and without, and more stable and more reliable. It's been very fun and educational, and as you can see, I'm a very serious person all the time. So again, like I said, this talk is going to be a few little short stories about things I've encountered in my career, things I've encountered in just the last few jobs.
I mostly work with tools and services that are written in Python, just because that's what I like to write, but again, at Twitter we're the largest Scala shop out there, so we also have lots of services running very fun and complicated Scala code, which has its own host of complications that I probably will not get into in this talk, mostly because this talk was written as a 30-minute talk. We're up here for an hour, right? Okay. This is what I encountered; you may encounter similar issues. Again, this is just to talk about the spirit of what we've run into. This is not a best-practices talk. It's not about the absolute right way of doing things, because what's right for Brian and what's right for Twitter may not be right for you and your company, but hopefully we can all still learn from each other. I learn a ton from other people when I go and see other talks and hear about their infrastructure, and the vast majority of the time I run back to my company and try to implement what I learned, and it's completely wrong, because it just does not fit our model. So instead it's always best to just draw inspiration. So again, I'm no expert. I'm just some dude up here trying to give a talk. Cool. What exactly is an SRE? I've heard a lot of people try to define this. Some say we're software engineers with a different focus, and that's somewhat true. We all write code. We all write code to do very specific things. We sometimes have specific products that we support. We have specific customer sets. For me, my customers are the two teams that I directly support in my day-to-day, but also their subsequent customers, because those teams just see me as another team member. That's how we like to do things with the embedded model at Twitter. We also kind of see the SRE org as a customer-service provider for the rest of Twitter.
So a lot of what we've been trying to do more and more is get together as an org, all of our distributed members, to try to make better services for our own internal company. The MySQL team, they're almost entirely SREs. There's core infrastructure, which manages things like our Puppet infra and a lot of other things for provisioning servers. They're almost entirely SREs as well. But then there are folks like me who are embedded with teams, and so we all try to collaborate to see: how can we help each other better at the company? So I don't know what your definition of SRE is, but that's kind of what I like to try to do. What we have also been doing is trying to get a better definition within the company of what it means to be an SRE. So some of what we've been coming around to is coming up with better preventative measures and trying to advertise where our specific focuses are, which is knowing more about what's going on outside of the product. So the teams that I support, they really know their stack. They really know what's going on: when that API call hits, exactly what system calls go through inside the Scala code, what they hit in the JVM, how efficient that call is, and what it spits back out. I certainly know a lot of that, but from a higher level, most of what I know is what's going on outside of that. How does it deploy? How does it interface with the databases? How does it interface with other downstreams? How do our upstreams interface with us, and what do they care about? So because we have my perspective, which varies from my team's perspective, we're able to look around our own applications a lot better, to kind of help make a better application for our overall team's customers. The other thing that I've been (I really need to update this slide) there was a great talk yesterday, in this room, and I cannot remember who was speaking, which talked a whole lot about how SRE work often helps with reducing toil. And that's just the grunt work of doing our jobs.
We're constantly rebooting yet another server, constantly re-running that one script, and only we have permission to run things like that. I finally convinced that one team member that I work with that I should not be the only one who can shell onto a host to look for a file, and it's been a wonderful thing. So now the only time he calls me in is when it actually is broken, and that's a great thing. So this is what I try to see as an SRE. So to really shorten it up, we also look more at the stability of the service. Where are our alerts? Where are the actual breaks and failures? Where are we slow? Is it within the app or is it within our downstream, and then how do we target that? Are we running lean enough? Do we have too many systems in place? Too many services in place? Too much compute? Are we costing our team money, or the company money? I'm adding this in about security: are we actually locked down? Are we properly using TLS? Are we properly encrypting all of our data at rest? Are we staying up to date with all the patches that are coming in from the security team? And then, are we properly scaled for the influx of traffic that's gonna come in from that next new feature that they turn up upstream? Are we ready for that? And then, are we gonna get woken up in the middle of the night? So that's the summary, before I've even really started the talk. The other thing, too, that a lot of people look at for SREs, and this is kind of what I've perceived as the branding of my teams, how other people often see us: we're the crash cart people. When things break, when success rate plummets, we're often the first ones on the scene, because we've done this before a thousand times. So there again comes the shift in tooling and perception to say, hey, instead of just being responsive to outages, how do we prevent them from happening to begin with? So that's what we try to do. So I've rambled on long enough, let's get to some stories.
So I'm looking at our Slack channel, and one of our customers comes in and says, hey. So the application that we support is a local daemon that runs on almost every box in the fleet and manages secrets fetched from an API that we control centrally. Without getting too far into details, because again, legal won't let me, we have this service that just makes sure that the TLS files and the database login passwords are mounted in a secure way so that only a service owner can access them on demand. It's a great service, we love it. So someone calls me up on Slack and says, hey, we can't access our files on this one host. The secrets aren't there. So what do we do? Go ahead, sysadmins, SREs, production engineers: you get this call, just shout it out. What would your first thought be when someone calls you up and says, hey, the thing's broken? Turn it off and on again. What else? What changes did you make recently? That's a great one. Look at the logs, exactly. These are all different things that we do. Anybody else have any fun ones you wanna throw out? Check if it's true. That also, absolutely. So often you go and you look and you say, no, it's actually just there. I've been the person saying, yeah, hey, it's broken, while looking at the wrong damn host the whole time. Of course, that's what we do. We look at the host, make sure it's true. We look at the logs. Eventually we say, okay, maybe something actually is broken. And eventually we may start looking at our own code base, and then we just keep going on to the next thing and the next thing. So I start looking at the logs and I see this. So the way that we make sure that our secrets are properly loaded on the box is we create on-the-fly FUSE mounts. Who's familiar with what FUSE does? Cool. So for those who didn't raise your hand, FUSE is an in-memory, user-space file system.
So it's not on disk, which means you can't pull the platter out and visually inspect the dots to find out what the contents are. And if you shut off the server, you lose it all, because it's all in memory. And it only exists for the user who's accessing the file and has permission to. So for us to actually be able to do this, we write our own code that loads the Python fuse library, which loads libfuse on the host, which is a shared-object library. And it says, hey, it can't find that library. So I go and I inspect the code and I say, what does this really mean? I go look for the fuse libraries and I say, oh, well, they're there. So why can't it find them? So what's going on here? So I take a moment and I say, is this really one host? This is our Hadoop fleet. They've got somewhere north of thousands of hosts in their fleet. So one won't kill them to reboot. So I ask them, is this the only one? Because that just seems really weird and really corner-casey. And they say, well, yeah, this is the only one. So I say, well, I've got this huge backlog of tickets. Why don't you just punt it, re-wipe it, and let's see if it comes back. Because problems at scale sometimes are just a one-off corner-case host, and so you can kind of get away with that. So you wipe it off and you move on, right? Well, it's not always as easy as it seems. So it turns out it wasn't just one host. It was a slightly larger but not significantly damaging subset of hosts, in the neighborhood of less than half a percent of the fleet, but it was still enough for us to say, okay, there's probably something systemic here. Let's take a further look. And I just said that. One of these days, I'm gonna get myself off of Google Slides so that I can actually have a preview of the next slide. Kind of cool. So we found it wasn't necessarily following a deploy or a code change. The hosts, other than all being within the same Hadoop cluster, had nothing apparent in common.
They weren't all on a common feature branch or Puppet branch running some sort of weird code that would do something odd. So we try to dig a little bit deeper. That's where I take a deep breath and say it's time to keep exploring. So back to that error. We say, what is really going on in this code here? So I dive a little bit deeper. And this is where I actually start learning some stuff. Again, I've been doing stuff in this space for like 11 years. I find myself learning something seemingly simple and remedial all the time. So this little snippet of code is in the fusepy library, where it tries every kind of system load for shared objects that it can do within the Python space. So that error was what was raised when it was looking for that specific value, which was calling that function, find_library. What does find_library do in Python? This, again, is continuing to learn. That's where, finally, for the first time in a decade of doing something in this stuff with Python, I actually got to this specific function in the ctypes library. Nobody's an expert, no matter what they claim to be, not even me. So what this is doing is it's looking for a shared object using common functions and common systems that are in your operating system. In this case, it was using ldconfig. ldconfig checks a whole bunch of folders on your system for shared objects and stores them in a cache. So, has anybody here ever had to deal with ldconfig specifically, or specifically pick out libraries in the ld cache? I see one hand, two, three, okay. Still a very small number of people for this crowd. That gives me hope for the future, that we're all still learning. And that's why I love coming to these conferences, because it helps me validate that I'm still stupid. Excellent. So we go and we dive a little bit deeper.
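To make that concrete, here is roughly what that lookup looks like from Python. The library names below are just illustrative, but this is the standard-library function in question:

```python
from ctypes.util import find_library

# find_library searches the system's shared-library machinery
# (on Linux, tools like ldconfig and gcc) for an unversioned name
# and returns the soname it resolves to, or None if nothing matches.
print(find_library("c"))     # e.g. "libc.so.6" on a glibc system
print(find_library("fuse"))  # "libfuse.so.2" where cached; None on the broken hosts
```

A library raising "can't find that library" when this call comes back None is consistent with what the logs showed, even with the .so files sitting on disk.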
Just on the broken host, I exercise the common Python calls and I see, yeah, there's nothing there, but on a good host, there's the shared object. So let's step a layer deeper and see what's going on at the OS level. I see the files are there. So the files are there; that means somewhere in between the files being there and ctypes finding them, something is broken. The package is installed, but ldconfig is missing a whole bunch of stuff on that broken host. I go and I repeat the exercise on the, I don't know, maybe dozen or so other broken hosts, and it's the exact same scenario. ldconfig was just not collecting everything on these hosts. Now, these are not my hosts. Well, again, that was our specific root cause for that instance. Now, this is where things get fun and interesting. So these are not my hosts. This fleet was on an older operating system that we're actively attempting to deprecate. I was also told something at one point about the Hadoop effect, which is where there are a lot of weird complications and a lot of weird corner cases that come up when you spin up a massive data store fleet. That's okay. So, again, it was just a small subset of the hosts, and the only thing that we could find in common about that group of hosts was that they were missing a bunch of stuff from ldconfig. So next comes the question: what do we do? We end up throwing them in the hole, wiping them off, and seeing if it comes back, which is what we did to begin with. Now, we went through the explorative process because we wanted to see what was in common. Was there something systemic? We couldn't necessarily find anything systemic. So sometimes, you know, a horse is a horse and a duck is a duck and a broken host is a broken host, and that's fine. This is one of the fun things that you get to do when you operate a large enough service: treat hosts as semi-disposable.
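Stepping back to that OS-level check for a moment, here is a small sketch of it from Python. It assumes a glibc Linux system where the ldconfig binary is available; the point is that on a healthy host the libfuse entries show up, while on the broken hosts the idea is that this list came back empty even though the .so files were on disk:

```python
import shutil
import subprocess

def ld_cache_entries(substring):
    """Return ld cache lines mentioning `substring`, or [] if ldconfig is unavailable."""
    # ldconfig often lives in /sbin, which may not be on PATH for normal users.
    ldconfig = shutil.which("ldconfig") or "/sbin/ldconfig"
    try:
        out = subprocess.run(
            [ldconfig, "-p"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    return [line.strip() for line in out.splitlines() if substring in line]

print(ld_cache_entries("libfuse"))
```

Running `ldconfig` as root (no `-p`) rebuilds the cache, which is one way hosts like these can be repaired short of a re-image.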
So we were able to just reformat the hosts and reinstall them with the same slowly-being-deprecated operating system, because that's where they were and they wanted to be homogeneous in their fleet, and that's cool. And the problem hasn't resurfaced since. So, yay, that's fun. All right, so some lessons learned from this. Log spelunking is generally gonna be the right place to start. Now, of course, you first ask some of the questions like, does this issue actually exist? But very soon you're gonna get pointed at logs. I also learned a whole lot about what goes on with shared objects and with ctypes on a host. This was all still new to me. And that's why I wanna come share it at this level: don't be afraid to ask the dumb questions. I had to ask the dumb questions, and now I'm sharing about it in public. This is great. So I learned all about how shared objects become available; I knew all about Python paths, but the ld cache had been totally alien to me, and now I get to interpret it. And then, of course, the common lesson that we have at scale: sometimes it's just as easy to throw the host away. But that's not always the right answer. So that was story number one. Story number two: we're gonna talk about TLS being silly. So we had, I'm trying to remember what the specific issue was now. The fun of not looking at a talk again for about half a year. So we had a new client that we were spinning up. We were turning up our services in the edge PoPs, which are specifically isolated. There are distinct firewall rules, so they can do very limited things. So we had to do some very custom stuff so that our client could still make the TLS connections back home to fetch secrets, to be able to serve the few things that ran out in the edge PoPs. But TLS was not terminating in our test environment.
So as far as we knew, everything in the code was working and everything with the actual TLS configs that we were loading was working. So we're also trying to turn up mutual TLS for all of our API calls, top to bottom, wherever possible. Still not quite so easy with Python, but we are in the act of fixing that, yay. Our primary API calls are all TLS, not necessarily mTLS. And what we're doing is, from our remote services, we have to go through a different service discovery path. And that's really where this particular TLS connection was happening: to the outside service discovery system. So my first thought is, well, since that outside API system really expects mTLS, let's try to set this up. So I start looking at exactly what you need to do using the requests library to set up your certificates. And I say, okay, I can do that. And then we say, okay, you gotta give it a verification path, and you put all this stuff into your requests call. And you say, okay, I can do that. Put all this stuff into your client. So here is where we actually set up our client for talking to our service discovery from outside of our primary data center, so that you can do service discovery from edge PoPs. So I add those few lines and those few fields in, to pick up our certificate information and set up the configuration fields, so that I could just put those paths in our config and send them off to the hosts. Because that's how we run our service: we put all the paths in the configs so we can change them when we need to. And then I set up my configs to load those certs as expected. Makes sense, right? So let's try it on a test host. Boom, SSL cert errors. I'm looking all through my code. I'm looking through how all my certs are set up, and I'm saying, okay, the certs are valid. They're in the right spot. And I try to connect, and it still belts out errors. I even step out of my code a little bit and try to do this using the standard OpenSSL tools.
So, wait, yeah, that's getting ahead of my slides here. So I try to set this up using the standard SSL libraries as well, from the CLI. I load all my certs. I'm able to establish a connection. Everything looks fantastic. But still, my actual API library is not working. So I keep walking, yeah, and there's where I actually show that I'm testing it. So before I actually get to my solution, I want to give just a little word of warning. Managing TLS certificates is not easy. Managing certificates at scale is really not easy. A significant portion of data breaches and issues with leaking information to the public often have to do with certificate management issues. There are of course cases where somebody just forgets to reset an admin password from admin/admin. But oftentimes, it's just because somebody didn't rotate a cert properly, or accidentally committed something to GitHub that should have been secret. So this I know is something that we at Twitter have been working our tails off to make sure we do correctly. And that's also why we run the managed service that we do, because it gives us a model for doing this professionally, programmatically, and all other kinds of good words. So, and that's just a quote from the article, yay. All right, so back to the issue at hand. So I verified that all my certs are on the host. I verified that they work. I verified that they work using the common OpenSSL tools. And when I just write up a raw Python client, I can load them up properly. So what's going on? So there were a couple of issues. First off, it was an undocumented feature that our outside service discovery system needed a whitelist. And it was never documented, because their customer list for that feature set was literally one person, and that was the person we were providing certificates for. So I can accept that. It's like, okay, no big deal, right? But let's take that a step further.
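For reference, the requests-library setup being described a moment ago looks roughly like this. The paths and URL are hypothetical, and the helper just assembles the keyword arguments that requests expects for a client certificate pair and server verification:

```python
def mtls_kwargs(cert_path, key_path, ca_bundle_path):
    """Build the TLS keyword arguments for a requests call.

    cert: the client certificate + key pair presented for mutual TLS.
    verify: the CA bundle used to verify the server's certificate.
    (Passing verify=False instead disables verification entirely,
    which is exactly the kind of silent fail-open this story hits later.)
    """
    return {
        "cert": (cert_path, key_path),
        "verify": ca_bundle_path,
    }

# Usage with requests (not executed here; URL and paths are made up):
#   import requests
#   requests.get("https://discovery.example.com/v1/lookup",
#                **mtls_kwargs("/etc/svc/client.pem",
#                              "/etc/svc/client.key",
#                              "/etc/svc/ca-bundle.pem"))
```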
So when I'm looking at how to do TLS, the basics are: I wanna know where to put the files, because as far as I'm concerned, from managing a host, they're files. They need to be accessed by a service. They need to be put here, owned by the right people. And then I need to drop the code on and make sure it works. So I started doing that. Dug a little bit further. Again, found everything worked, and found everything worked again. So, oh yeah, oh gosh, going back. So there was one issue in particular where, on my test hosts, the certificate name did not match the data center that I was coming from. You'll see at the end it says TY03. So this was a test cert that I had provisioned for a Tokyo PoP, but I was testing this from our primary data center. When I tried to match everything through our list of matching certificates, it wasn't in there. So eventually I figured out to put that in and set that up. But this gets even better. So when I added the verification fields and all that, everything was working just fine from the test host. Yeah, go back. Everything was working fine from our test host, as it was set up with the appropriate certificates. So our root cause ultimately was that our certificate name was not properly in our certificate chains. So if I had put my certificates out for security review, which my sister team, the InfoSec team, actually does, they would have caught this. They would have helped me figure this out sooner. So it's definitely important to reach out for help. And if I had learned more about what I needed to be doing with my certificates and gotten better at what I'm supposed to be doing, then I could have been in a better position. But again, we're all sitting here trying to learn stuff. So when I actually moved this off of the test host into production, the issue reared up again.
So when I picked up my service and threw it over into an actual edge PoP, nothing was connecting right, and the logs from my own API were not being very helpful. So as far as I could tell, the configs were working, the code was working, everything was consistent top to bottom. So what was going wrong? So back to this little snippet of code here. What went wrong was these lines of code here. So my config was hierarchical. You'll notice that I was referring to tls.certPath, but my config file says resolver.tls.certPath. So in my test environment, we were failing open the whole time. And that's why you test in as many ways as possible. So once I figured that out and we got the configs working right, then everything connected properly. I know I'm jumping ahead of myself here. I'm going the wrong way. So our bug here, like I said: the config code that we were writing was meant to inherit dictionary-like functions so we could walk through a config path and actually do a get. And the config values we were collecting were effectively absent once they were out there. So they were failing open. They were not really resolving the TLS paths at all, because they couldn't even find the values. We were just doing a get, and a get in Python returns None when the key is missing. So there were other typos that I found while I was solving that problem, which I am too embarrassed to put on this slide. So just because you think something's gonna look great once you put it into production doesn't actually mean it's really running great, because I went through a whole battery of tests before I found out that my config was a giant mess. So once we corrected all the paths, we started testing it out there on the remote side, and everything worked just fine. So it is still important for us as SREs to know what's going on inside our application, which I know goes against what I started my talk off with saying today.
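A minimal reconstruction of that fail-open, with illustrative key names and paths: a flat .get() on a dotted key silently returns None (no exception, no log line, and a cert path of None downstream meant skipping TLS), while a lookup that actually walks the hierarchy finds the value:

```python
config = {"resolver": {"tls": {"certPath": "/etc/svc/client.pem"}}}  # illustrative

# The bug: treating the nested config like a flat dict.
# .get() on a missing key just returns None, and the code fails open.
print(config.get("tls.certPath"))  # None

def deep_get(cfg, dotted_key, default=None):
    """Walk a nested dict by a dotted path, e.g. 'resolver.tls.certPath'."""
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

print(deep_get(config, "resolver.tls.certPath"))  # /etc/svc/client.pem
```

An even safer design is to raise, or at least log loudly, when a TLS path resolves to None, so the failure is closed instead of open.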
So, going back, I had default actions going on that made my test environment favorable. From inside our data center, we don't have to do TLS for service discovery. And I was able to expand our log entries to show when these kinds of TLS issues for service discovery come up, because the next time something out in an edge PoP is unable to terminate TLS, now we can figure it out sooner. So you do your best to try to line up your test environments, whether it's across different environments or whatever, to figure out what can go wrong. Try to add a reasonable amount of logging to your code; too much and you'll end up slogging through it and not finding anything. And then, even after you get all of your diffs committed and get everybody to review them, just double-check everything, because I feel like I probably would have caught some of this stuff if I had been more thorough about looking at my own code, and I could have saved myself hours of troubleshooting later. So, in closing, with still plenty of time left for us all to have a nice long break, and second lunch if we want it. All right: not everyone knows everything. Again, I learned a ton of stuff about how to manage TLS in production and how to manage a Linux host in production all along the way. I picked up all these unique bits of tribal knowledge just from talking to the service owners that I was partnering with to get things working right. And being able to have those conversations was critical. And as much as I could, everything that I implemented, I tried to write down in our corporate wikis; that way it's discoverable by other teams, which is important. Not everything is always gonna be what it seems. That's also why logging explicitly, everything that you can, is important, and having good logging functions. We're adopting Splunk now at Twitter, and it's been amazing; we absolutely love it, because it helps us get better insights into what's going on in our logs.
So even if we have huge amounts of logs going over the fence, it's still not as big of a deal. Environments are different. So what looks one way in one data center or one working environment, as soon as you drop it over the fence into the other environment, things could still totally misbehave, which is why you don't ever wanna just say, it works over in Jenkins, and assume that it's good enough to throw into production. And code can express itself differently in those different environments. So test on all of the different operating systems that you run in production. Test on all of your different flavors of Python or Java or whatever your local interpreter is. And just keep trying, just keep digging. You're not gonna know exactly everything about what you're looking at, but what you do know, you can still lean on to dive deeper and figure out what's actually going on. And we've got tons of time for questions and heckles. So thank you. And we've got a mic to throw around if anybody wants to ask anything. Or if not, I'm sure there's a bar open across the street. Well, I'm happy to hang around for a little while longer if anybody just wants to come up and say hi. Check, check. Well, thanks for sticking around for the last talk of the whole thing. I appreciate it. Appreciate you being here. My name's Alan Ott and I work for SoftIron, and I'm gonna talk to you today about software-defined storage with Ceph. Is it too loud in here? Perfect, okay. So a little bit about me first. I work in platform software at SoftIron. At SoftIron we make data center appliances for storage, transcoding, and lots of other fun stuff. Stay tuned for that. We're currently shipping HyperDrive storage clusters running Ceph. So these are already going out the door, and they use the software that I'm gonna talk about today. I've done work in open source in the Linux kernel. I write firmware. I've given training.
I've done a lot of work in USB, and one of the things I do is I created and maintain M-Stack, which is a USB stack for PIC microcontrollers. And I've also done work in 802.15.4 wireless in the kernel, but it's been some years since that. So let's talk about software-defined storage and what it is first. So really, when we talk about storage, there's really only one question, and the question is: how do we not lose our data, right? So we're in a world where everything's going to fail at some point: hard drives fail, whole servers fail, RAID controllers fail and take out all the drives that are attached to them. And where hardware is mostly crap, but sometimes it's maybe not. Where we have a race to the bottom in manufacturing, because we wanna be the cheapest thing on Newegg or wherever it is that you're buying your hardware from. We've got closed-source firmware that comes from, well, who knows where, right? I mean, sometimes from China, sometimes from the United States, you know, who knows what's buried in there. Closed-source firmware on your RAID controller, that's a little scary, maybe. Maybe in the hard drive controller, also equally scary. And the SSD controllers, right? I mean, this is where we've maybe seen the most issues, right? So with all the wear leveling and all the patented algorithms and all the complexity that goes into making NAND flash something that's actually reliable, sometimes that stuff gets wedged, and your data is wedged along with it. So in that kind of world, what's the answer for storing data? And the answer is software, right? So what we're gonna do is we're gonna decouple all of the critical logic in storing your data from the hardware, right? So we're gonna make data integrity not dependent on hardware integrity. So we can put in hardware that may fail, and we can run software on top of it that'll just handle all of those failures, right? So in that case, we'll have no single points of failure, right?
So your RAID controller, that's a single point of failure, right? We don't wanna have that. And we wanna have a system that's resilient against hardware failure. So redundancy of disks, so data redundancy, right? Redundancy of servers, so we have clusters of storage servers. And we want something that's gonna be self-healing as well, right? So when hard drives fail, we want the cluster to recover on its own without manual intervention. And of course, we also wanna be able to migrate to different hardware at some point, when we decide that we don't like the hardware that we have. So the extended answer is not just software, but software you can trust, right? So for the people in this room at this conference, that means free and open-source software, right? Software where we can inspect the source code. Nobody actually does, right? Anybody in here actually inspect their source code? So, like OpenSSL, you know how it works top to bottom, and there's no problem at all, right? So that's an important thing, right? It's not just good enough that you have access to the source code; you need to be able to trust the source code. And how do you trust the source code? Well, you can't inspect every piece of source code that you've ever received, that you ever run. But if you have source code that's developed by a vibrant community, right? Where you know people in the community, you know people on the mailing list, where you've met people at conferences. Maybe none of those people have seen all of the source code, right? I mean, who can read and understand every line of code that's in the Linux kernel, right? I mean, nobody. But you can know enough people who know enough parts of it that you can build a level of trust, right? You can't just assume that there are many eyes on the code. You wanna know who those eyes are, right? And who those people are that are working on it. So you want something that's developed in a community way.
You want something that's also widely deployed, right? Something that's well-tested and proven, right? And you want something that's scalable, ultimately, because our cell phones take higher-resolution pictures every year, right? And we need to store more and more data. So when hardware fails, software needs to have you covered. So backing up a little bit, right, and going a little higher level: we have three classes of IT infrastructure components, right? Compute, networking, and storage. Well, on the compute side, this is a solved problem, right? We run Linux as our OS. Maybe some people use BSD, I don't know about that, but we have solutions for that and we know what we're doing. Networking, well, stay tuned. But in storage, this is also a solved problem that you might not know about yet, and the solved problem here is software-defined storage running Ceph. So what is software-defined storage? Software-defined storage is storage that's going to be spread across multiple computers and hard drives. So a cluster, we call it, of storage. It's completely controlled by software, right? So there are no hardware controllers managing the redundancy or anything like that. All the redundancy and all of that kind of stuff is managed in user space completely, right? So there are no kernel components either, in the cluster itself. So there's not any trickery. I mean, you have all the protections of running in user space that the kernel gives you. And you want something that's going to present itself to the clients as a single resource. So you have a cluster, it's a lot of computers, but it's one logical resource, right? You have software that's treating that whole thing as one logical resource. So many computers, many hard drives, single logical resource. Or if you want multiple logical resources, it's up to you. But you don't have to do all of the maintenance and all of that kind of stuff yourself. So, fault tolerance and resilience.
You want to be fault-tolerant against server failure and drive failure, and you want real-time automatic recovery. So self-healing, like we talked about. We want no downtime for maintenance. When hard drives or whole servers fail, we want the cluster to do the right thing and be able to run maybe in a degraded mode for a while or something like that, but stay up, right? Still serving whatever it is that we need to serve. Another thing is replacing hard drives, either as they go bad or waiting until it's convenient. If you have a cluster that's big enough, when you start doing division on the mean time between failures for each hard drive, hard drive failure isn't a contingency, it's a maintenance item. If you have several racks full of hard drives, every week, every month, you're gonna have a certain number of drives fail. And that needs to be just something that is handled automatically, so you don't have to run into the data center all the time and replace each one, right? You want them to run degraded, and then at the end of the week, you want to send an intern in there with a crash cart, right? And just replace all of them at once. So in software-defined storage, you know, we've been doing this long enough that three main interfaces have kind of emerged as the interfaces to the client. And we'll talk about all three of those things and how to use them. So the first is object. So object is typically the native format of a software-defined storage system. It's useful for OpenStack and object-aware clients. What does this mean, object? Well, it's basically a data store that's storing key-value pairs, right? So NoSQL, maybe, is one way to think about it. You know, you have a key and you have a value that goes with it. You set the value for that key, you read the value for that key, and it comes back, and that's it. All just very simple. So that's kind of the basic level of data storage. On top of that, we have block.
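Before moving on, the object semantics really are that simple: set a value for a key, read it back. Here is a toy, dict-backed sketch of just those semantics (against a real Ceph cluster you would go through a client library such as python-rados rather than a toy class, but the shape of the interface is the same):

```python
class ToyObjectStore:
    """Dict-backed stand-in for an object store: opaque byte blobs stored by key."""

    def __init__(self):
        self._objects = {}

    def write(self, key, data):
        # Whole-object overwrite semantics: the new value replaces the old.
        self._objects[key] = data

    def read(self, key):
        return self._objects[key]

store = ToyObjectStore()
store.write("cat-photo-001", b"\x89PNG...")  # any blob of bytes
print(store.read("cat-photo-001"))
```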
And so block is basically a virtual block device. A block device on Linux and Unix systems is how a hard drive shows up, right? Hard drives are implemented as block devices. On a software-defined storage system, we can create virtual block devices. These are suitable for attaching to a client machine, or they can be used as virtual hard drives for VM environments. So you can have a VM server serving up all of these VMs, and the hard drives for those virtual machines all come off of your software-defined storage cluster. For block data, the client manages the file system, so it can be any file system. You could put XFS, ext4, whatever on there. Of course, you'll have to format it first, but the file system is managed by the client. The third type of interface is file. File is a network file system, similar to NFS but with true high availability. How file differs from block is that multiple clients can mount the same file system at the same time, much like NFS. We'll talk about how file is better with software-defined storage than with NFS. And of course we have to have redundancy: all data needs to be stored in a redundant manner. That's a requirement for fault tolerance against hardware failure. If a hard drive fails, then in order to still have the data, it has to be somewhere else too. So we'll talk about the different types of redundancy as we go, replication and erasure coding, and we'll talk in detail about how those work. So let's talk about Ceph now. Ceph is a free and open source software-defined storage platform. It started as part of a university research project by Sage Weil in 2006, and was further developed at Inktank Storage by Sage and his team. That company was eventually purchased by Red Hat, which itself was eventually purchased by IBM, right? So Ceph was designed from the beginning to run on commodity hardware.
So that includes servers, hard drives, and of course Ethernet, all the basic stuff. It's also released under the LGPL license, so you really have maximum freedom with it. So let's talk about some of the terminology in Ceph. The first term is RADOS, which is a high-level description of Ceph's architecture. It's a term that's a bit overloaded, like many terms in Ceph; there's not always a lot of discipline about vocabulary, but that's okay, I think most projects end up that way. RADOS stands for Reliable Autonomic Distributed Object Store, and really, it is Ceph's architecture; we'll talk about what that architecture is as we go along. The other term is CRUSH, the CRUSH algorithm: Controlled Replication Under Scalable Hashing. This is the algorithm, and the map it produces, that determines how data is placed across the nodes of a cluster. A client writes data to the cluster, writing a value for a key; it doesn't care which hard drive or which server the data lands on. That's all determined by an algorithm, and that algorithm is called CRUSH. We'll talk about how it works. CRUSH is what allows the cluster to scale, rebalance, and recover, and again, we'll cover this more. So, Ceph releases are named alphabetically. Like many open source projects these days, there's a number and a name, and as we know, all the cool kids use the name; sometimes they only use half of the name just to show how with it they are. With Ceph, though, it's interesting, because the version numbers are somewhat nonsensical to my mind, so everybody pretty much just refers to the releases by name. Some recent releases, all named after sea creatures: Jewel, Kraken, Luminous, Mimic, and Nautilus. We can see some of the features that were added in each one, and we'll go through those.
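To make the CRUSH idea concrete before moving on, here is a toy placement function. This is not the real CRUSH algorithm, just rendezvous-style hashing I've written for illustration, but it shows the key property: placement is computed from a hash, so every client holding the same OSD list agrees on where an object lives, with no central lookup table.

```python
import hashlib

def place(obj, osds, replicas=3):
    # Rank every OSD by a hash of (object name, OSD id) and take the top
    # `replicas`. Any client with the same OSD list computes the same
    # answer, so no central metadata server is consulted per object.
    ranked = sorted(
        osds,
        key=lambda osd: hashlib.sha256(f"{obj}:{osd}".encode()).hexdigest(),
    )
    return ranked[:replicas]

osds = [f"osd.{i}" for i in range(6)]
primary, *rest = place("photo-1234", osds)

# Deterministic: every client agrees on the placement.
assert place("photo-1234", osds) == [primary, *rest]

# Losing an OSD changes the map: the data that lived on it gets a new home.
survivors = [o for o in osds if o != primary]
assert primary not in place("photo-1234", survivors)
```

The real CRUSH additionally understands failure domains (hosts, racks, rows) and weights, but the "compute, don't look up" idea is the same.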
So Mimic is the current latest stable release, and Nautilus is in development. But there are still a lot of deployments running on older versions, and there's not necessarily any good reason for that, because it's possible to upgrade from one release to the next without losing your data. You can even upgrade individual servers one at a time as you go through your cluster, so you don't need downtime to upgrade a major release. Now let's talk about each client interface. Multiple interfaces to the client: object, which is typically just called RADOS in Ceph; block, which is called RBD, the RADOS Block Device; and file, which is more simply called CephFS. You'll see these terms used somewhat interchangeably. A single storage cluster can provide all three of these interfaces at the same time. That means you can use a single cluster to implement your file storage, the disks for your VM server, and also your OpenStack backend with object storage and whatever else, all at the same time. This is one of the real points of value that Ceph brings: it supports object, block, and file, all from the same software, all from the same cluster, running at the same time. So let's talk about object, or RADOS. Object, like I said, is a simple key-value data store. To use RADOS, what you typically do is link against the librados library, which has bindings for C, C++, and Python, and write programs that use this library to talk directly to a Ceph cluster. You store data in a pool in whichever way makes sense for your application; it's your application, you figure it out, right? When you do that, your backend becomes Ceph-specific, which may be good or bad depending on your point of view, but one of the advantages is that you're using the fastest method there is to access the data.
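The librados call pattern he's describing looks roughly like the sketch below. The `MockCluster` and `MockIoctx` classes are a dict-backed stand-in I've added so the example runs without a cluster; the method names (`connect`, `open_ioctx`, `write_full`, `read`) mirror the real `rados` Python binding, which you would construct with `rados.Rados(conffile='/etc/ceph/ceph.conf')`.

```python
class MockIoctx:
    """Dict-backed stand-in for a librados I/O context (one pool)."""
    def __init__(self):
        self._objects = {}

    def write_full(self, key, data):
        # The real call ships `data` to the cluster over the network.
        self._objects[key] = data

    def read(self, key):
        return self._objects[key]


class MockCluster:
    """Stand-in for rados.Rados(conffile='/etc/ceph/ceph.conf')."""
    def __init__(self):
        self._pools = {}

    def connect(self):
        pass  # the real call opens the connection to the monitors

    def open_ioctx(self, pool_name):
        return self._pools.setdefault(pool_name, MockIoctx())


cluster = MockCluster()
cluster.connect()
ioctx = cluster.open_ioctx("app-pool")           # typically one pool per application
ioctx.write_full("photo:1234", b"jpeg bytes")    # set the value for a key
assert ioctx.read("photo:1234") == b"jpeg bytes" # read it back
```

The pool name and object key here are made up; the point is just how thin the key-value interface is.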
So it's the lowest latency, it's the least load on the cluster, and you get the most client throughput by using librados. librados is great for high-performance computing applications, or custom applications that need the most performance. Typically you'll have one or more pools per application; we'll talk about what pools are. Next is block, RBD. An RBD is an exported virtual block device, like a virtual hard drive. RBDs can be used natively by Linux: the Linux kernel has a driver that can attach one of these devices, and after that you create a file system on it, mount it, and use it just like a regular hard drive attached to your system, much like a physical hard disk. The client's kernel is what manages the file system on it, so the file system can be anything supported by your kernel that meets your use case. Block devices can also be accessed directly through librbd, another library that comes with Ceph. librbd is a client library that can access these RBD disks directly from the client's user space, which means no kernel mounting is necessary, and that makes it good for VM servers. For example, QEMU has a Ceph backend which uses librbd to talk to these disks directly. Now you've got user-space-to-user-space communication; you don't have to attach any of the disks to your running kernel, and you get the best performance that way for your VMs. Anything that uses QEMU or librbd, so OpenStack, CloudStack, and others, can make use of that. So that's a really nice thing. The third interface is file, CephFS. File is a POSIX-compliant file system, very similar to NFS in that you mount your CephFS file system.
Right on the command line, you run mount -t ceph, give it the monitor IP addresses and your credentials, and you mount it and talk to it just like any other file system. The benefit of using CephFS instead of something like NFS is that there's no single choke point. If you've ever used NFS and tried to deploy it in a highly available way, you know the protocol itself was not designed for high availability. It wasn't designed for any kind of redundancy; it was designed when servers were big and expensive, and when the server went down, you had much bigger problems than your NFS. With CephFS, like I said, no gyrations or hacks are required to get high availability. And as before, multiple clients can mount the same file system, like NFS, and they can all use it at the same time. Once mounted, the fact that you're on a storage cluster is completely transparent to clients. This is really good for generic data storage, for backup, or for software that isn't Ceph-aware, or isn't aware of any particular storage system; it just uses files, so we'll give it files. You can alternatively mount it with FUSE from user space; of course, that's not going to be quite as fast. CephFS was really the long-awaited feature, and it became stable in Jewel in 2016. This is what the project was always headed toward, right? It's the thing that everybody needs on some level. Object is great, block is great if you're writing your own software, but you're not always writing your own software. So CephFS was the real killer feature for Ceph. There are still some limitations, though. For example, only a single file system per cluster is supported. When I say supported, I mean that multiple file systems are there and they work, but the feature is marked experimental.
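For reference, the kernel-client mount he describes really is a one-liner; the monitor address, user name, and secret below are placeholders, not working values:

```shell
# Kernel-client mount of CephFS (addresses and credentials are examples only)
sudo mount -t ceph 192.168.1.10:6789:/ /mnt/cephfs \
    -o name=admin,secret=AQB...example...==

# Or, with no kernel driver needed (slower), the FUSE client:
sudo ceph-fuse /mnt/cephfs
```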
The documentation says there aren't any known bugs, but still, use it at your own risk. And I bring this up as evidence of the discipline in the Ceph project: if something is going to be marked stable, it's going to be really stable, something you can use, something that's been proven. You can work around the single-file-system limit: you can use path-based authorization to partition off clients. If you have multiple clients that want to use CephFS, you can separate them out by directory and give each one access to its own area. So only having a single file system is something that can be worked around. Another thing is that there are quotas, but quotas are cooperative, which means they're implemented on the client side. A malicious client, for example, could get around quotas. So it's something to keep in mind in your data center: how much do you trust the clients? Who are the clients? Do you control the clients? Things like that. Quotas also require a 4.17 or newer client kernel. And one thing I didn't say: much like for RBD, there is a kernel module that mounts these Ceph file systems directly, just like you would any other type of network file system. So let's talk about Ceph architecture: the hardware architecture, then the software architecture, then the data architecture. From a hardware perspective, it's very simple: multiple computers networked together, typically a minimum of four; we'll talk about what those four are later. And typically you have three networks. There's a public network, often also called the front-side network, and this is used to communicate directly with the clients.
So if you're a client storing data on a Ceph cluster, you talk directly over this front-side network to all of the nodes of the cluster. You don't go through one node that distributes to the others; you talk directly to the servers that have the disks, the OSDs (we'll talk about those) where the data lives. Then there's a private, or back-side, network used for cluster-internal operations. The storage nodes all talk on this network; the clients don't. It's used for replication, for rebalancing, and for recovery. And then typically you also have an out-of-band management network, an IPMI network where you talk to the BMC. That one is optional, of course, but it's very convenient. The private, back-side network is also optional: you can do everything with one network if you want to, and Ceph will do all of its communication over the front-side network. So if you're network-constrained, or you run out of ports, in a pinch you can go down to a single network. So let's define another commonly used Ceph term: OSD, object storage device. It's a somewhat overloaded term, again. At a high level, we'll use OSD to refer to what is basically a single disk. This can be different in certain situations, but for now, let's just call an OSD one disk. In addition to OSDs, we have OSD nodes, or storage nodes. At SoftIron, we just call these storage nodes, not OSD nodes, to eliminate some of the confusion. This is a computer with lots of disks in it; our storage node has many OSDs. In addition to that, we have management nodes.
So management nodes typically have no storage hard drives; they run the monitor and administration software. Let's talk about what this software is. There's a handful of services that run across the various machines of a Ceph cluster. The first is ceph-osd. ceph-osd manages an OSD, and there's one instance of the process per OSD, so typically per disk. The OSD daemon listens on the network, communicates with the clients, and sends and receives their data. It performs the actual reads and writes to the physical disk, and it also manages replication. The meat of where most of your CPU time is spent on a Ceph cluster is in ceph-osd; it does the bulk of the actual work. In addition to ceph-osd, we have ceph-mon, the monitor process. The monitor manages the CRUSH map: again, the algorithm, and the map generated by that algorithm, that says which pieces of data go on which OSDs across the cluster. The CRUSH map changes whenever the cluster changes: whenever you lose a drive or add a drive, the CRUSH map has to change, and the monitor process is what manages this change. It also communicates the CRUSH map to the clients. In a Ceph cluster you must have either one monitor or more than two. You typically want more than one, because with just one you've got a choke point, a single point of failure. But you need more than two, because in a situation where monitors become out of sync, they need a majority to vote on which one is out of sync. So you really want either one or three, and three is the typical number that you want to have. Another service is ceph-mgr, the manager: it collects the state and the statistics of the whole cluster in one place, and only one manager is actually required. Then there's ceph-mds.
So this is the metadata server for CephFS. If you're going to have a CephFS pool, you need a ceph-mds process. It manages the mapping from the POSIX semantics of the CephFS file system to the objects in the object store; it's what converts paths and file names and all of that into objects that actually get stored in the cluster. You only need one of those if you're going to use CephFS. Another thing you'll see is the RADOS Gateway. The RADOS Gateway implements a gateway, a translation, between the S3 and Swift APIs and RADOS. If you have applications already coded against S3 or Swift, you just turn on the RADOS Gateway, talk directly to it, and now you're using Ceph without much effort. This is of course optional, only needed if you're using S3 or Swift applications. So how is the data architected? Data on a Ceph cluster is organized into pools. Pools are essentially logical partitions for storing objects. You'll have multiple pools per cluster, and admins can create or delete pools at runtime. A pool is basically a place where the data of one application lives: typically, you'll have a pool per application. There's typically one pool that holds all of the RBDs, which is pretty easy to set up, and for CephFS you'll have a data pool and a metadata pool that together hold the entire file system. Each pool has its own storage profile, and we'll talk about storage profiles later: replication and erasure coding. So you can support completely different classes of data on the same cluster with different pools, and this is one of the benefits of Ceph. Now, placement groups. Anybody have the fun of dealing with placement groups yet? A placement group is a logical mapping to a set of physical OSDs where an object can be stored. And I'm not going to talk about placement groups; this would be a great topic if we had another hour.
But I would point you to the documentation on the Ceph website about placement groups. There's a very long document on what placement groups are and how to compute the number of placement groups you need. Suffice it to say that placement groups are, unfortunately, an implementation detail that has made its way all the way up to being user-facing. In subsequent releases, placement groups are going to be better handled, and eventually they'll be handled automatically, but right now they're still something you have to deal with. We just don't have time for all of that today. So let's talk about the two storage back-ends that are supported; you might see references to these. Ceph supports two storage back-ends. One of them is FileStore. This is the older back-end, and it stores data on the OSDs in an XFS file system, as XFS files. It requires a journal drive for best performance. FileStore is the old one, and everything is now moving toward BlueStore. BlueStore is the new back-end. It writes data to the hard drive directly: there's no intermediate file system; it writes straight to the block device of the OSD disk. There's still an SSD recommended for the write-ahead log and for the RocksDB metadata database, if you're using spinning hard drives. If you're not using spinning hard drives, you don't need an additional SSD, of course. So with BlueStore, and this is one of the edge cases in saying an OSD is one hard drive: really, it's one hard drive, and if you're using BlueStore on a spinning disk, you also want a little piece of an SSD along with it. That gives you the best performance. You don't have to have it, and it performs nearly as well in many workloads without it, but it is still recommended. So BlueStore is faster, and it's recommended over FileStore.
So for any new Ceph deployment, you definitely want to use BlueStore, but you might see FileStore on older clusters. As far as drives go, there's always the consideration: do I want SSDs or spinning hard drives? SSDs are, of course, faster, a lot more expensive, and consume less power. Hard drives are cheap and dense and use a bit more power. But if you use hard drives, you're still going to want an SSD in the system for the write-ahead log and the RocksDB database. On our SoftIron clusters that use spinning drives, we have 12 hard drives in one chassis and two SSDs to go along with them, and that's a ratio that has served us pretty well. One SSD can host the write-ahead log and database for several hard drives: in our case, 12 spinning drives total and two SSDs, so each SSD serves six drives. Also, it's best to use the same type of SSD and the same type of hard drive throughout an entire cluster. You can get creative with JBOD-type configurations, but it's just not going to work as well, and you're going to spend a lot more time scratching your head. So if you're building a cluster from scratch, put the same type of hard drive all the way through it, the same type of SSD, the same sizes, all of that. So let's talk about storage profiles now. Ceph supports two different storage profiles. So, yeah, question. You mentioned in one slide that there's no kernel component, and subsequently you said there is a kernel component; which part of this architecture needs a kernel component? Right. So the question is: in one slide I said there's no kernel component, and in another I said there is. The answer is that there's no kernel component on the cluster side.
On the storage cluster itself, you don't need any special drivers or anything like that to implement a Ceph cluster. All those Ceph processes we talked about, the monitors and the OSDs and the rest, run purely in user space. On the client side, though, if you want to mount a Ceph file system, CephFS, that of course needs a kernel driver, the same way NFS does, the same way mounting an actual physical disk requires a driver. There is a driver for mounting the cluster's virtual hard drives or file system. Does that make sense? So, two storage profiles. The first is replicated; the second is erasure-coded. The first storage profile, replicated: replicated pools are very easy. To get redundancy with a replicated pool, you just store multiple complete copies of each object the cluster is storing. Multiple complete copies: that's your redundancy. It's very easy to implement, very easy to understand, and it works really well. In the typical case, three is the standard, and it's also the default. That means each piece of data on the cluster is stored three times, on three different OSDs, and in fact on three different OSDs on three different servers by default. That's all adjustable; we'll talk a little more about it in a minute. This implies that with triple replication, your cluster can store one third of the raw capacity of its physical disks. If you have a one-petabyte cluster and you're using triple replication on all of your pools, you can store essentially a third of a petabyte. But that's the price you pay for redundancy. Three-way replication like this is fault tolerant: you would have to lose all three OSDs holding the same data at the same time in order to actually lose any data.
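The capacity arithmetic in the petabyte example is just division; a quick sketch:

```python
def usable_capacity(raw_bytes, replicas):
    # With N-way replication every byte is stored N times,
    # so usable capacity is raw capacity divided by N.
    return raw_bytes / replicas

PB = 10**15
# The "one petabyte cluster, triple replication" example: about a third usable.
assert usable_capacity(1 * PB, 3) == PB / 3
```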
And like I said, Ceph by default stores each copy on a separate OSD, and those OSDs on separate nodes, which guards against whole-node failure. If you had a system that stored the three copies on three different OSDs that were all in the same box and you lost the whole box, well, now you've lost your data; your redundancy didn't really help you. These failure domains are configurable, too. In addition to requiring a separate box, you can require a separate rack, a separate row, a separate data center, and so on. If you had a cluster big enough, you could tell it: these boxes are part of this rack, these racks are part of this row, these rows are in this data center, and so on, and then set up policy on your replicated pools that says: I want a copy in this data center and another copy in that one. Whatever kind of replication you want, whatever kind of redundancy you want, and of course are willing to pay for, with multiple data centers and high-speed interconnects between them, you can configure Ceph to give you. So, for replicated pools, what happens when you write? We're going to talk about writing, reading, and recovery. When a client writes to a replicated pool, the first thing it does is consult the CRUSH map to figure out where it needs to write. Then it makes one write, to one OSD, over the public network; it only writes once on the public network. That OSD process writes the data to the disk it's managing, and then replicates the data to the other OSDs in the same placement group over the back-side network. So with triple replication, your client writes one time to the first OSD, and that OSD replicates to the other two OSDs.
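The write path just described can be sketched as simple byte accounting (a toy model, ignoring protocol overhead):

```python
def replicated_write_traffic(size, replicas=3):
    # Client sends the object once over the public (front-side) network;
    # the primary OSD then sends one copy to each of the other replicas
    # over the private (back-side) network.
    front = size
    back = size * (replicas - 1)
    return front, back

front, back = replicated_write_traffic(4 * 1024**2)  # a 4 MiB write
assert back == 2 * front  # twice the traffic on the back-side network
```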
So that means that for writes with three-way replication, twice the work is done on the back-side network as on the front-side network: you send the data once, and it's replicated twice on the back side. Yep? So the question is: is the back-side replication done synchronously or asynchronously? That's configurable. By default, it's synchronous; everything in a Ceph cluster is designed to preserve data integrity. So by default, the primary writes to its own disk first, and as soon as that's done, it writes to the other disks, and you don't get anything back until the data is written and committed to all three disks. That's right. And triple replication being the default is the reason the minimum cluster size is three, and we're talking about production here. You can make a Ceph cluster with one node, but then you don't really have any of the fault-tolerance benefits. Question: if you lose one node, do writes hang because you don't have a third node to write to? So the question is: if you lose one of the nodes and you try to write, are the writes going to hang if you don't have three separate storage nodes to write to? The answer is no, but your cluster goes into a degraded mode. It'll be warning you; it'll say: hey, we're writing data and doing the best we can, but we're not able to meet the minimum constraints you've specified. Next question: is there a timeout for noticing that a server has failed, something like that? Yes: if you try writing to a server and it doesn't respond, the cluster notices pretty quickly, marks that server as offline, and tries to create additional replicas on the servers that do exist.
And if you only had three servers and you're down to two, now you're in the degraded mode of: hey, we can't even get this onto three different servers, but it'll still work. It'll keep chugging along, and your management tools will basically show you a warning. Next question: during a rebuild, how much are resources decreased? Well, it depends on the size of your cluster. We talk about scale-out, right? The bigger your cluster is, the better everything works, and rebuild is one of those things: you have a lot more bandwidth, front side and back side, to work with. So the answer, as always, is: it depends. In addition to that, rebuilds happen at a much slower rate than reads and writes to the cluster, because the cluster wants to keep the client-side network operating as fast as it can. Rebuilds just happen more slowly, and you can adjust how much more slowly: if you want things to rebuild fast, you can do that at the expense of bandwidth on the public side. So there are a lot of knobs you can turn, sliders maybe, depending on your UI, I guess. All right, let's talk about reading. In a replicated pool, reading is much easier than writing, because you only have to read from one stored copy. You've got three copies, but you only have to read one of them. So you basically communicate with one OSD, the one that has the data, and read it straight off of there. The client reads directly over the public, front-side network. And reads will typically also be split across multiple OSDs: if you're reading files on CephFS or something like that, those are spread across lots of different OSDs, which increases parallelism, and increasing parallelism in this case increases performance.
And of course, because of this, for replicated pools the back-side network is not used at all for reading. It's kind of fun to do performance testing and watch something like iftop on your networks; you'll really see the back-side network sit completely idle. The third case is recovery. Well, recovery is pretty easy in a replicated pool: whenever an OSD is lost, after the CRUSH map is recomputed, data is simply re-replicated from a copy that already exists. If you have three copies and you lose one, after the CRUSH map is recomputed, the cluster just copies one of the remaining two onto another OSD, and now you're back up to three. No drama. So replicated pools are really good for general-purpose, read-more-than-write use cases: writes are kind of slow, since you have to write three times, and reads are fast. So let's talk about erasure coding, the other type of storage profile. Erasure coding, in the general sense, is a method of encoding data so that it's resilient against erasure of part of the data. It's commonly used in RAID 5 and beyond, and it uses the Reed-Solomon algorithm. The way it works is that data is split evenly into K of what we call shards. So a block of data gets split K ways, and in addition, we mathematically generate M additional shards of parity data. All K plus M shards are stored, and the original data can be regenerated, recovered, as long as you have any K of the shards. If you lose some of the original shards, you can take any K of the surviving shards, data or parity, run the math in reverse, and recover the original data. This means you can lose up to M of the total shards and still be able to recover the data. So how is this data stored?
Well, each shard is stored on a separate OSD. Of course, you don't want to lose one OSD and lose a whole bunch of shards at once, or the whole recovery scheme is useless; so the shards go on separate OSDs. Performance-wise, erasure coding is actually kind of interesting. Generating the M additional shards is computationally expensive, but for typical erasure coding profiles, and we'll talk about some of those in a minute, less total data ends up being stored on the cluster, and that means you end up with better write performance: you do a bit more CPU work, but much less disk work, so your write throughput goes up. What about reads? Let's talk first about the best case, which is also the typical case. It's not often that the best case is the typical one, but here it is: the case where all K original data shards are available. That's a healthy cluster, which, statistically, is almost all of the time. In that case, you don't have to do any math; all you have to do is read the original shards, assemble them, and send them back to the client. However, there's still more latency involved, because the shards are spread across multiple OSDs: the first OSD you ask for the block has to request all of the shards from the others over the back-side network, assemble them, and then send the result back to the client. So it's a higher-latency operation: you go to one OSD, that OSD has to ask the others, and then the data comes back to the client.
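Here is a runnable miniature of the encode-and-recover idea, using a single XOR parity shard, that is, the M=1 special case. Ceph's actual erasure-code plugins use Reed-Solomon coding, which generalizes this to arbitrary M; the XOR version just makes the mechanics visible in a few lines.

```python
from functools import reduce

def xor(a, b):
    # Bytewise XOR of two equal-length shards.
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k):
    # Pad so the data splits evenly, cut it into k data shards,
    # then append one XOR parity shard (the m=1 case).
    data = data.ljust(-(-len(data) // k) * k, b"\0")
    size = len(data) // k
    shards = [data[i * size:(i + 1) * size] for i in range(k)]
    shards.append(reduce(xor, shards))  # parity shard
    return shards

def recover(shards, lost):
    # With one parity shard, any single missing shard is simply
    # the XOR of all of the others.
    return reduce(xor, (s for i, s in enumerate(shards) if i != lost))

shards = encode(b"hello ceph!!", k=4)        # 4 data shards + 1 parity shard
assert recover(shards, lost=2) == shards[2]  # rebuild a lost data shard
```

Losing any one of the five shards, data or parity, is recoverable here; with Reed-Solomon and M parity shards, any M losses are.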
You're also going to have less total read speed, because you're doing more, smaller reads versus one large read. For a replicated pool, you just read the block directly from one OSD and you're done; here you have to read from multiple OSDs, in smaller pieces. So the question is: there's no master metadata, nothing telling the client where all the data is? That's right. All of the storage profiles, whether erasure coded or replicated, are handled by the cluster itself. The client has no knowledge of what the storage profile is; it's all handled on the cluster side. Erasure coding recovery. So this is the worst case, where we have to do all of the math. If some of the K shards have been lost, they need to be regenerated, recovered, using the remaining original shards that still exist plus the M parity shards. This involves inverting some matrices and doing multiplication, so it's computationally expensive. And there are two cases where you're going to end up in this worst case. The first is recovery: the cluster has determined that a hard drive is bad, so you've lost a shard, or maybe you've lost a whole server and the data on it. It's going to go into recovery on its own, and what it has to do is recompute the missing shards and then store them somewhere else, on good hard drives. The other case is reading from a degraded placement group, that is, one that's in the process of recovery.
So it'll handle recovery automatically, but if you try to read something that's being recovered, now you're going to have to do this extra work too. So again, recovery is just something you have to deal with. The worst case is statistically a small amount of the time, but it's something that's going to have to happen. Did you have a question? [Audience question, partly inaudible, about whether the shard size is derived from the device's block size or is a global setting.] Yes, so the question is how the shard size is computed. I don't know the answer for sure at that level, but for each block that's stored, it's split up K different ways, whatever your K is, however you've configured it; that's what we're going to talk about next. I think your question is whether a block of a certain size is going to be split into multiple logical blocks and then sharded. I don't exactly know the answer to that; I want to say no. [Audience follow-up, partly inaudible, suggesting that shards larger than the device's maximum transfer size could interact badly with failures.] Yeah, so the question is whether the shard size has anything to do with the block size of the hard drive, and I don't know the answer on that level. All right, so with erasure coding, you set up your K and M, and you set them up per pool. You get to decide what values of K and M you want to have, and each pool can have different values.
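As a sketch of what per-pool K and M configuration looks like with the ceph CLI (the profile and pool names here are made up, and exact arguments can vary between Ceph releases):

```shell
# Define an erasure-code profile with K=6 data shards and M=2 parity shards,
# spreading shards across separate hosts.
ceph osd erasure-code-profile set ec-6-2 k=6 m=2 crush-failure-domain=host

# Create an erasure-coded pool that uses that profile,
# alongside a replicated pool on the same cluster.
ceph osd pool create backups 64 64 erasure ec-6-2
ceph osd pool create hot-data 64 64 replicated
```

This is how you end up with, say, a dense erasure-coded backup pool and a fast replicated pool sharing the same OSDs, as described above.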
And of course, you can have some pools be erasure coded and some pools be replicated, if you have different classes of data that you're trying to store. You can have some backup data and some high-performance data on the same cluster; you just put them in different pools. You might have one erasure-coded pool with one profile, another erasure-coded pool with a different profile, and one that's replicated, for example. Different pools for different classes of data. So the question is, do disks belong to a single pool? Not typically, no. You may be able to configure it that way; we don't. You'll have multiple pools represented on one disk. So you might have 100 OSDs and, say, five pools running on that cluster; it's not split so that these disks service this pool and those disks service that pool. You may be able to set it up that way, but I don't know if you can, and I don't know why you would want to. The assertion in the question was that the disks will operate differently depending on whether they're erasure coded or replicated, but the disks don't: the OSD process will either erasure code or replicate, and it's just storing blocks on the disk. It's all managed by the software, remember, up at a higher level than the disk. But I suppose it's possible to do that kind of partitioning; I just don't know, I haven't done it. [Audience comment, partly inaudible, comparing these profiles to RAID-Z and RAID 5.] Yeah, the comment is that this is different from ZFS, yes. All right, so how do we select a profile for erasure coding? Basically, how do we select the K and the M? Let's look at them separately and then together. So for the M: M is the number of additional parity shards.
A higher M is going to give you more resilience, but it's going to cost you more disk usage, and it's going to be slower. Remember, M is the number of shards we can lose and still be able to recover, so the more M you have, the more resilience you have. That's all there is to it. But you have to store more data, you're going to use more disk, and it's going to cost you more money. The lower the M, the less resilience and the less disk usage, and it's also going to be faster, because you're writing fewer bytes to disk. So what about K? K is a little more involved. Remember, K is the number of shards that a block is going to be split into. The higher your K, the more shards a block is split into, and the smaller each shard, including the M parity shards, is going to be, because they're all the same size. With a higher K, the data lands on more OSDs, so more blocks are going to be affected by each failure, and by each failure I mean each hard drive failure. You have a hard drive failure, it's going to affect more blocks, so you'll have more things to recover; effectively a shorter mean time before any given block goes into recovery. It will recover, of course; it's just going to have to do more work every time it does. The lower your K, the data is on fewer OSDs and the shards are each bigger, but fewer blocks are going to be affected by each failure: you lose a hard drive, fewer blocks have to go into recovery. With a lower K, though, your M parity shards are going to be larger too. And then of course there's the ratio: the higher the ratio of M to K, the more resilient you are, but the more disk space you're going to use. So let's look at an example. A typical profile, 6,2, is a good starting point. That's K of six, M of two; it's often written that way, just six comma two.
So that means each block is going to be split into six shards, and then two additional parity shards are going to be generated. All eight shards are stored, and a system like this is resilient against the loss of any two shards; that's 25% of the total shards stored. And it's going to use 1.33 times the disk space of the data you're storing: store one K of data and it costs you 1.33 K of disk. That's not too bad, and it's a whole lot better than what triple replication would use, because with triple replication you're going to use three times the block size. 5,3 and 4,2 are also commonly used in Ceph. And of course, when you set this up, you're trading off performance, resilience, and density for your specific workload. If you wanted to, you could go all the way to something like 28 and 2: split up a lot of ways, storing only a very small amount of parity data. And remember what you're guarding against here, since the system is self-healing: the loss of multiple hard drives at the same time, or within the window of recovery. You lose one, you go into recovery, and before that recovery is over you lose another one; then, at another level of recovery, you lose yet another within that window. That's what you're guarding against. Use cases for erasure coding: the use cases here are write-heavy workloads, because writing is faster, since you're writing less data than with a replicated pool. A lot of times these are statistically write-only workloads, which also need density. A backup is a good example: users want to be able to back up their data quickly.
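The overhead and resilience numbers above fall straight out of K and M, so it's easy to compare profiles mechanically. A minimal sketch (the helper name is made up, not a Ceph API):

```python
# Hypothetical helper comparing erasure-coding profiles by the two numbers
# the talk derives: fault tolerance (M) and space overhead ((K + M) / K).

def ec_profile(k: int, m: int) -> dict:
    """Return the storage characteristics of a K,M erasure-coding profile."""
    return {
        "shards_stored": k + m,          # every shard lands on its own OSD
        "max_lost_shards": m,            # any M shards can be lost and recovered
        "space_overhead": (k + m) / k,   # disk used per byte of user data
    }

for k, m in [(6, 2), (5, 3), (4, 2)]:
    p = ec_profile(k, m)
    print(f"{k},{m}: stores {p['shards_stored']} shards, "
          f"tolerates {p['max_lost_shards']} losses, "
          f"{p['space_overhead']:.2f}x disk usage")
```

For 6,2 this gives 8/6 ≈ 1.33x disk usage, versus 3x for triple replication; and it also shows why the hypothetical 6,12 profile mentioned later costs exactly 3x, the same as triple replication.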
And in many cases, they don't really care how slow it is to read their backup; they're really just happy that there's any backup at all. So backup is, statistically speaking, a write-only workload, and write-only workloads are perfect for erasure coding. Another example is scientific applications, where the data is generated and written very quickly and then analyzed offline later, over maybe weeks and months. You have an experiment that runs in a few seconds, you collect tons of data, and then you write research papers for the rest of your life with it. [Audience question, partly inaudible: can you combine erasure coding and replication at the same time for even more resilience, given that erasure coding scales proportionally while replication is a flat three times?] Yeah, so the question is, can you combine erasure coding and replication? If you can, it's not something that I've ever heard of or read about, but you really don't have to. If you wanted some combination, so to speak, you could get the same effect with different values of K and M in your erasure coding. Say you wanted something with the resilience cost of triple replication: what if you had a K of six but an M of twelve? That would use essentially the same amount of disk. It's almost nonsensical to go that far, though. So I think that might answer your question: combining the two is not really something you'd want to do. But what you can do, if you really did want something like that, and this is something we didn't talk about, is have multiple clusters that replicate one another completely.
So you can have one cluster in one data center and another in another data center, and you can set up policy so that everything that's on this one also has to be on that other one over there. And that's what I was saying about failure domains: OSD, server, rack, row, data center, et cetera. [Audience question: in replication, if the primary copy is written incorrectly, say a partial write, and that primary copy is then used to replicate the other two copies, can the other copies still be made correct?] Yeah, so the question is, in replication, when you write the first copy and that copy is what gets replicated to the others, if that write failed or was corrupted, what gets replicated? Well, I haven't looked at the code for this specific thing, but what I would imagine is that what gets written to the other OSDs would be correct, because it's going to write from its original RAM buffer, not from what it reads back off the disk. Maybe at a higher level: there are many different types of hardware failure, and one of the things Ceph does is protect all of the data that's written with strong hashing, for example. So if something gets written incorrectly, you'll know when you try to read it back. Actually, where you'll really notice it is when it does a scrub. You can set up an interval to do a scrub, and when it does, it checks all of the data on the disks, checks all the hashes, and makes sure everything's consistent.
So if you start having failures like that, you can find them, hopefully, in a scrub, but you'll also find them when you try to read back. When you try to read back and the hash is wrong, you know that copy is bad: you mark it bad, go to one of the other copies, and say, okay, now we need to move this data and replicate it again, because now we're down to two, for example. Two minutes, I think I'm being told. Oh, is that what you're saying? Okay. All right, so I'm almost at the end here. Just a little plug for us: the SoftIron advantage of using our hardware for Ceph. SoftIron makes storage clusters for Ceph; we only sell whole clusters. Ceph runs on commodity hardware, like we've talked about, but wouldn't it be better to have a vendor that knows about Ceph, that can talk Ceph with you? An integrated solution for hardware, software, and OS. We find a lot of people come to us after they've tried to do it themselves several different ways: first a white box, second something from Supermicro, and after that something from Dell or HP or whatever, and they still haven't really solved all their problems, and then they're talking to us. We can deliver a completely integrated solution: hardware that's not just tuned but designed from the ground up for your use case. Designed for storage, designed to do the things that Ceph needs to do. We know about having spinning hard disks and SSDs, and we build the systems to have the right balance for that. Things like that. Updates that don't break your system: we have our own Linux distribution, and when you do an update on it, we've tested it running Ceph the way that you're running it.
If you do it yourself, maybe you've got Debian or Ubuntu or something like that, you're pulling Ceph from somewhere else, from upstream, and you end up with a combination of things that hasn't quite been tested together. We're testing it exactly the same way you're running it. We've got administration tools to make things easier, and we're rolling out some pretty exciting stuff this month, so you don't have to be a Ceph guru to do the basic tasks: adding servers, adding drives, doing basic maintenance. We also provide gateway services for common protocols. One of the things we didn't talk about here is that your customers show up and say, well, we have Windows, and we don't have a CephFS driver for our Windows; we need NFS. You have this beautiful distributed system and you don't really want to use NFS on it, but sometimes you need to. NFS, SMB: we have services for implementing all of that in a high-availability way, so you can just turn it on and now you've got high-availability NFS without very much drama. And support that's familiar with your use case: we know what you're trying to do. We can have spares stored on site, hot spares in the rack, and a lot of flexible payment options. Also, we're hiring, so rock stars are welcome; send some email. And that's it for me. I think I've got time for one more question. Go ahead. Is there LDAP integration? I do not know. Yeah, we didn't really talk about authentication in this; that's kind of a whole separate thing. Have we used it in HPC environments, and how is it set up? Yeah, so the question is, is Ceph used in HPC environments, and the answer is yes. Audience examples: CERN, UCSD, plenty of others. Yep, questions? My question is, is there any connection?